[jira] [Created] (GEARPUMP-47) display task location on dashboard

2016-04-19 Thread Manu Zhang (JIRA)
Manu Zhang created GEARPUMP-47:
--

 Summary: display task location on dashboard
 Key: GEARPUMP-47
 URL: https://issues.apache.org/jira/browse/GEARPUMP-47
 Project: Apache Gearpump
  Issue Type: Improvement
  Components: Dashboard
Reporter: Manu Zhang
Priority: Minor


Moved from [https://github.com/gearpump/gearpump/issues/1987]. We already have 
task lists in "ExecutorSummary" and the executor location is available, so it 
won't be hard to display task locations on the dashboard. That will help users 
locate task logs when exceptions occur.

{code}
  case class ExecutorSummary(
id: Int,
workerId: Int,
actorPath: String,
logFile: String,
status: String,
taskCount: Int,
tasks: Map[ProcessorId, List[TaskId]],
jvmName: String
  )
{code}
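
A possible way the dashboard could derive the mapping, sketched below as a 
hypothetical TaskLocation index (not existing dashboard code; only 
ExecutorSummary, TaskId and ProcessorId come from the existing API): it joins 
each TaskId in an executor's task list with that executor's id, worker id, 
actor path and log file.

{code}
// Hypothetical sketch, not existing dashboard code: build a TaskId -> location
// index from the ExecutorSummary list that the REST API already returns.
case class TaskLocation(executorId: Int, workerId: Int, actorPath: String, logFile: String)

object TaskLocation {
  def fromExecutors(executors: List[ExecutorSummary]): Map[TaskId, TaskLocation] =
    executors.flatMap { e =>
      e.tasks.values.flatten.map { taskId =>
        taskId -> TaskLocation(e.id, e.workerId, e.actorPath, e.logFile)
      }
    }.toMap
}
{code}

With such an index, clicking a task on the dashboard could link straight to the 
owning executor's log file.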




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (GEARPUMP-46) track serialization / deserialization latency and display on dashboard

2016-04-19 Thread Manu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/GEARPUMP-46?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated GEARPUMP-46:
---
Description: Moved from [https://github.com/gearpump/gearpump/issues/870]. 
Serialization/deserialization can have a big impact on performance, so it would 
be valuable to have such metrics. We need not track every message (probing a 
sample is enough), nor track all the time (probing on demand is enough).  (was: 
serialization / deserialization could have a big impact on performance so it 
would be valuable to have such metrics. We neither need to track all messages 
but probe some of them nor all the time but on demand.)

> track serialization / deserialization latency and display on dashboard
> --
>
> Key: GEARPUMP-46
> URL: https://issues.apache.org/jira/browse/GEARPUMP-46
> Project: Apache Gearpump
>  Issue Type: Improvement
>  Components: Dashboard
>    Reporter: Manu Zhang
>Priority: Minor
>
> moved from [https://github.com/gearpump/gearpump/issues/870]. serialization / 
> deserialization could have a big impact on performance so it would be 
> valuable to have such metrics. We neither need to track all messages but 
> probe some of them nor all the time but on demand.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (GEARPUMP-46) track serialization / deserialization latency and display on dashboard

2016-04-19 Thread Manu Zhang (JIRA)
Manu Zhang created GEARPUMP-46:
--

 Summary: track serialization / deserialization latency and display 
on dashboard
 Key: GEARPUMP-46
 URL: https://issues.apache.org/jira/browse/GEARPUMP-46
 Project: Apache Gearpump
  Issue Type: Improvement
  Components: Dashboard
Reporter: Manu Zhang
Priority: Minor


Serialization/deserialization can have a big impact on performance, so it would 
be valuable to have such metrics. We need not track every message (probing a 
sample is enough), nor track all the time (probing on demand is enough).
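
A minimal sketch of the sampling idea, assuming a hypothetical latency callback 
(e.g. feeding a histogram), a sample rate and a runtime on/off switch; none of 
these names are existing Gearpump APIs:

{code}
// Illustrative sketch only: time one out of every `sampleRate` messages, and
// only while probing is switched on, so the common path stays untouched.
class SerializationProbe(recordNanos: Long => Unit, sampleRate: Int, probingOn: () => Boolean) {
  private var counter = 0L

  def timed[T](msg: T)(serialize: T => Array[Byte]): Array[Byte] = {
    counter += 1
    if (probingOn() && counter % sampleRate == 0) {
      val start = System.nanoTime()
      val bytes = serialize(msg)
      recordNanos(System.nanoTime() - start)
      bytes
    } else {
      serialize(msg)
    }
  }
}
{code}

The same wrapper could be applied around deserialization, so the cost is paid 
only for the sampled messages while probing is enabled, and the results could 
then be reported to the dashboard like other task metrics.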



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (GEARPUMP-40) KafkaSource poor performance

2016-04-19 Thread Manu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/GEARPUMP-40?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated GEARPUMP-40:
---
Description: 
As per 
[https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines],
a single-threaded Kafka consumer could consume at 89.7 MB/s from a 6-partition, 
3-replica topic whose data are evenly distributed on a 3-node GbE cluster. I 
carried out a similar experiment with KafkaSource and found that the throughput 
was only 10 MB/s.
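
For reference, a bare single-threaded consumer throughput probe of the kind 
used in such comparisons could look like the sketch below (hypothetical code 
built on the plain Kafka consumer API; the broker address, topic name and run 
length are made up, and this is not the actual benchmark or KafkaSource 
internals):

{code}
// Hypothetical measurement sketch: drain a topic with one consumer thread for a
// fixed interval and report MB/s, to compare against KafkaSource throughput.
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

object ConsumerThroughputProbe extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "broker1:9092")
  props.put("group.id", "throughput-probe")
  props.put("auto.offset.reset", "earliest")
  props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")

  val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
  consumer.subscribe(List("benchmark-topic").asJava)

  val runMillis = 60 * 1000L
  val start = System.currentTimeMillis()
  var bytes = 0L
  while (System.currentTimeMillis() - start < runMillis) {
    consumer.poll(100).asScala.foreach(record => bytes += record.value().length)
  }
  consumer.close()
  println(f"throughput: ${bytes / 1024.0 / 1024.0 / (runMillis / 1000.0)}%.1f MB/s")
}
{code}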



  was:
As per 
[https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines],
 a single thread Kafka consumer could consume at 89.7MB/s from 6x partition 3x 
replica topic. I carried out a similar experiment with KafkaSource and found 
that the throughput is only at 10MB/s.  




> KafkaSource poor performance
> 
>
> Key: GEARPUMP-40
> URL: https://issues.apache.org/jira/browse/GEARPUMP-40
> Project: Apache Gearpump
>  Issue Type: Improvement
>  Components: kafka
>Affects Versions: 0.8.0
>    Reporter: Manu Zhang
>Assignee: Manu Zhang
>
> As per 
> [https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines],
>  a single thread Kafka consumer could consume at 89.7MB/s from 6x partition 
> 3x replica topic whose data are evenly distributed on a 3-node GbE cluster. I 
> carried out a similar experiment with KafkaSource and found that the 
> throughput is only at 10MB/s.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: VOTE: Accept additional code commit from github.com/gearpump after initial code import

2016-04-16 Thread Manu Zhang
+1


On Sunday, April 17, 2016, Andrew Purtell  wrote:

> Please hold votes open for at least 72 hours.
> http://www.apache.org/foundation/voting.html
>
> *Votes should generally be permitted to run for at least 72 hours to
> provide an opportunity for all concerned persons to participate regardless
> of their geographic locations.*
>
>
>
> On Sat, Apr 16, 2016 at 5:02 AM, Sean Zhong  > wrote:
>
> > This is a call to vote accepting additional code commits from
> > github.com/gearpump after the initial code import to Apache:
> >
> > The code change covered by this vote is between:
> > Start sha1: d5343681edde427022bf4c225ce6022a6904ae88
> > End Sha1: 79e15668c6498a54da691626dd2f91bf205d3720
> >
> > * This vote will be open for at least 48 hours.*
> >
> > [ ] +1 Accept these commits since they are already reviewed by committers
> > in these pull requests.
> > [ ]  0 No opinion
> > [ ] -1 Do not accept these commits because...
> >
> > *Original discussion:*
> >
> >
> > http://mail-archives.apache.org/mod_mbox/incubator-gearpump-dev/201604.mbox/browser
> >
> > *List of changes:*
> >
> > GEARPUMP-11, fix code style
> > https://github.com/gearpump/gearpump/pull/2032
> >
> > GEARPUMP-17, fix KafkaStorage lookup timestamp
> > https://github.com/gearpump/gearpump/pull/2030
> >
> > GEARPUMP-10: Downgrade netty from Netty 4 to Netty 3.8 cause the OAuth2
> > authentication failure
> > https://github.com/gearpump/gearpump/pull/2029
> >
> > GEARPUMP-8, fix "two machines can possibly have same worker id "
> > https://github.com/gearpump/gearpump/pull/2028
> >
> > GEARPUMP-9, Clean and fix integration test
> > https://github.com/gearpump/gearpump/pull/2028
> >
> > GEARPUMP-6: show add/remove worker buttons for admin
> > https://github.com/gearpump/gearpump/pull/2026
> >
> > GEARPUMP-5, Add additional authorization check like checking
> > user-organization for cloudfoundry OAuth2 Authenticator.
> > https://github.com/gearpump/gearpump/pull/2025
> >
> > GEARPUMP-3, Define REST API to add/remove worker instances, which allow us
> > to scale out in YARN
> > https://github.com/gearpump/gearpump/pull/2024
> >
> > GEARPUMP-2, Define REST API to submit job jar
> > https://github.com/gearpump/gearpump/pull/2023
> >
> > fix #1988, upgrade akka to akka 2.4.2
> > https://github.com/gearpump/gearpump/pull/2017
> >
> > fix #2015, do not send AckRequest or LatencyProbe when there is no pending
> > message in the channel
> > https://github.com/gearpump/gearpump/pull/2016
> >
> > fix #1943 allow user to config how many executors to use in an application
> > https://github.com/gearpump/gearpump/pull/1951
> >
> > fix #1641, add exactly-once integration test
> > https://github.com/gearpump/gearpump/pull/2012
> >
> > fix #1318, fix MinClock not updated fast enough for slow stream
> > https://github.com/gearpump/gearpump/pull/1705
> >
> > fix #1981, Support OAuth2 Social login
> > https://github.com/gearpump/gearpump/pull/2005
> >
> > fix #2007, add Java DSL
> > https://github.com/gearpump/gearpump/pull/2008
> >
> > fix #2002, add akka stream examples
> > https://github.com/gearpump/gearpump/pull/2003
> >
>
>
>
> --
> Best regards,
>
>- Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>


[jira] [Updated] (GEARPUMP-33) Message Delivery Guarantee

2016-04-14 Thread Manu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/GEARPUMP-33?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated GEARPUMP-33:
---
Description: 
The original discussions are at 
[https://github.com/gearpump/gearpump/issues/1528] and 
[https://github.com/gearpump/gearpump/issues/354].

When a message flows through a stream processing system, the system will try to 
provide some guarantee on message delivery. From the weakest to the strongest, 
these are:

# At most once delivery
  a message is processed zero or one time. Messages can be lost. 

# At least once delivery
  a message is processed one or more times such that at least one of them 
succeeds. Messages cannot be lost but can be duplicated.

# Exactly once delivery
  a message is processed exactly once. Messages can neither be lost nor 
duplicated.

Gearpump tracks message loss between a sender Task and a receiver Task and 
replays the application on message loss. If the source is TimeReplayable, then 
at-least-once delivery can be guaranteed. In addition, if user state is stored 
through PersistentState API, then exactly-once delivery is guaranteed. 
Otherwise, at-most-once delivery is guaranteed. 

There are several limitations with the current implementation. 

1. If users only require at-most-once delivery, message loss tracking is not 
necessary and we may get better performance without it. 
2. We require the user's data source to be TimeReplayable for 
at-least-once/exactly-once delivery. It would be better if we provided a 
TimeReplayable wrapper when the user's source is not replayable (e.g. Twitter).
3. Further, it would be nice if we allowed users to switch between the different 
guarantees through APIs or the dashboard.

This jira is to gather requirements and ideas from the community and users. The 
real work will be divided into subtasks and committed step by step. 
  

  was:
The original discussions are at 
[https://github.com/gearpump/gearpump/issues/1528] and 
[https://github.com/gearpump/gearpump/issues/354].

When a message flows through a stream processing system, the system will try to 
provide some guarantee on message delivery From the weakest to strongest, there 
are.

# At most once delivery
  a message is processed zero or one times. Messages can be lost. 

# At least once delivery
   a message is processed one or more times such that at least one of them 
succeeds. Messages can not be lost but can be duplicated.

# Exactly once delivery
  a message is processed exactly once. Messages can neither be lost nor 
duplicated.

Gearpump tracks message loss between a sender Task and a receiver Task and 
replays the application on message loss. If the source is TimeReplayable, then 
at-least-once delivery can be guaranteed. In addition, if user state is stored 
through PersistentState API, then exactly-once delivery is guaranteed. 
Otherwise, at-most-once delivery is guaranteed. 

There are several limitations with the current implementation. 

1. If users only require at-most-once delivery, message loss track is not 
necessary and we may get better performance without it. 
2. We require user's data source to be TimeReplayable for 
at-least-once/exactly-once delivery. It would be better if we provide a 
TimeReplayable wrapper when user source is not replayable (e.g. Twitter)
3.  Further, it will be nice if we allow users to switch between the different 
guarantees through APIs or dashboard.

This jira is to gather requirements and ideas from the community and users. The 
real work will be divided into subtasks and committed step by step. 
  


> Message Delivery Guarantee
> --
>
> Key: GEARPUMP-33
> URL: https://issues.apache.org/jira/browse/GEARPUMP-33
> Project: Apache Gearpump
>  Issue Type: Improvement
>  Components: streaming
>    Reporter: Manu Zhang
>
> The original discussions are at 
> [https://github.com/gearpump/gearpump/issues/1528] and 
> [https://github.com/gearpump/gearpump/issues/354].
> When a message flows through a stream processing system, the system will try 
> to provide some guarantee on message delivery From the weakest to strongest, 
> there are.
> # At most once delivery
>   a message is processed zero or one times. Messages can be lost. 
> # At least once delivery
>a message is processed one or more times such that at least one of them 
> succeeds. Messages can not be lost but can be duplicated.
> # Exactly once delivery
>   a message is processed exactly once. Messages can neither be lost nor 
> duplicated.
> Gearpump tracks message loss between a sender Task and a receiver Task and 
> replays the application on message loss. If the source is TimeReplayable, 
> then at-least-once delivery can be guaranteed. In addition, if user state is 
> stored through PersistentState API, then ex

[jira] [Created] (GEARPUMP-33) Message Delivery Guarantee

2016-04-14 Thread Manu Zhang (JIRA)
Manu Zhang created GEARPUMP-33:
--

 Summary: Message Delivery Guarantee
 Key: GEARPUMP-33
 URL: https://issues.apache.org/jira/browse/GEARPUMP-33
 Project: Apache Gearpump
  Issue Type: Improvement
  Components: streaming
Reporter: Manu Zhang


The original discussions are at 
[https://github.com/gearpump/gearpump/issues/1528] and 
[https://github.com/gearpump/gearpump/issues/354].

When a message flows through a stream processing system, the system will try to 
provide some guarantee on message delivery. From the weakest to the strongest, 
these are:

# At most once delivery
  a message is processed zero or one time. Messages can be lost. 

# At least once delivery
  a message is processed one or more times such that at least one of them 
succeeds. Messages cannot be lost but can be duplicated.

# Exactly once delivery
  a message is processed exactly once. Messages can neither be lost nor 
duplicated.

Gearpump tracks message loss between a sender Task and a receiver Task and 
replays the application on message loss. If the source is TimeReplayable, then 
at-least-once delivery can be guaranteed. In addition, if user state is stored 
through PersistentState API, then exactly-once delivery is guaranteed. 
Otherwise, at-most-once delivery is guaranteed. 

There are several limitations with the current implementation. 

1. If users only require at-most-once delivery, message loss tracking is not 
necessary and we may get better performance without it. 
2. We require the user's data source to be TimeReplayable for 
at-least-once/exactly-once delivery. It would be better if we provided a 
TimeReplayable wrapper when the user's source is not replayable (e.g. Twitter).
3. Further, it would be nice if we allowed users to switch between the different 
guarantees through APIs or the dashboard.

This jira is to gather requirements and ideas from the community and users. The 
real work will be divided into subtasks and committed step by step. 
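
The guarantee matrix described above boils down to roughly the following 
(an illustrative sketch; these names are not Gearpump APIs):

{code}
// Illustrative summary of the current behaviour described in this issue.
sealed trait DeliveryGuarantee
case object AtMostOnce  extends DeliveryGuarantee
case object AtLeastOnce extends DeliveryGuarantee
case object ExactlyOnce extends DeliveryGuarantee

object DeliveryGuarantee {
  def current(sourceIsTimeReplayable: Boolean, statePersisted: Boolean): DeliveryGuarantee =
    (sourceIsTimeReplayable, statePersisted) match {
      case (true, true)  => ExactlyOnce  // replayable source + state via PersistentState API
      case (true, false) => AtLeastOnce  // replay on message loss, but state may double-count
      case (false, _)    => AtMostOnce   // lost messages cannot be replayed
    }
}
{code}

Limitation 3 above is essentially about making this choice explicit and 
switchable by users rather than implicit in how the application is written.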
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (GEARPUMP-24) refactor DataSource API

2016-04-14 Thread Manu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/GEARPUMP-24?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242429#comment-15242429
 ] 

Manu Zhang edited comment on GEARPUMP-24 at 4/15/16 4:50 AM:
-

Hi [~whjiang],

For KafkaSource, we already have an internal queue for messages asynchronously 
fetched from Kafka. I would rather let users access that internal queue directly 
than copy messages into another in-memory queue. This may apply to other sources 
as well, and I think an iterator-like interface is more general and safer.

Hi [~clockfly],

Users still do a batch of reads in one Task invocation and data sources are 
free to pull in data in batch or not.  I'll perform a benchmark on KafkaSource. 


was (Author: mauzhang):
Hi [~whjiang],

For KafkaSource, we already have an internal queue for messages asyncly fetched 
from Kafka. I feel more like allowing users to access the internal queue 
directly than copy messages to another in memory queue.  This may be applied to 
other sources.

Hi [~clockfly],

Users still do a batch of reads in one Task invocation and data sources are 
free to pull in data in batch or not.  I'll perform a benchmark on KafkaSource. 

> refactor DataSource API
> ---
>
> Key: GEARPUMP-24
> URL: https://issues.apache.org/jira/browse/GEARPUMP-24
> Project: Apache Gearpump
>  Issue Type: Improvement
>  Components: streaming
>Affects Versions: 0.8.0
>    Reporter: Manu Zhang
>Assignee: Manu Zhang
> Fix For: 0.8.1
>
>
> From [https://github.com/gearpump/gearpump/issues/2013]:
> The current DataSource API
> {code}
> trait DataSource extends java.io.Serializable {
>   /**
>* open connection to data source
>* invoked in onStart() method of 
> [[io.gearpump.streaming.source.DataSourceTask]]
>* @param context is the task context at runtime
>* @param startTime is the start time of system
>*/
>   def open(context: TaskContext, startTime: Option[TimeStamp]): Unit
>   /**
>* read a number of messages from data source.
>* invoked in each onNext() method of 
> [[io.gearpump.streaming.source.DataSourceTask]]
>* @param batchSize max number of messages to read
>* @return a list of messages wrapped in [[io.gearpump.Message]]
>*/
>   def read(batchSize: Int): List[Message]
>   /**
>* close connection to data source.
>* invoked in onStop() method of 
> [[io.gearpump.streaming.source.DataSourceTask]]
>*/
>   def close(): Unit
> }
> {code}
> has several issues
> 1. read returns a scala list of Message which is unfriendly to Java 
> DataSources. Same for Option parameter in open
> 2. the number of read messages may not be the same as the passed in batchSize 
> which leaves uncertainty to users (users may access out of boundary list 
> positions)
> 3. to return a list an extra buffer could be needed in read (e.g. 
> KafkaSource) which is not best for performance
> Update:
> I'd like to add a "getWatermark" interface that helps to fix GEARPUMP-32



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (GEARPUMP-24) refactor DataSource API

2016-04-14 Thread Manu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/GEARPUMP-24?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242429#comment-15242429
 ] 

Manu Zhang edited comment on GEARPUMP-24 at 4/15/16 4:47 AM:
-

Hi [~whjiang],

For KafkaSource, we already have an internal queue for messages asynchronously 
fetched from Kafka. I would rather let users access that internal queue directly 
than copy messages into another in-memory queue. This may apply to other sources 
as well.

Hi [~clockfly],

Users still do a batch of reads in one Task invocation and data sources are 
free to pull in data in batch or not.  I'll perform a benchmark on KafkaSource. 


was (Author: mauzhang):
[~whjiang] 

For KafkaSource, we already have an internal queue for messages asyncly fetched 
from Kafka. I feel more like allowing users to access the internal queue 
directly than copy messages to another in memory queue.  This may be applied to 
other sources.

[~clockfly]

Users still do a batch of reads in one Task invocation and data sources are 
free to pull in data in batch or not.  I'll perform a benchmark on KafkaSource. 

> refactor DataSource API
> ---
>
> Key: GEARPUMP-24
> URL: https://issues.apache.org/jira/browse/GEARPUMP-24
> Project: Apache Gearpump
>  Issue Type: Improvement
>  Components: streaming
>Affects Versions: 0.8.0
>    Reporter: Manu Zhang
>Assignee: Manu Zhang
> Fix For: 0.8.1
>
>
> From [https://github.com/gearpump/gearpump/issues/2013]:
> The current DataSource API
> {code}
> trait DataSource extends java.io.Serializable {
>   /**
>* open connection to data source
>* invoked in onStart() method of 
> [[io.gearpump.streaming.source.DataSourceTask]]
>* @param context is the task context at runtime
>* @param startTime is the start time of system
>*/
>   def open(context: TaskContext, startTime: Option[TimeStamp]): Unit
>   /**
>* read a number of messages from data source.
>* invoked in each onNext() method of 
> [[io.gearpump.streaming.source.DataSourceTask]]
>* @param batchSize max number of messages to read
>* @return a list of messages wrapped in [[io.gearpump.Message]]
>*/
>   def read(batchSize: Int): List[Message]
>   /**
>* close connection to data source.
>* invoked in onStop() method of 
> [[io.gearpump.streaming.source.DataSourceTask]]
>*/
>   def close(): Unit
> }
> {code}
> has several issues
> 1. read returns a scala list of Message which is unfriendly to Java 
> DataSources. Same for Option parameter in open
> 2. the number of read messages may not be the same as the passed in batchSize 
> which leaves uncertainty to users (users may access out of boundary list 
> positions)
> 3. to return a list an extra buffer could be needed in read (e.g. 
> KafkaSource) which is not best for performance
> Update:
> I'd like to add a "getWatermark" interface that helps to fix GEARPUMP-32



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (GEARPUMP-24) refactor DataSource API

2016-04-14 Thread Manu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/GEARPUMP-24?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242429#comment-15242429
 ] 

Manu Zhang commented on GEARPUMP-24:


[~whjiang] 

For KafkaSource, we already have an internal queue for messages asynchronously 
fetched from Kafka. I would rather let users access that internal queue directly 
than copy messages into another in-memory queue. This may apply to other sources 
as well.

[~clockfly]

Users still do a batch of reads in one Task invocation and data sources are 
free to pull in data in batch or not.  I'll perform a benchmark on KafkaSource. 

> refactor DataSource API
> ---
>
> Key: GEARPUMP-24
> URL: https://issues.apache.org/jira/browse/GEARPUMP-24
> Project: Apache Gearpump
>  Issue Type: Improvement
>  Components: streaming
>Affects Versions: 0.8.0
>    Reporter: Manu Zhang
>Assignee: Manu Zhang
> Fix For: 0.8.1
>
>
> From [https://github.com/gearpump/gearpump/issues/2013]:
> The current DataSource API
> {code}
> trait DataSource extends java.io.Serializable {
>   /**
>* open connection to data source
>* invoked in onStart() method of 
> [[io.gearpump.streaming.source.DataSourceTask]]
>* @param context is the task context at runtime
>* @param startTime is the start time of system
>*/
>   def open(context: TaskContext, startTime: Option[TimeStamp]): Unit
>   /**
>* read a number of messages from data source.
>* invoked in each onNext() method of 
> [[io.gearpump.streaming.source.DataSourceTask]]
>* @param batchSize max number of messages to read
>* @return a list of messages wrapped in [[io.gearpump.Message]]
>*/
>   def read(batchSize: Int): List[Message]
>   /**
>* close connection to data source.
>* invoked in onStop() method of 
> [[io.gearpump.streaming.source.DataSourceTask]]
>*/
>   def close(): Unit
> }
> {code}
> has several issues
> 1. read returns a scala list of Message which is unfriendly to Java 
> DataSources. Same for Option parameter in open
> 2. the number of read messages may not be the same as the passed in batchSize 
> which leaves uncertainty to users (users may access out of boundary list 
> positions)
> 3. to return a list an extra buffer could be needed in read (e.g. 
> KafkaSource) which is not best for performance
> Update:
> I'd like to add a "getWatermark" interface that helps to fix GEARPUMP-32



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (GEARPUMP-24) refactor DataSource API

2016-04-14 Thread Manu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/GEARPUMP-24?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated GEARPUMP-24:
---
Description: 
From [https://github.com/gearpump/gearpump/issues/2013]:

The current DataSource API

{code}
trait DataSource extends java.io.Serializable {

  /**
   * open connection to data source
   * invoked in onStart() method of 
[[io.gearpump.streaming.source.DataSourceTask]]
   * @param context is the task context at runtime
   * @param startTime is the start time of system
   */
  def open(context: TaskContext, startTime: Option[TimeStamp]): Unit

  /**
   * read a number of messages from data source.
   * invoked in each onNext() method of 
[[io.gearpump.streaming.source.DataSourceTask]]
   * @param batchSize max number of messages to read
   * @return a list of messages wrapped in [[io.gearpump.Message]]
   */
  def read(batchSize: Int): List[Message]

  /**
   * close connection to data source.
   * invoked in onStop() method of 
[[io.gearpump.streaming.source.DataSourceTask]]
   */
  def close(): Unit
}
{code}

has several issues:

1. read returns a Scala List of Message, which is unfriendly to Java 
DataSources; the same goes for the Option parameter of open.
2. the number of messages read may not equal the batchSize passed in, which 
leaves uncertainty to users (they may access out-of-bounds list positions).
3. returning a list may require an extra buffer in read (e.g. in KafkaSource), 
which is not ideal for performance.

Update:

I'd like to add a "getWatermark" interface that helps to fix GEARPUMP-32
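
One possible shape for the iterator-style replacement discussed in the comments, 
sketched with hypothetical signatures (TaskContext and Message are from the 
existing API; the plain Long startTime, the null-returning read() and the 
getWatermark return type are assumptions, not the final design):

{code}
// Hypothetical sketch of an iterator-like DataSource addressing the issues above:
// read() returns at most one message per call, so no Scala List or batchSize
// contract leaks into the user-facing (and Java-facing) interface.
trait DataSource extends java.io.Serializable {

  /** open connection to the data source; a plain Long start time is Java friendly */
  def open(context: TaskContext, startTime: Long): Unit

  /** return the next message, or null when none is currently available */
  def read(): Message

  /** close connection to the data source */
  def close(): Unit

  /** minimum timestamp of messages yet to be emitted; intended to help with GEARPUMP-32 */
  def getWatermark: Long
}
{code}

A KafkaSource could then serve read() straight from its internal fetch queue 
instead of copying a batch into an extra list.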

  was:
From [https://github.com/gearpump/gearpump/issues/2013]:

The current DataSource API

{code}
trait DataSource extends java.io.Serializable {

  /**
   * open connection to data source
   * invoked in onStart() method of 
[[io.gearpump.streaming.source.DataSourceTask]]
   * @param context is the task context at runtime
   * @param startTime is the start time of system
   */
  def open(context: TaskContext, startTime: Option[TimeStamp]): Unit

  /**
   * read a number of messages from data source.
   * invoked in each onNext() method of 
[[io.gearpump.streaming.source.DataSourceTask]]
   * @param batchSize max number of messages to read
   * @return a list of messages wrapped in [[io.gearpump.Message]]
   */
  def read(batchSize: Int): List[Message]

  /**
   * close connection to data source.
   * invoked in onStop() method of 
[[io.gearpump.streaming.source.DataSourceTask]]
   */
  def close(): Unit
}
{code}

has several issues

1. read returns a scala list of Message which is unfriendly to Java 
DataSources. Same for Option parameter in open
2. the number of read messages may not be the same as the passed in batchSize 
which leaves uncertainty to users (users may access out of boundary list 
positions)
3. to return a list an extra buffer could be needed in read (e.g. KafkaSource) 
which is not best for performance


> refactor DataSource API
> ---
>
> Key: GEARPUMP-24
> URL: https://issues.apache.org/jira/browse/GEARPUMP-24
> Project: Apache Gearpump
>  Issue Type: Improvement
>  Components: streaming
>Affects Versions: 0.8.0
>    Reporter: Manu Zhang
>Assignee: Manu Zhang
> Fix For: 0.8.1
>
>
> From [https://github.com/gearpump/gearpump/issues/2013]:
> The current DataSource API
> {code}
> trait DataSource extends java.io.Serializable {
>   /**
>* open connection to data source
>* invoked in onStart() method of 
> [[io.gearpump.streaming.source.DataSourceTask]]
>* @param context is the task context at runtime
>* @param startTime is the start time of system
>*/
>   def open(context: TaskContext, startTime: Option[TimeStamp]): Unit
>   /**
>* read a number of messages from data source.
>* invoked in each onNext() method of 
> [[io.gearpump.streaming.source.DataSourceTask]]
>* @param batchSize max number of messages to read
>* @return a list of messages wrapped in [[io.gearpump.Message]]
>*/
>   def read(batchSize: Int): List[Message]
>   /**
>* close connection to data source.
>* invoked in onStop() method of 
> [[io.gearpump.streaming.source.DataSourceTask]]
>*/
>   def close(): Unit
> }
> {code}
> has several issues
> 1. read returns a scala list of Message which is unfriendly to Java 
> DataSources. Same for Option parameter in open
> 2. the number of read messages may not be the same as the passed in batchSize 
> which leaves uncertainty to users (users may access out of boundary list 
> positions)
> 3. to return a list an extra buffer could be needed in read (e.g. 
> KafkaSource) which is not best for performance
> Update:
> I'd like to add a "getWatermark" interface that helps to fix GEARPUMP-32



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (GEARPUMP-32) Minimum clock of source Tasks may be inaccurate

2016-04-14 Thread Manu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/GEARPUMP-32?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated GEARPUMP-32:
---
Description: 
Moved from [https://github.com/gearpump/gearpump/issues/1835] and reported by 
[Zhu Yueqian|https://github.com/yueqianzhu]

{quote}
Source tasks have not any upstreamClocks. So, startClock is the minimum of 
pending clocks when recover happen.

eg below:
source task1: timeStamp:15,not ACK, minClockValue maybe is 15(<= 15).
source task2: timeStamp:10,ACKed, minClockValue maybe is Long.MaxValue
when recover happen,startClock maybe is 15. where is the data between 10 to 15 
at task2?
{quote}

More context on this issue:

In Gearpump, we maintain a global minimum clock, tracked from message timestamps 
across all tasks. It means that messages with timestamps before this clock have 
all been processed. An application will restart from this value on failure, and 
thus at-least-once message delivery can be guaranteed. 

The global minimum clock is the lower bound of all the Tasks' minimum clocks. 
For a task, the minimum clock is the lower of 

# upstream minimum clock
# a. the minimum timestamp of unacked messages
  b. Long.MaxValue if all messages have been acked.

Note that 2.b allows the global minimum clock to progress, and it is almost safe 
since the clock is also bounded by the upstream minimum clock. I said "almost 
safe" because a source task has no upstream, and we assume its upstream minimum 
clock is Long.MaxValue. Thus, the scenario described by Zhu Yueqian could happen 
and break the at-least-once guarantee. 





  was:
Moved from [https://github.com/gearpump/gearpump/issues/1835] and reported by 
[Zhu Yueqian|https://github.com/yueqianzhu]

{quote}
Source tasks have not any upstreamClocks. So, startClock is the minimum of 
pending clocks when recover happen.

eg below:
source task1: timeStamp:15,not ACK, minClockValue maybe is 15(<= 15).
source task2: timeStamp:10,ACKed, minClockValue maybe is Long.MaxValue
when recover happen,startClock maybe is 15. where is the data between 10 to 15 
at task2?
{quote}

More context on this issue:

In Gearpump, we maintain a global minimum clock tracked from a message's 
timestamp across all tasks. It means messages with timestamp before this clock 
have all been processed. An application will restart from this value on 
failure, and thus at-least-once message delivery could be guaranteed. 

The global minimum clock is the lower bound of all the Tasks' minimum clocks. 
For a task, the minimum clock is the lower of 

  1. upstream minimum clock
  2. a. the minimum timestamp of unacked messages
  b. Long.MaxValue if all messages have been acked.
 
Note that 2.b allows the global minimum clock to progress and it is almost safe 
since the clock is also bounded by the upstream minimum clock. I said "almost 
safe" because a source task has no upstream but we assume the upstream minimum 
clock is Long.MaxValue. Thus, the scenario described by Zhu Yueqian could 
happen and breaks at-least-once guarantee. 






> Minimum clock of source Tasks may be inaccurate
> --
>
> Key: GEARPUMP-32
> URL: https://issues.apache.org/jira/browse/GEARPUMP-32
> Project: Apache Gearpump
>  Issue Type: Bug
>  Components: streaming
>    Affects Versions: 0.8.0
>    Reporter: Manu Zhang
>Assignee: Manu Zhang
>
> Moved from [https://github.com/gearpump/gearpump/issues/1835] and reported by 
> [Zhu Yueqian|https://github.com/yueqianzhu]
> {quote}
> Source tasks have not any upstreamClocks. So, startClock is the minimum of 
> pending clocks when recover happen.
> eg below:
> source task1: timeStamp:15,not ACK, minClockValue maybe is 15(<= 15).
> source task2: timeStamp:10,ACKed, minClockValue maybe is Long.MaxValue
> when recover happen,startClock maybe is 15. where is the data between 10 to 
> 15 at task2?
> {quote}
> More context on this issue:
> In Gearpump, we maintain a global minimum clock tracked from a message's 
> timestamp across all tasks. It means messages with timestamp before this 
> clock have all been processed. An application will restart from this value on 
> failure, and thus at-least-once message delivery could be guaranteed. 
> The global minimum clock is the lower bound of all the Tasks' minimum clocks. 
> For a task, the minimum clock is the lower of 
> # upstream minimum clock
> # a. the minimum timestamp of unacked messages
>b. Long.MaxValue if all messages have been acked.
>  
> Note that 2.b allows the global minimum clock to progress and it is almost 
> safe since the clock is also bounded by the upstream minimum clock. I said 
> "almost safe" because a source task has no upstream but we

[jira] [Created] (GEARPUMP-32) Minimum clock of source Tasks may be inaccurate

2016-04-14 Thread Manu Zhang (JIRA)
Manu Zhang created GEARPUMP-32:
--

 Summary: Minimum clock of source Tasks may be inaccurate
 Key: GEARPUMP-32
 URL: https://issues.apache.org/jira/browse/GEARPUMP-32
 Project: Apache Gearpump
  Issue Type: Bug
  Components: streaming
Affects Versions: 0.8.0
Reporter: Manu Zhang
Assignee: Manu Zhang


Moved from [https://github.com/gearpump/gearpump/issues/1835] and reported by 
[Zhu Yueqian|https://github.com/yueqianzhu]

{quote}
Source tasks have not any upstreamClocks. So, startClock is the minimum of 
pending clocks when recover happen.

eg below:
source task1: timeStamp:15,not ACK, minClockValue maybe is 15(<= 15).
source task2: timeStamp:10,ACKed, minClockValue maybe is Long.MaxValue
when recover happen,startClock maybe is 15. where is the data between 10 to 15 
at task2?
{quote}

More context on this issue:

In Gearpump, we maintain a global minimum clock, tracked from message timestamps 
across all tasks. It means that messages with timestamps before this clock have 
all been processed. An application will restart from this value on failure, and 
thus at-least-once message delivery can be guaranteed. 

The global minimum clock is the lower bound of all the Tasks' minimum clocks. 
For a task, the minimum clock is the lower of 

  1. upstream minimum clock
  2. a. the minimum timestamp of unacked messages
     b. Long.MaxValue if all messages have been acked.

Note that 2.b allows the global minimum clock to progress, and it is almost safe 
since the clock is also bounded by the upstream minimum clock. I said "almost 
safe" because a source task has no upstream, and we assume its upstream minimum 
clock is Long.MaxValue. Thus, the scenario described by Zhu Yueqian could happen 
and break the at-least-once guarantee. 
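
The clock rule described above can be written out as a small sketch 
(hypothetical helper, not actual scheduler code), which also shows how the 
reported scenario arises:

{code}
// Illustrative sketch: a task's minimum clock is the lower of its upstream
// minimum clock and the minimum timestamp of its unacked messages
// (Long.MaxValue once everything has been acked).
object MinClockSketch {
  def taskMinClock(upstreamMinClock: Long, unackedTimestamps: Seq[Long]): Long = {
    val pendingClock =
      if (unackedTimestamps.isEmpty) Long.MaxValue else unackedTimestamps.min
    math.min(upstreamMinClock, pendingClock)
  }

  // A source task has no upstream, so Long.MaxValue is assumed for it:
  val task1 = taskMinClock(Long.MaxValue, Seq(15L))        // 15: message at t=15 not acked
  val task2 = taskMinClock(Long.MaxValue, Seq.empty[Long]) // Long.MaxValue: all acked
  val startClock = math.min(task1, task2)                  // 15, so task2's data between 10 and 15 is skipped on recovery
}
{code}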







--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (GEARPUMP-24) refactor DataSource API

2016-04-14 Thread Manu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/GEARPUMP-24?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15234390#comment-15234390
 ] 

Manu Zhang edited comment on GEARPUMP-24 at 4/15/16 3:01 AM:
-

PR is available at [https://github.com/gearpump/gearpump/pull/2014].

Copy [~clockfly]'s comment here:

bq. Looks you removed batching. Generally batching have performance advantage, 
have you done some micro-benchmark on this?


was (Author: mauzhang):
PR is available at [https://github.com/gearpump/gearpump/pull/2014]

> refactor DataSource API
> ---
>
> Key: GEARPUMP-24
> URL: https://issues.apache.org/jira/browse/GEARPUMP-24
> Project: Apache Gearpump
>  Issue Type: Improvement
>  Components: streaming
>Affects Versions: 0.8.0
>    Reporter: Manu Zhang
>Assignee: Manu Zhang
> Fix For: 0.8.1
>
>
> From [https://github.com/gearpump/gearpump/issues/2013]:
> The current DataSource API
> {code}
> trait DataSource extends java.io.Serializable {
>   /**
>* open connection to data source
>* invoked in onStart() method of 
> [[io.gearpump.streaming.source.DataSourceTask]]
>* @param context is the task context at runtime
>* @param startTime is the start time of system
>*/
>   def open(context: TaskContext, startTime: Option[TimeStamp]): Unit
>   /**
>* read a number of messages from data source.
>* invoked in each onNext() method of 
> [[io.gearpump.streaming.source.DataSourceTask]]
>* @param batchSize max number of messages to read
>* @return a list of messages wrapped in [[io.gearpump.Message]]
>*/
>   def read(batchSize: Int): List[Message]
>   /**
>* close connection to data source.
>* invoked in onStop() method of 
> [[io.gearpump.streaming.source.DataSourceTask]]
>*/
>   def close(): Unit
> }
> {code}
> has several issues
> 1. read returns a scala list of Message which is unfriendly to Java 
> DataSources. Same for Option parameter in open
> 2. the number of read messages may not be the same as the passed in batchSize 
> which leaves uncertainty to users (users may access out of boundary list 
> positions)
> 3. to return a list an extra buffer could be needed in read (e.g. 
> KafkaSource) which is not best for performance



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (GEARPUMP-1) Gearpump bootup

2016-03-29 Thread Manu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/GEARPUMP-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217278#comment-15217278
 ] 

Manu Zhang commented on GEARPUMP-1:
---

1. We also have a Commit list, comm...@gearpump.incubator.apache.org 
2. How about the documentation site?

> Gearpump bootup
> ---
>
> Key: GEARPUMP-1
> URL: https://issues.apache.org/jira/browse/GEARPUMP-1
> Project: Apache Gearpump
>  Issue Type: New Feature
>Reporter: Sean Zhong
>Assignee: Sean Zhong
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Here are some initial steps that need tracking. Welcome to add more in this 
> list:
> 1. Sign the SGA and CCLA, \[Done\]
> 2. Setup Git repo at git://git.apache.org/incubator-gearpump.git \[Open\]
> 3. Setup Github integration. \[Open\]
> 4. Import initial source code of SGA donation at commits 
> d5343681edde427022bf4c225ce6022a6904ae88 to the Apache git repo. \[Open\]
> 5. Merge additional code change after commits 
> d5343681edde427022bf4c225ce6022a6904ae88 to the Apache git repo. \[Open\]
> 6. Setup the OSS Sonatype repo with new artifact org.apache.gearpump \[Open\]
> 7. Setup mail lists, User, Dev, and Private \[Done\]
>   - Subscribe link to User list: 
> [mailto:user-subscr...@gearpump.incubator.apache.org]
>   - Subscribe link to Dev list: 
> [mailto:dev-subscr...@gearpump.incubator.apache.org]
>   - Subscribe link to Private list: 
> [mailto:private-subscr...@gearpump.incubator.apache.org]
>   
> 8. If using the Apache Jira for issue tracking, and Github integration for 
> code contribution, 
> we need to write a How-To to document how to make contributions. \[Open\]
> 9. Import issues from [github 
> issue|https://github.com/gearpump/gearpump/issues] to Apache Jira. \[Open\]
> 10. Fix the code style by following [Style 
> Guide|https://github.com/gearpump/gearpump/issues/2011] \[Open\]
> 11. Make the first release with new artifact org.apache.gearpump  \[Open\]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

