[ https://issues.apache.org/jira/browse/GEARPUMP-24?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242429#comment-15242429 ]
Manu Zhang edited comment on GEARPUMP-24 at 4/15/16 4:50 AM: ------------------------------------------------------------- Hi [~whjiang], For KafkaSource, we already have an internal queue for messages asyncly fetched from Kafka. I feel more like allowing users to access the internal queue directly than copy messages to another in memory queue. This may be applied to other sources and I think an iterator like interface is more general and safer. Hi [~clockfly], Users still do a batch of reads in one Task invocation and data sources are free to pull in data in batch or not. I'll perform a benchmark on KafkaSource. was (Author: mauzhang): Hi [~whjiang], For KafkaSource, we already have an internal queue for messages asyncly fetched from Kafka. I feel more like allowing users to access the internal queue directly than copy messages to another in memory queue. This may be applied to other sources. Hi [~clockfly], Users still do a batch of reads in one Task invocation and data sources are free to pull in data in batch or not. I'll perform a benchmark on KafkaSource. > refactor DataSource API > ----------------------- > > Key: GEARPUMP-24 > URL: https://issues.apache.org/jira/browse/GEARPUMP-24 > Project: Apache Gearpump > Issue Type: Improvement > Components: streaming > Affects Versions: 0.8.0 > Reporter: Manu Zhang > Assignee: Manu Zhang > Fix For: 0.8.1 > > > From [https://github.com/gearpump/gearpump/issues/2013]: > The current DataSource API > {code} > trait DataSource extends java.io.Serializable { > /** > * open connection to data source > * invoked in onStart() method of > [[io.gearpump.streaming.source.DataSourceTask]] > * @param context is the task context at runtime > * @param startTime is the start time of system > */ > def open(context: TaskContext, startTime: Option[TimeStamp]): Unit > /** > * read a number of messages from data source. > * invoked in each onNext() method of > [[io.gearpump.streaming.source.DataSourceTask]] > * @param batchSize max number of messages to read > * @return a list of messages wrapped in [[io.gearpump.Message]] > */ > def read(batchSize: Int): List[Message] > /** > * close connection to data source. > * invoked in onStop() method of > [[io.gearpump.streaming.source.DataSourceTask]] > */ > def close(): Unit > } > {code} > has several issues > 1. read returns a scala list of Message which is unfriendly to Java > DataSources. Same for Option parameter in open > 2. the number of read messages may not be the same as the passed in batchSize > which leaves uncertainty to users (users may access out of boundary list > positions) > 3. to return a list an extra buffer could be needed in read (e.g. > KafkaSource) which is not best for performance > Update: > I'd like to add a "getWatermark" interface that helps to fix GEARPUMP-32 -- This message was sent by Atlassian JIRA (v6.3.4#6332)