One way I would rephrase the concern from unbounded source developer point of view :
- What is the recommended blocking behavior for start() and advance()? E.g. on one extreme should they wait till there is a record? Mostly this will be bad. I am glad at pull/1125 <https://github.com/apache/incubator-beam/pull/1125> is reverting the timeout back to 10millis. This is much simpler for KafkaIO and hopefully give enough flexibility for runner to interact with such sources optimally. I like the three rules of thumb mentioned below. I think something like that should be in JavaDoc for 'advance()'. On Tue, Oct 18, 2016 at 11:46 AM, Amit Sela <[email protected]> wrote: > I wanted to summarize here a couple of important points raised in some PRs > I was involved with, and while those PRs were about KafkaIO and related to > the Spark/Direct runners, some important notes were made and I think they > are worth this thread. > > *Background*: > The KafkaIO waits (5 seconds) before starting to read, and (10 millis) > between advancing the reader, which is problematic for the Spark runner as > it might attempt to read (every microbatch) for a shorter period, and so it > will never start. > Raghu Angadi mentioned in this conversation > <https://github.com/apache/incubator-beam/pull/1071> that originally the > wait period was set to 10 millis for both start() and advance(), but it was > changed mostly due to DirectRunner issues. > PR#1125 <https://github.com/apache/incubator-beam/pull/1125> originally > attempted to allow runners to configure those properties via > PipelineOptions, but Dan Halperin raised some interesting points: > > - Readers should return as soon as they are able. > - Runners may poll advance() in a loop for a certain period of time if > it returned too fast. > - Runners must tolerate sources that take a long time to start or > advance, because real systems operate that way. > > Dan suggested (and I clearly agreed) that this should be discussed here. > > BTW, Thomas Groh mentioned that the DirectRunner is OK now with much > shorter wait periods, and PR#1125 now aims to re-set the 5 seconds > wait-on-start to 10 millis. > I think this is a good example of Dan's points here: Kafka reader can > clearly return much faster (millis not seconds) and the DirectRunner > accommodates this now by reusing it's readers. > > I Hope I didn't forget/misunderstood anything (please correct me if I did). > > Thanks, > Amit >
