Thanks for the update and summary.

Regards,
JB

On Oct 18, 2016, at 20:47, Amit Sela <[email protected]> wrote:
> I wanted to summarize here a couple of important points raised in some PRs
> I was involved with. While those PRs were about KafkaIO and related to the
> Spark/Direct runners, some important notes were made and I think they are
> worth this thread.
>
> *Background*:
> The KafkaIO waits (5 seconds) before starting to read, and (10 millis)
> between advancing the reader. This is problematic for the Spark runner, as
> it might attempt to read (every microbatch) for a shorter period, and so it
> will never start.
> Raghu Angadi mentioned in this conversation
> <https://github.com/apache/incubator-beam/pull/1071> that originally the
> wait period was set to 10 millis for both start() and advance(), but it was
> changed mostly due to DirectRunner issues.
> PR#1125 <https://github.com/apache/incubator-beam/pull/1125> originally
> attempted to allow runners to configure those properties via
> PipelineOptions, but Dan Halperin raised some interesting points:
>
>   - Readers should return as soon as they are able.
>   - Runners may poll advance() in a loop for a certain period of time if
>     it returned too fast.
>   - Runners must tolerate sources that take a long time to start or
>     advance, because real systems operate that way.
>
> Dan suggested (and I clearly agreed) that this should be discussed here.
>
> BTW, Thomas Groh mentioned that the DirectRunner is OK now with much
> shorter wait periods, and PR#1125 now aims to reset the 5-second
> wait-on-start to 10 millis.
> I think this is a good example of Dan's points here: the Kafka reader can
> clearly return much faster (millis, not seconds), and the DirectRunner
> accommodates this now by reusing its readers.
>
> I hope I didn't forget or misunderstand anything (please correct me if I
> did).
>
> Thanks,
> Amit
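
To make Dan's points quoted above a bit more concrete, here is a rough sketch of a runner-side loop that polls advance() for a bounded period, backing off briefly whenever the reader returns quickly with nothing. It is only an illustration under assumptions: SimpleReader, readerOver, readBatch, and the budget/backoff values are hypothetical names invented for this example, and this is neither Beam's UnboundedReader API nor the actual code of any runner.

```java
import java.io.IOException;
import java.time.Duration;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class PollingSketch {

  /** Hypothetical stand-in for a source reader; NOT Beam's actual reader API. */
  interface SimpleReader<T> {
    /** Returns as soon as it can; true if a record is now available. */
    boolean advance() throws IOException;

    T getCurrent();
  }

  /** Hypothetical helper: a reader over an in-memory iterator that never blocks. */
  static <T> SimpleReader<T> readerOver(Iterator<T> source) {
    return new SimpleReader<T>() {
      private T current;

      @Override
      public boolean advance() {
        if (!source.hasNext()) {
          return false; // nothing available right now; return immediately
        }
        current = source.next();
        return true;
      }

      @Override
      public T getCurrent() {
        return current;
      }
    };
  }

  /**
   * Runner-side loop: keeps polling advance() until the time budget is spent
   * or maxRecords have been collected. The waiting lives in the runner, so
   * the reader itself can return as fast as it is able.
   */
  static <T> List<T> readBatch(
      SimpleReader<T> reader, Duration budget, int maxRecords, Duration backoff)
      throws IOException, InterruptedException {
    List<T> out = new ArrayList<>();
    long deadline = System.nanoTime() + budget.toNanos();
    while (out.size() < maxRecords && System.nanoTime() - deadline < 0) {
      if (reader.advance()) {
        out.add(reader.getCurrent());
      } else {
        // advance() returned "too fast" with no data: pause briefly, then retry.
        Thread.sleep(backoff.toMillis());
      }
    }
    return out;
  }

  public static void main(String[] args) throws Exception {
    SimpleReader<String> reader = readerOver(Arrays.asList("a", "b", "c").iterator());
    // Assumed values: a 100 ms batch budget and a 10 ms backoff between empty polls.
    List<String> batch = readBatch(reader, Duration.ofMillis(100), 10, Duration.ofMillis(10));
    System.out.println(batch);
  }
}
```

The loop is bounded only by the runner's own budget, so a reader that takes a long time to start or advance simply yields a smaller (possibly empty) batch instead of needing a tuned wait inside the reader.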
