Eventually we'll be able to communicate intent with the runner much more directly via the ProcessContinuation object:
https://github.com/apache/incubator-beam/blob/a0f649eaca8d8bd47d22db0ba7150fea1bf07975/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/DoFn.java#L658 On Tue, Oct 18, 2016 at 12:44 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote: > Thanks for the update and summary. > > Regards > JB > > > > On Oct 18, 2016, 20:47, at 20:47, Amit Sela <amitsel...@gmail.com> wrote: >>I wanted to summarize here a couple of important points raised in some >>PRs >>I was involved with, and while those PRs were about KafkaIO and related >>to >>the Spark/Direct runners, some important notes were made and I think >>they >>are worth this thread. >> >>*Background*: >>The KafkaIO waits (5 seconds) before starting to read, and (10 millis) >>between advancing the reader, which is problematic for the Spark runner >>as >>it might attempt to read (every microbatch) for a shorter period, and >>so it >>will never start. >>Raghu Angadi mentioned in this conversation >><https://github.com/apache/incubator-beam/pull/1071> that originally >>the >>wait period was set to 10 millis for both start() and advance(), but it >>was >>changed mostly due to DirectRunner issues. >>PR#1125 <https://github.com/apache/incubator-beam/pull/1125> originally >>attempted to allow runners to configure those properties via >>PipelineOptions, but Dan Halperin raised some interesting points: >> >> - Readers should return as soon as they are able. >> - Runners may poll advance() in a loop for a certain period of time if >> it returned too fast. >> - Runners must tolerate sources that take a long time to start or >> advance, because real systems operate that way. >> >>Dan suggested (and I clearly agreed) that this should be discussed >>here. >> >>BTW, Thomas Groh mentioned that the DirectRunner is OK now with much >>shorter wait periods, and PR#1125 now aims to re-set the 5 seconds >>wait-on-start to 10 millis. >>I think this is a good example of Dan's points here: Kafka reader can >>clearly return much faster (millis not seconds) and the DirectRunner >>accommodates this now by reusing it's readers. >> >>I Hope I didn't forget/misunderstood anything (please correct me if I >>did). >> >>Thanks, >>Amit