Eventually we'll be able to communicate intent with the runner much
more directly via the ProcessContinuation object:

https://github.com/apache/incubator-beam/blob/a0f649eaca8d8bd47d22db0ba7150fea1bf07975/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/DoFn.java#L658

On Tue, Oct 18, 2016 at 12:44 PM, Jean-Baptiste Onofré <j...@nanthrax.net> 
wrote:
> Thanks for the update and summary.
>
> Regards
> JB
>
> ⁣
>
> On Oct 18, 2016, 20:47, at 20:47, Amit Sela <amitsel...@gmail.com> wrote:
>>I wanted to summarize here a couple of important points raised in some
>>PRs
>>I was involved with, and while those PRs were about KafkaIO and related
>>to
>>the Spark/Direct runners, some important notes were made and I think
>>they
>>are worth this thread.
>>
>>*Background*:
>>The KafkaIO waits (5 seconds) before starting to read, and (10 millis)
>>between advancing the reader, which is problematic for the Spark runner
>>as
>>it might attempt to read (every microbatch) for a shorter period, and
>>so it
>>will never start.
>>Raghu Angadi mentioned in this conversation
>><https://github.com/apache/incubator-beam/pull/1071> that originally
>>the
>>wait period was set to 10 millis for both start() and advance(), but it
>>was
>>changed mostly due to DirectRunner issues.
>>PR#1125 <https://github.com/apache/incubator-beam/pull/1125> originally
>>attempted to allow runners to configure those properties via
>>PipelineOptions, but Dan Halperin raised some interesting points:
>>
>>   - Readers should return as soon as they are able.
>> - Runners may poll advance() in a loop for a certain period of time if
>>   it returned too fast.
>>   - Runners must tolerate sources that take a long time to start or
>>   advance, because real systems operate that way.
>>
>>Dan suggested (and I clearly agreed) that this should be discussed
>>here.
>>
>>BTW, Thomas Groh mentioned that the DirectRunner is OK now with much
>>shorter wait periods, and PR#1125 now aims to re-set the 5 seconds
>>wait-on-start to 10 millis.
>>I think this is a good example of Dan's points here:  Kafka reader can
>>clearly return much faster (millis not seconds) and the DirectRunner
>>accommodates this now by reusing it's readers.
>>
>>I Hope I didn't forget/misunderstood anything (please correct me if I
>>did).
>>
>>Thanks,
>>Amit

Reply via email to