I wanted to summarize here a couple of important points raised in some PRs
I was involved with, and while those PRs were about KafkaIO and related to
the Spark/Direct runners, some important notes were made and I think they
are worth this thread.

*Background*:
The KafkaIO waits (5 seconds) before starting to read, and (10 millis)
between advancing the reader, which is problematic for the Spark runner as
it might attempt to read (every microbatch) for a shorter period, and so it
will never start.
Raghu Angadi mentioned in this conversation
<https://github.com/apache/incubator-beam/pull/1071> that originally the
wait period was set to 10 millis for both start() and advance(), but it was
changed mostly due to DirectRunner issues.
PR#1125 <https://github.com/apache/incubator-beam/pull/1125> originally
attempted to allow runners to configure those properties via
PipelineOptions, but Dan Halperin raised some interesting points:

   - Readers should return as soon as they are able.
   - Runners may poll advance() in a loop for a certain period of time if
   it returned too fast.
   - Runners must tolerate sources that take a long time to start or
   advance, because real systems operate that way.

Dan suggested (and I clearly agreed) that this should be discussed here.

BTW, Thomas Groh mentioned that the DirectRunner is OK now with much
shorter wait periods, and PR#1125 now aims to re-set the 5 seconds
wait-on-start to 10 millis.
I think this is a good example of Dan's points here:  Kafka reader can
clearly return much faster (millis not seconds) and the DirectRunner
accommodates this now by reusing it's readers.

I Hope I didn't forget/misunderstood anything (please correct me if I did).

Thanks,
Amit

Reply via email to