I wanted to summarize here a couple of important points raised in some PRs I was involved with, and while those PRs were about KafkaIO and related to the Spark/Direct runners, some important notes were made and I think they are worth this thread.
*Background*: The KafkaIO waits (5 seconds) before starting to read, and (10 millis) between advancing the reader, which is problematic for the Spark runner as it might attempt to read (every microbatch) for a shorter period, and so it will never start. Raghu Angadi mentioned in this conversation <https://github.com/apache/incubator-beam/pull/1071> that originally the wait period was set to 10 millis for both start() and advance(), but it was changed mostly due to DirectRunner issues. PR#1125 <https://github.com/apache/incubator-beam/pull/1125> originally attempted to allow runners to configure those properties via PipelineOptions, but Dan Halperin raised some interesting points: - Readers should return as soon as they are able. - Runners may poll advance() in a loop for a certain period of time if it returned too fast. - Runners must tolerate sources that take a long time to start or advance, because real systems operate that way. Dan suggested (and I clearly agreed) that this should be discussed here. BTW, Thomas Groh mentioned that the DirectRunner is OK now with much shorter wait periods, and PR#1125 now aims to re-set the 5 seconds wait-on-start to 10 millis. I think this is a good example of Dan's points here: Kafka reader can clearly return much faster (millis not seconds) and the DirectRunner accommodates this now by reusing it's readers. I Hope I didn't forget/misunderstood anything (please correct me if I did). Thanks, Amit