Re: [DISCUSS] Sources and Runners

Raghu Angadi Tue, 18 Oct 2016 12:04:48 -0700

One way I would rephrase the concern from unbounded source developer point
of view :


 - What is the recommended blocking behavior for start() and advance()?

E.g. on one extreme should they wait till there is a record? Mostly this
will be bad.
I am glad at pull/1125 <https://github.com/apache/incubator-beam/pull/1125> is
reverting the timeout back to 10millis. This is much simpler for KafkaIO
and hopefully give enough flexibility for runner to interact with such
sources optimally.

I like the three rules of thumb mentioned below. I think something like
that should be in JavaDoc for 'advance()'.



On Tue, Oct 18, 2016 at 11:46 AM, Amit Sela <[email protected]> wrote:

> I wanted to summarize here a couple of important points raised in some PRs
> I was involved with, and while those PRs were about KafkaIO and related to
> the Spark/Direct runners, some important notes were made and I think they
> are worth this thread.
>
> *Background*:
> The KafkaIO waits (5 seconds) before starting to read, and (10 millis)
> between advancing the reader, which is problematic for the Spark runner as
> it might attempt to read (every microbatch) for a shorter period, and so it
> will never start.
> Raghu Angadi mentioned in this conversation
> <https://github.com/apache/incubator-beam/pull/1071> that originally the
> wait period was set to 10 millis for both start() and advance(), but it was
> changed mostly due to DirectRunner issues.
> PR#1125 <https://github.com/apache/incubator-beam/pull/1125> originally
> attempted to allow runners to configure those properties via
> PipelineOptions, but Dan Halperin raised some interesting points:
>
>    - Readers should return as soon as they are able.
>    - Runners may poll advance() in a loop for a certain period of time if
>    it returned too fast.
>    - Runners must tolerate sources that take a long time to start or
>    advance, because real systems operate that way.
>
> Dan suggested (and I clearly agreed) that this should be discussed here.
>
> BTW, Thomas Groh mentioned that the DirectRunner is OK now with much
> shorter wait periods, and PR#1125 now aims to re-set the 5 seconds
> wait-on-start to 10 millis.
> I think this is a good example of Dan's points here:  Kafka reader can
> clearly return much faster (millis not seconds) and the DirectRunner
> accommodates this now by reusing it's readers.
>
> I Hope I didn't forget/misunderstood anything (please correct me if I did).
>
> Thanks,
> Amit
>

Re: [DISCUSS] Sources and Runners

Reply via email to