+Jason, looping him in directly because he might have an opinion on what
I'm going to say.

Should we maybe add integration tests that verify that all runners can
correctly read from and write to an external system in a complete Pipeline.
At least for Kafka, which seems to be the most used option in the
open-source space, this would make a lot of sense, IMHO.

On Tue, 18 Oct 2016 at 21:49 Robert Bradshaw <rober...@google.com.invalid>
wrote:

> Eventually we'll be able to communicate intent with the runner much
> more directly via the ProcessContinuation object:
>
>
> https://github.com/apache/incubator-beam/blob/a0f649eaca8d8bd47d22db0ba7150fea1bf07975/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/DoFn.java#L658
>
> On Tue, Oct 18, 2016 at 12:44 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
> > Thanks for the update and summary.
> >
> > Regards
> > JB
> >
> > ⁣
> >
> > On Oct 18, 2016, 20:47, at 20:47, Amit Sela <amitsel...@gmail.com>
> wrote:
> >>I wanted to summarize here a couple of important points raised in some
> >>PRs
> >>I was involved with, and while those PRs were about KafkaIO and related
> >>to
> >>the Spark/Direct runners, some important notes were made and I think
> >>they
> >>are worth this thread.
> >>
> >>*Background*:
> >>The KafkaIO waits (5 seconds) before starting to read, and (10 millis)
> >>between advancing the reader, which is problematic for the Spark runner
> >>as
> >>it might attempt to read (every microbatch) for a shorter period, and
> >>so it
> >>will never start.
> >>Raghu Angadi mentioned in this conversation
> >><https://github.com/apache/incubator-beam/pull/1071> that originally
> >>the
> >>wait period was set to 10 millis for both start() and advance(), but it
> >>was
> >>changed mostly due to DirectRunner issues.
> >>PR#1125 <https://github.com/apache/incubator-beam/pull/1125> originally
> >>attempted to allow runners to configure those properties via
> >>PipelineOptions, but Dan Halperin raised some interesting points:
> >>
> >>   - Readers should return as soon as they are able.
> >> - Runners may poll advance() in a loop for a certain period of time if
> >>   it returned too fast.
> >>   - Runners must tolerate sources that take a long time to start or
> >>   advance, because real systems operate that way.
> >>
> >>Dan suggested (and I clearly agreed) that this should be discussed
> >>here.
> >>
> >>BTW, Thomas Groh mentioned that the DirectRunner is OK now with much
> >>shorter wait periods, and PR#1125 now aims to re-set the 5 seconds
> >>wait-on-start to 10 millis.
> >>I think this is a good example of Dan's points here:  Kafka reader can
> >>clearly return much faster (millis not seconds) and the DirectRunner
> >>accommodates this now by reusing it's readers.
> >>
> >>I Hope I didn't forget/misunderstood anything (please correct me if I
> >>did).
> >>
> >>Thanks,
> >>Amit
>

Reply via email to