Re: [DISCUSS] Sources and Runners

Thomas Weise Wed, 19 Oct 2016 07:39:04 -0700

+1, those are probably the most used sources. Hadoop FS has a number of
different implementations; HDFS is one of them.


On Wed, Oct 19, 2016 at 2:55 AM, Amit Sela <[email protected]> wrote:

> I agree with Aljoscha about Kafka.
>
> How about having one integration test for BoundedSource and one for
> UnboundedSource? From an Apache perspective it makes sense to test this
> end-to-end on HDFS and Kafka (respectively).
>
> On Wed, Oct 19, 2016 at 11:34 AM Aljoscha Krettek <[email protected]>
> wrote:
>
> > +Jason, looping him in directly because he might have an opinion on what
> > I'm going to say.
> >
> > Should we maybe add integration tests that verify that all runners can
> > correctly read from and write to an external system in a complete
> > Pipeline? At least for Kafka, which seems to be the most used option in
> > the open-source space, this would make a lot of sense, IMHO.
> >
> > On Tue, 18 Oct 2016 at 21:49 Robert Bradshaw <[email protected]>
> > wrote:
> >
> > > Eventually we'll be able to communicate intent with the runner much
> > > more directly via the ProcessContinuation object:
> > >
> > >
> > >
> > > https://github.com/apache/incubator-beam/blob/a0f649eaca8d8bd47d22db0ba7150fea1bf07975/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/DoFn.java#L658
> > >
> > > On Tue, Oct 18, 2016 at 12:44 PM, Jean-Baptiste Onofré <[email protected]>
> > > wrote:
> > > > Thanks for the update and summary.
> > > >
> > > > Regards
> > > > JB
> > > >
> > > > On Oct 18, 2016, at 20:47, Amit Sela <[email protected]> wrote:
> > > > >>I wanted to summarize here a couple of important points raised in
> > > > >>some PRs I was involved with. While those PRs were about KafkaIO
> > > > >>and related to the Spark/Direct runners, some important notes were
> > > > >>made, and I think they are worth discussing in this thread.
> > > >>
> > > > >>*Background*:
> > > > >>KafkaIO waits 5 seconds before starting to read, and 10 millis
> > > > >>between attempts to advance the reader. This is problematic for the
> > > > >>Spark runner, because it might attempt to read (every microbatch)
> > > > >>for a shorter period than the start wait, and so the reader will
> > > > >>never start.
> > > > >>Raghu Angadi mentioned in this conversation
> > > > >><https://github.com/apache/incubator-beam/pull/1071> that originally
> > > > >>the wait period was set to 10 millis for both start() and advance(),
> > > > >>but it was changed mostly due to DirectRunner issues.
> > > > >>PR#1125 <https://github.com/apache/incubator-beam/pull/1125>
> > > > >>originally attempted to allow runners to configure those properties
> > > > >>via PipelineOptions, but Dan Halperin raised some interesting points:
> > > >>
> > > > >>   - Readers should return as soon as they are able.
> > > > >>   - Runners may poll advance() in a loop for a certain period of
> > > > >>   time if it returned too fast.
> > > > >>   - Runners must tolerate sources that take a long time to start or
> > > > >>   advance, because real systems operate that way.
> > > >>
> > > >>Dan suggested (and I clearly agreed) that this should be discussed
> > > >>here.
> > > >>
> > > > >>BTW, Thomas Groh mentioned that the DirectRunner is OK now with much
> > > > >>shorter wait periods, and PR#1125 now aims to reset the 5-second
> > > > >>wait-on-start to 10 millis.
> > > > >>I think this is a good example of Dan's points here: the Kafka
> > > > >>reader can clearly return much faster (millis, not seconds), and the
> > > > >>DirectRunner accommodates this now by reusing its readers.
> > > >>
> > > > >>I hope I didn't forget or misunderstand anything (please correct me
> > > > >>if I did).
> > > >>
> > > >>Thanks,
> > > >>Amit
> > >
> >
>
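
For illustration, Dan's points quoted above (readers return as soon as they
can, and a runner may poll advance() in a loop for a bounded period) could be
sketched roughly as follows. This is only a minimal sketch under assumptions:
the Reader interface, the scripted() helper, and the budgetMillis/pauseMillis
parameters are hypothetical names made up for this example. They are not
Beam's actual reader API, nor how any particular runner is implemented.

```java
import java.util.ArrayList;
import java.util.List;

final class PollingSketch {

    /** Hypothetical reader with a start()/advance() shape; NOT Beam's real API. */
    interface Reader {
        boolean start();    // true if a record is available immediately
        boolean advance();  // true if a new record is available; returns promptly
        String current();   // the record made available by the last successful call
    }

    /**
     * Runner-side loop (an assumption, for illustration): keep polling the
     * reader until the time budget is used up, pausing briefly whenever the
     * reader came back empty.
     */
    static List<String> pollForBatch(Reader reader, long budgetMillis, long pauseMillis) {
        List<String> batch = new ArrayList<>();
        long deadline = System.nanoTime() + budgetMillis * 1_000_000L;
        boolean available = reader.start();
        while (true) {
            if (available) {
                batch.add(reader.current());
            }
            if (System.nanoTime() - deadline >= 0) {
                break; // budget exhausted; every record handed out so far was collected
            }
            if (!available) {
                try {
                    Thread.sleep(pauseMillis); // reader returned too fast with nothing
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            available = reader.advance();
        }
        return batch;
    }

    /** Hypothetical test reader: empty for the first emptyPolls calls, then yields records. */
    static Reader scripted(int emptyPolls, String... records) {
        return new Reader() {
            private int polls = 0;
            private String value = null;

            private boolean poll() {
                int idx = polls++ - emptyPolls;
                if (idx >= 0 && idx < records.length) {
                    value = records[idx];
                    return true;
                }
                return false;
            }

            @Override public boolean start() { return poll(); }
            @Override public boolean advance() { return poll(); }
            @Override public String current() { return value; }
        };
    }

    public static void main(String[] args) {
        // The reader is empty for its first three polls, then has two records.
        System.out.println(pollForBatch(scripted(3, "a", "b"), 100, 1));
    }
}
```

The point of the sketch is that the waiting policy lives in the runner's loop
rather than inside the reader: the reader never sleeps, and a runner with a
short read window (such as a microbatch) can still pick a budget that suits it.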
