Re: [DISCUSS] Sources and Runners

Dan Halperin Wed, 19 Oct 2016 09:45:06 -0700

My thoughts:

* It's worth reading the Beam testing
<http://beam.incubator.apache.org/contribute/testing/> document that Jason
Kuster wrote!
* Beam already has support for "end-to-end" integration tests of examples
(e.g., WordCountIT
<https://github.com/apache/incubator-beam/blob/master/examples/java/src/test/java/org/apache/beam/examples/WordCountIT.java>)
or data stores (e.g., BigtableReadIT
<https://github.com/apache/incubator-beam/blob/master/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigtable/BigtableReadIT.java>).


It sounds like we're in agreement that we want one of these for KafkaIO,
too (among others). This now circles back to other testing issues
(which are not yet covered in that doc):

* Unlike Bigtable, Kafka is not a cloud service. We'll either need a
permanent testing cluster or the ability to spin one up dynamically. That
work is hard, but we need to figure it out. (Note: the cluster needs to be
accessible from all runners and real clusters, so not just the local
machine like the integration tests.)
* Right now, only WordCountIT and a few other example ITs are run on all
runners. We need to add per-runner postcommit suites that run all ITs in
the project.

Jason and several others of us have been thinking hard about the best ways
to build these tests and necessary test infrastructure. (See the
performance thread Jason started. IMO the most important issue to solve
first is infrastructure). Please help!

Dan

On Wed, Oct 19, 2016 at 7:37 AM, Thomas Weise <[email protected]> wrote:

> +1, those are probably the most used sources. Hadoop FS has a number of
> different implementations; HDFS is one of them.
>
> On Wed, Oct 19, 2016 at 2:55 AM, Amit Sela <[email protected]> wrote:
>
> > I agree with Aljoscha about Kafka.
> >
> > How about having one integration test for BoundedSource and one for
> > UnboundedSource? From an Apache perspective, it makes sense to test these
> > end-to-end on HDFS and Kafka (respectively).
> >
> > On Wed, Oct 19, 2016 at 11:34 AM Aljoscha Krettek <[email protected]>
> > wrote:
> >
> > > +Jason, looping him in directly because he might have an opinion on what
> > > I'm going to say.
> > >
> > > Should we maybe add integration tests that verify that all runners can
> > > correctly read from and write to an external system in a complete
> > > Pipeline? At least for Kafka, which seems to be the most used option in
> > > the open-source space, this would make a lot of sense, IMHO.
> > >
> > > > On Tue, 18 Oct 2016 at 21:49 Robert Bradshaw <[email protected]> wrote:
> > >
> > > > Eventually we'll be able to communicate intent with the runner much
> > > > more directly via the ProcessContinuation object:
> > > >
> > > >
> > > >
> > > https://github.com/apache/incubator-beam/blob/
> > a0f649eaca8d8bd47d22db0ba7150fea1bf07975/sdks/java/core/src/
> > main/java/org/apache/beam/sdk/transforms/DoFn.java#L658
> > > >
> > > > > On Tue, Oct 18, 2016 at 12:44 PM, Jean-Baptiste Onofré <[email protected]> wrote:
> > > > > Thanks for the update and summary.
> > > > >
> > > > > Regards
> > > > > JB
> > > > >
> > > > > ⁣
> > > > >
> > > > > > On Oct 18, 2016, at 20:47, Amit Sela <[email protected]> wrote:
> > > > > >>I wanted to summarize here a couple of important points raised in some
> > > > > >>PRs I was involved with. While those PRs were about KafkaIO and the
> > > > > >>Spark/Direct runners, some important notes were made that I think are
> > > > > >>worth discussing in this thread.
> > > > >>
> > > > > >>*Background*:
> > > > > >>KafkaIO waits (5 seconds) before starting to read, and (10 millis)
> > > > > >>between advancing the reader. This is problematic for the Spark runner:
> > > > > >>it might attempt to read (in every microbatch) for a shorter period,
> > > > > >>and so the read will never start.
> > > > > >>Raghu Angadi mentioned in this conversation
> > > > > >><https://github.com/apache/incubator-beam/pull/1071> that originally
> > > > > >>the wait period was set to 10 millis for both start() and advance(),
> > > > > >>but it was changed mostly due to DirectRunner issues.
> > > > > >>PR#1125 <https://github.com/apache/incubator-beam/pull/1125> originally
> > > > > >>attempted to allow runners to configure those properties via
> > > > > >>PipelineOptions, but Dan Halperin raised some interesting points:
> > > > >>
> > > > > >>   - Readers should return as soon as they are able.
> > > > > >>   - Runners may poll advance() in a loop for a certain period of time
> > > > > >>   if it returned too fast.
> > > > > >>   - Runners must tolerate sources that take a long time to start or
> > > > > >>   advance, because real systems operate that way.
> > > > >>
> > > > >>Dan suggested (and I clearly agreed) that this should be discussed
> > > > >>here.
> > > > >>
> > > > > >>BTW, Thomas Groh mentioned that the DirectRunner is OK now with much
> > > > > >>shorter wait periods, and PR#1125 now aims to reset the 5-second
> > > > > >>wait-on-start to 10 millis.
> > > > > >>I think this is a good example of Dan's points here: the Kafka reader
> > > > > >>can clearly return much faster (millis, not seconds), and the
> > > > > >>DirectRunner accommodates this now by reusing its readers.
> > > > >>
> > > > > >>I hope I didn't forget or misunderstand anything (please correct me if
> > > > > >>I did).
> > > > >>
> > > > >>Thanks,
> > > > >>Amit
> > > >
> > >
> >
>
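
For illustration only, the three points quoted above (readers return as
soon as they are able; runners may poll advance() in a loop; runners must
tolerate slow sources) can be sketched roughly as follows. This is a
minimal Python sketch and not Beam code: FakeReader, read_microbatch, and
all of their parameters are hypothetical names invented for this example,
and the actual readers discussed in the thread are Java classes.

```python
import time


class FakeReader:
    """Hypothetical stand-in for an unbounded reader (NOT Beam's API).

    advance() never blocks: it returns True if a record became current and
    False otherwise. `empty_polls_between` simulates a source that needs
    several polls before the next record is available.
    """

    def __init__(self, records, empty_polls_between=0):
        self._records = list(records)
        self._gap = empty_polls_between
        self._countdown = empty_polls_between
        self.current = None

    def advance(self):
        if not self._records:
            return False  # nothing available right now; return immediately
        if self._countdown > 0:
            self._countdown -= 1
            return False  # source not ready yet; still return immediately
        self.current = self._records.pop(0)
        self._countdown = self._gap
        return True


def read_microbatch(reader, budget_secs, max_records,
                    clock=time.monotonic, sleep=time.sleep,
                    backoff_secs=0.001):
    """Runner-side loop: poll advance() until the time budget or the record
    cap is reached. The waiting policy lives here, not in the reader."""
    batch = []
    deadline = clock() + budget_secs
    while len(batch) < max_records and clock() < deadline:
        if reader.advance():
            batch.append(reader.current)
        else:
            sleep(backoff_secs)  # reader returned too fast; back off briefly
    return batch
```

Keeping the same reader object across calls means a slow source only
delays data to a later microbatch instead of starving the read entirely,
which is the effect attributed to reader reuse in the thread.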
