Re: [DISCUSS] Sources and Runners

Thomas Weise Wed, 19 Oct 2016 10:58:10 -0700

Hadoop FS has the local file system implementation that can be used for
testing ("file" URL, no service needed).


Thanks

On Wed, Oct 19, 2016 at 10:43 AM, Amit Sela <[email protected]> wrote:

> Oh cool, that didn't exist in 0.8 I think, but anything that is Kafka
> native is best.
> I'm pretty sure there's an embedded HDFS for testing as well.
>
> While embedded Kafka/HDFS won't reflect "real-life" distributed
> environment, it could be a good place to start and provide some basic
> functional testing.
>
> On Wed, Oct 19, 2016 at 8:25 PM Satish Duggana <[email protected]> wrote:
>
> >
> > https://github.com/apache/kafka/blob/trunk/streams/src/
> test/java/org/apache/kafka/streams/integration/utils/
> EmbeddedKafkaCluster.java
> >
> > This is currently used in one of our repos and it comes as part of one of
> > kafka libs.
> >
> > On Wed, Oct 19, 2016 at 10:49 PM, Amit Sela <[email protected]>
> wrote:
> >
> > > The SparkRunner actually has an embedded Kafka for its unit tests.
> > >
> > > On Wed, Oct 19, 2016, 20:16 Thomas Weise <[email protected]> wrote:
> > >
> > > > Kafka can be embedded for the integration testing, which should
> > > > significantly simplify the setup.
> > > >
> > > > Here is an example I found:
> > > >
> > > > https://gist.github.com/fjavieralba/7930018
> > > >
> > > > Thanks,
> > > > Thomas
> > > >
> > > >
> > > >
> > > >
> > > > On Wed, Oct 19, 2016 at 9:44 AM, Dan Halperin
> > > <[email protected]
> > > > >
> > > > wrote:
> > > >
> > > > > My thoughts:
> > > > >
> > > > > * It's worth reading the Beam testing
> > > > > <http://beam.incubator.apache.org/contribute/testing/> document
> that
> > > > Jason
> > > > > Kuster wrote!
> > > > > * Beam already has support for "End-to-end" integration tests, of
> > > > examples
> > > > > (e.g., WordCountIT
> > > > > <https://github.com/apache/incubator-beam/blob/master/
> > > > > examples/java/src/test/java/org/apache/beam/examples/
> > > WordCountIT.java>)
> > > > > or data stores (e.g., BigtableReadIT
> > > > > <https://github.com/apache/incubator-beam/blob/master/
> > > > > sdks/java/io/google-cloud-platform/src/test/java/org/
> > > > > apache/beam/sdk/io/gcp/bigtable/BigtableReadIT.java>
> > > > > ).
> > > > >
> > > > > It sounds like we're is in agreement that we want one of these for
> > > > KafkaIO,
> > > > > too. (Among others,) This now circles back to other issues of
> testing
> > > > > (which are not yet covered in that doc):
> > > > >
> > > > > * Unlike Bigtable, Kafka is not a cloud service. We'll either need
> a
> > > > > permanent testing cluster or the ability to spin one up
> dynamically.
> > > That
> > > > > work is hard, but we need to figure it out. (Note: the cluster
> needs
> > to
> > > > be
> > > > > accessible from all runners and real clusters, so not just the
> local
> > > > > machine like the integration tests.)
> > > > > * Right now, only WordCountIT and a few other example ITs are run
> on
> > > all
> > > > > runners. We need to add per-runner postcommit suites that run all
> ITs
> > > in
> > > > > the project.
> > > > >
> > > > > Jason and several others of us have been thinking hard about the
> best
> > > > ways
> > > > > to build these tests and necessary test infrastructure. (See the
> > > > > performance thread Jason started. IMO the most important issue to
> > solve
> > > > > first is infrastructure). Please help!
> > > > >
> > > > > Dan
> > > > >
> > > > > On Wed, Oct 19, 2016 at 7:37 AM, Thomas Weise <[email protected]>
> > wrote:
> > > > >
> > > > > > +1 those are probably the most used sources. Hadoop FS has a
> number
> > > of
> > > > > > different implementations, HDFS is one of them.
> > > > > >
> > > > > > On Wed, Oct 19, 2016 at 2:55 AM, Amit Sela <[email protected]
> >
> > > > wrote:
> > > > > >
> > > > > > > I agree with Aljoscha about Kafka.
> > > > > > >
> > > > > > > How about having one integration test for BoundedSource and one
> > for
> > > > > > > UnboundedSource ? from apache perspective it makes sense to
> test
> > > this
> > > > > > > end-to-end on HDFS and Kafka (respectively).
> > > > > > >
> > > > > > > On Wed, Oct 19, 2016 at 11:34 AM Aljoscha Krettek <
> > > > [email protected]
> > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > +Jason, looping him in directly because he might have an
> > opinion
> > > on
> > > > > > what
> > > > > > > > I'm going to say.
> > > > > > > >
> > > > > > > > Should we maybe add integration tests that verify that all
> > > runners
> > > > > can
> > > > > > > > correctly read from and write to an external system in a
> > complete
> > > > > > > Pipeline.
> > > > > > > > At least for Kafka, which seems to be the most used option in
> > the
> > > > > > > > open-source space, this would make a lot of sense, IMHO.
> > > > > > > >
> > > > > > > > On Tue, 18 Oct 2016 at 21:49 Robert Bradshaw
> > > > > > <[email protected]
> > > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Eventually we'll be able to communicate intent with the
> > runner
> > > > much
> > > > > > > > > more directly via the ProcessContinuation object:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > https://github.com/apache/incubator-beam/blob/
> > > > > > > a0f649eaca8d8bd47d22db0ba7150fea1bf07975/sdks/java/core/src/
> > > > > > > main/java/org/apache/beam/sdk/transforms/DoFn.java#L658
> > > > > > > > >
> > > > > > > > > On Tue, Oct 18, 2016 at 12:44 PM, Jean-Baptiste Onofré <
> > > > > > > [email protected]>
> > > > > > > > > wrote:
> > > > > > > > > > Thanks for the update and summary.
> > > > > > > > > >
> > > > > > > > > > Regards
> > > > > > > > > > JB
> > > > > > > > > >
> > > > > > > > > > ⁣
> > > > > > > > > >
> > > > > > > > > > On Oct 18, 2016, 20:47, at 20:47, Amit Sela <
> > > > > [email protected]>
> > > > > > > > > wrote:
> > > > > > > > > >>I wanted to summarize here a couple of important points
> > > raised
> > > > in
> > > > > > > some
> > > > > > > > > >>PRs
> > > > > > > > > >>I was involved with, and while those PRs were about
> KafkaIO
> > > and
> > > > > > > related
> > > > > > > > > >>to
> > > > > > > > > >>the Spark/Direct runners, some important notes were made
> > and
> > > I
> > > > > > think
> > > > > > > > > >>they
> > > > > > > > > >>are worth this thread.
> > > > > > > > > >>
> > > > > > > > > >>*Background*:
> > > > > > > > > >>The KafkaIO waits (5 seconds) before starting to read,
> and
> > > (10
> > > > > > > millis)
> > > > > > > > > >>between advancing the reader, which is problematic for
> the
> > > > Spark
> > > > > > > runner
> > > > > > > > > >>as
> > > > > > > > > >>it might attempt to read (every microbatch) for a shorter
> > > > period,
> > > > > > and
> > > > > > > > > >>so it
> > > > > > > > > >>will never start.
> > > > > > > > > >>Raghu Angadi mentioned in this conversation
> > > > > > > > > >><https://github.com/apache/incubator-beam/pull/1071>
> that
> > > > > > originally
> > > > > > > > > >>the
> > > > > > > > > >>wait period was set to 10 millis for both start() and
> > > > advance(),
> > > > > > but
> > > > > > > it
> > > > > > > > > >>was
> > > > > > > > > >>changed mostly due to DirectRunner issues.
> > > > > > > > > >>PR#1125 <
> > https://github.com/apache/incubator-beam/pull/1125>
> > > > > > > > originally
> > > > > > > > > >>attempted to allow runners to configure those properties
> > via
> > > > > > > > > >>PipelineOptions, but Dan Halperin raised some interesting
> > > > points:
> > > > > > > > > >>
> > > > > > > > > >>   - Readers should return as soon as they are able.
> > > > > > > > > >> - Runners may poll advance() in a loop for a certain
> > period
> > > of
> > > > > > time
> > > > > > > if
> > > > > > > > > >>   it returned too fast.
> > > > > > > > > >>   - Runners must tolerate sources that take a long time
> to
> > > > start
> > > > > > or
> > > > > > > > > >>   advance, because real systems operate that way.
> > > > > > > > > >>
> > > > > > > > > >>Dan suggested (and I clearly agreed) that this should be
> > > > > discussed
> > > > > > > > > >>here.
> > > > > > > > > >>
> > > > > > > > > >>BTW, Thomas Groh mentioned that the DirectRunner is OK
> now
> > > with
> > > > > > much
> > > > > > > > > >>shorter wait periods, and PR#1125 now aims to re-set the
> 5
> > > > > seconds
> > > > > > > > > >>wait-on-start to 10 millis.
> > > > > > > > > >>I think this is a good example of Dan's points here:
> Kafka
> > > > > reader
> > > > > > > can
> > > > > > > > > >>clearly return much faster (millis not seconds) and the
> > > > > > DirectRunner
> > > > > > > > > >>accommodates this now by reusing it's readers.
> > > > > > > > > >>
> > > > > > > > > >>I Hope I didn't forget/misunderstood anything (please
> > correct
> > > > me
> > > > > > if I
> > > > > > > > > >>did).
> > > > > > > > > >>
> > > > > > > > > >>Thanks,
> > > > > > > > > >>Amit
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Sources and Runners

Reply via email to