Re: [DISCUSS] Sources and Runners

Jean-Baptiste Onofré Wed, 19 Oct 2016 11:03:48 -0700

Hi

FYI when working on IO I already setup a docker image that I'm using for 
integration test.


The IO unit tests embed and bootstrap the IO resources when possible. For 
instance JmsIO unit tests start a embedded ActiveMQ broker. However I also have 
a ActiveMQ docker image that I use for integration test. I'm using a 
mesos/marathon cluster where I test the IO in "real condition". It contains the 
engine cluster (flink, spark, apex, ...) and the IO resources (elastocsearch, 
Cassandra, ActiveMQ, ...). These itests are manual (bootstrapping and check 
done by hand). But I think it could be possible to automatize.

Just my $0.01 ;)

Regards
JB

⁣

On Oct 19, 2016, 19:16, at 19:16, Thomas Weise <[email protected]> wrote:
>Kafka can be embedded for the integration testing, which should
>significantly simplify the setup.
>
>Here is an example I found:
>
>https://gist.github.com/fjavieralba/7930018
>
>Thanks,
>Thomas
>
>
>
>
>On Wed, Oct 19, 2016 at 9:44 AM, Dan Halperin
><[email protected]>
>wrote:
>
>> My thoughts:
>>
>> * It's worth reading the Beam testing
>> <http://beam.incubator.apache.org/contribute/testing/> document that
>Jason
>> Kuster wrote!
>> * Beam already has support for "End-to-end" integration tests, of
>examples
>> (e.g., WordCountIT
>> <https://github.com/apache/incubator-beam/blob/master/
>>
>examples/java/src/test/java/org/apache/beam/examples/WordCountIT.java>)
>> or data stores (e.g., BigtableReadIT
>> <https://github.com/apache/incubator-beam/blob/master/
>> sdks/java/io/google-cloud-platform/src/test/java/org/
>> apache/beam/sdk/io/gcp/bigtable/BigtableReadIT.java>
>> ).
>>
>> It sounds like we're is in agreement that we want one of these for
>KafkaIO,
>> too. (Among others,) This now circles back to other issues of testing
>> (which are not yet covered in that doc):
>>
>> * Unlike Bigtable, Kafka is not a cloud service. We'll either need a
>> permanent testing cluster or the ability to spin one up dynamically.
>That
>> work is hard, but we need to figure it out. (Note: the cluster needs
>to be
>> accessible from all runners and real clusters, so not just the local
>> machine like the integration tests.)
>> * Right now, only WordCountIT and a few other example ITs are run on
>all
>> runners. We need to add per-runner postcommit suites that run all ITs
>in
>> the project.
>>
>> Jason and several others of us have been thinking hard about the best
>ways
>> to build these tests and necessary test infrastructure. (See the
>> performance thread Jason started. IMO the most important issue to
>solve
>> first is infrastructure). Please help!
>>
>> Dan
>>
>> On Wed, Oct 19, 2016 at 7:37 AM, Thomas Weise <[email protected]> wrote:
>>
>> > +1 those are probably the most used sources. Hadoop FS has a number
>of
>> > different implementations, HDFS is one of them.
>> >
>> > On Wed, Oct 19, 2016 at 2:55 AM, Amit Sela <[email protected]>
>wrote:
>> >
>> > > I agree with Aljoscha about Kafka.
>> > >
>> > > How about having one integration test for BoundedSource and one
>for
>> > > UnboundedSource ? from apache perspective it makes sense to test
>this
>> > > end-to-end on HDFS and Kafka (respectively).
>> > >
>> > > On Wed, Oct 19, 2016 at 11:34 AM Aljoscha Krettek
><[email protected]
>> >
>> > > wrote:
>> > >
>> > > > +Jason, looping him in directly because he might have an
>opinion on
>> > what
>> > > > I'm going to say.
>> > > >
>> > > > Should we maybe add integration tests that verify that all
>runners
>> can
>> > > > correctly read from and write to an external system in a
>complete
>> > > Pipeline.
>> > > > At least for Kafka, which seems to be the most used option in
>the
>> > > > open-source space, this would make a lot of sense, IMHO.
>> > > >
>> > > > On Tue, 18 Oct 2016 at 21:49 Robert Bradshaw
>> > <[email protected]
>> > > >
>> > > > wrote:
>> > > >
>> > > > > Eventually we'll be able to communicate intent with the
>runner much
>> > > > > more directly via the ProcessContinuation object:
>> > > > >
>> > > > >
>> > > > >
>> > > > https://github.com/apache/incubator-beam/blob/
>> > > a0f649eaca8d8bd47d22db0ba7150fea1bf07975/sdks/java/core/src/
>> > > main/java/org/apache/beam/sdk/transforms/DoFn.java#L658
>> > > > >
>> > > > > On Tue, Oct 18, 2016 at 12:44 PM, Jean-Baptiste Onofré <
>> > > [email protected]>
>> > > > > wrote:
>> > > > > > Thanks for the update and summary.
>> > > > > >
>> > > > > > Regards
>> > > > > > JB
>> > > > > >
>> > > > > > ⁣
>> > > > > >
>> > > > > > On Oct 18, 2016, 20:47, at 20:47, Amit Sela <
>> [email protected]>
>> > > > > wrote:
>> > > > > >>I wanted to summarize here a couple of important points
>raised in
>> > > some
>> > > > > >>PRs
>> > > > > >>I was involved with, and while those PRs were about KafkaIO
>and
>> > > related
>> > > > > >>to
>> > > > > >>the Spark/Direct runners, some important notes were made
>and I
>> > think
>> > > > > >>they
>> > > > > >>are worth this thread.
>> > > > > >>
>> > > > > >>*Background*:
>> > > > > >>The KafkaIO waits (5 seconds) before starting to read, and
>(10
>> > > millis)
>> > > > > >>between advancing the reader, which is problematic for the
>Spark
>> > > runner
>> > > > > >>as
>> > > > > >>it might attempt to read (every microbatch) for a shorter
>period,
>> > and
>> > > > > >>so it
>> > > > > >>will never start.
>> > > > > >>Raghu Angadi mentioned in this conversation
>> > > > > >><https://github.com/apache/incubator-beam/pull/1071> that
>> > originally
>> > > > > >>the
>> > > > > >>wait period was set to 10 millis for both start() and
>advance(),
>> > but
>> > > it
>> > > > > >>was
>> > > > > >>changed mostly due to DirectRunner issues.
>> > > > > >>PR#1125
><https://github.com/apache/incubator-beam/pull/1125>
>> > > > originally
>> > > > > >>attempted to allow runners to configure those properties
>via
>> > > > > >>PipelineOptions, but Dan Halperin raised some interesting
>points:
>> > > > > >>
>> > > > > >>   - Readers should return as soon as they are able.
>> > > > > >> - Runners may poll advance() in a loop for a certain
>period of
>> > time
>> > > if
>> > > > > >>   it returned too fast.
>> > > > > >>   - Runners must tolerate sources that take a long time to
>start
>> > or
>> > > > > >>   advance, because real systems operate that way.
>> > > > > >>
>> > > > > >>Dan suggested (and I clearly agreed) that this should be
>> discussed
>> > > > > >>here.
>> > > > > >>
>> > > > > >>BTW, Thomas Groh mentioned that the DirectRunner is OK now
>with
>> > much
>> > > > > >>shorter wait periods, and PR#1125 now aims to re-set the 5
>> seconds
>> > > > > >>wait-on-start to 10 millis.
>> > > > > >>I think this is a good example of Dan's points here:  Kafka
>> reader
>> > > can
>> > > > > >>clearly return much faster (millis not seconds) and the
>> > DirectRunner
>> > > > > >>accommodates this now by reusing it's readers.
>> > > > > >>
>> > > > > >>I Hope I didn't forget/misunderstood anything (please
>correct me
>> > if I
>> > > > > >>did).
>> > > > > >>
>> > > > > >>Thanks,
>> > > > > >>Amit
>> > > > >
>> > > >
>> > >
>> >
>>

Re: [DISCUSS] Sources and Runners

Reply via email to