Hadoop FS has the local file system implementation that can be used for
testing ("file" URL, no service needed).Thanks On Wed, Oct 19, 2016 at 10:43 AM, Amit Sela <[email protected]> wrote: > Oh cool, that didn't exist in 0.8 I think, but anything that is Kafka > native is best. > I'm pretty sure there's an embedded HDFS for testing as well. > > While embedded Kafka/HDFS won't reflect "real-life" distributed > environment, it could be a good place to start and provide some basic > functional testing. > > On Wed, Oct 19, 2016 at 8:25 PM Satish Duggana <[email protected]> wrote: > > > > > https://github.com/apache/kafka/blob/trunk/streams/src/ > test/java/org/apache/kafka/streams/integration/utils/ > EmbeddedKafkaCluster.java > > > > This is currently used in one of our repos and it comes as part of one of > > kafka libs. > > > > On Wed, Oct 19, 2016 at 10:49 PM, Amit Sela <[email protected]> > wrote: > > > > > The SparkRunner actually has an embedded Kafka for its unit tests. > > > > > > On Wed, Oct 19, 2016, 20:16 Thomas Weise <[email protected]> wrote: > > > > > > > Kafka can be embedded for the integration testing, which should > > > > significantly simplify the setup. > > > > > > > > Here is an example I found: > > > > > > > > https://gist.github.com/fjavieralba/7930018 > > > > > > > > Thanks, > > > > Thomas > > > > > > > > > > > > > > > > > > > > On Wed, Oct 19, 2016 at 9:44 AM, Dan Halperin > > > <[email protected] > > > > > > > > > wrote: > > > > > > > > > My thoughts: > > > > > > > > > > * It's worth reading the Beam testing > > > > > <http://beam.incubator.apache.org/contribute/testing/> document > that > > > > Jason > > > > > Kuster wrote! > > > > > * Beam already has support for "End-to-end" integration tests, of > > > > examples > > > > > (e.g., WordCountIT > > > > > <https://github.com/apache/incubator-beam/blob/master/ > > > > > examples/java/src/test/java/org/apache/beam/examples/ > > > WordCountIT.java>) > > > > > or data stores (e.g., BigtableReadIT > > > > > <https://github.com/apache/incubator-beam/blob/master/ > > > > > sdks/java/io/google-cloud-platform/src/test/java/org/ > > > > > apache/beam/sdk/io/gcp/bigtable/BigtableReadIT.java> > > > > > ). > > > > > > > > > > It sounds like we're is in agreement that we want one of these for > > > > KafkaIO, > > > > > too. (Among others,) This now circles back to other issues of > testing > > > > > (which are not yet covered in that doc): > > > > > > > > > > * Unlike Bigtable, Kafka is not a cloud service. We'll either need > a > > > > > permanent testing cluster or the ability to spin one up > dynamically. > > > That > > > > > work is hard, but we need to figure it out. (Note: the cluster > needs > > to > > > > be > > > > > accessible from all runners and real clusters, so not just the > local > > > > > machine like the integration tests.) > > > > > * Right now, only WordCountIT and a few other example ITs are run > on > > > all > > > > > runners. We need to add per-runner postcommit suites that run all > ITs > > > in > > > > > the project. > > > > > > > > > > Jason and several others of us have been thinking hard about the > best > > > > ways > > > > > to build these tests and necessary test infrastructure. (See the > > > > > performance thread Jason started. IMO the most important issue to > > solve > > > > > first is infrastructure). Please help! > > > > > > > > > > Dan > > > > > > > > > > On Wed, Oct 19, 2016 at 7:37 AM, Thomas Weise <[email protected]> > > wrote: > > > > > > > > > > > +1 those are probably the most used sources. Hadoop FS has a > number > > > of > > > > > > different implementations, HDFS is one of them. > > > > > > > > > > > > On Wed, Oct 19, 2016 at 2:55 AM, Amit Sela <[email protected] > > > > > > wrote: > > > > > > > > > > > > > I agree with Aljoscha about Kafka. > > > > > > > > > > > > > > How about having one integration test for BoundedSource and one > > for > > > > > > > UnboundedSource ? from apache perspective it makes sense to > test > > > this > > > > > > > end-to-end on HDFS and Kafka (respectively). > > > > > > > > > > > > > > On Wed, Oct 19, 2016 at 11:34 AM Aljoscha Krettek < > > > > [email protected] > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > +Jason, looping him in directly because he might have an > > opinion > > > on > > > > > > what > > > > > > > > I'm going to say. > > > > > > > > > > > > > > > > Should we maybe add integration tests that verify that all > > > runners > > > > > can > > > > > > > > correctly read from and write to an external system in a > > complete > > > > > > > Pipeline. > > > > > > > > At least for Kafka, which seems to be the most used option in > > the > > > > > > > > open-source space, this would make a lot of sense, IMHO. > > > > > > > > > > > > > > > > On Tue, 18 Oct 2016 at 21:49 Robert Bradshaw > > > > > > <[email protected] > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > Eventually we'll be able to communicate intent with the > > runner > > > > much > > > > > > > > > more directly via the ProcessContinuation object: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/apache/incubator-beam/blob/ > > > > > > > a0f649eaca8d8bd47d22db0ba7150fea1bf07975/sdks/java/core/src/ > > > > > > > main/java/org/apache/beam/sdk/transforms/DoFn.java#L658 > > > > > > > > > > > > > > > > > > On Tue, Oct 18, 2016 at 12:44 PM, Jean-Baptiste Onofré < > > > > > > > [email protected]> > > > > > > > > > wrote: > > > > > > > > > > Thanks for the update and summary. > > > > > > > > > > > > > > > > > > > > Regards > > > > > > > > > > JB > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Oct 18, 2016, 20:47, at 20:47, Amit Sela < > > > > > [email protected]> > > > > > > > > > wrote: > > > > > > > > > >>I wanted to summarize here a couple of important points > > > raised > > > > in > > > > > > > some > > > > > > > > > >>PRs > > > > > > > > > >>I was involved with, and while those PRs were about > KafkaIO > > > and > > > > > > > related > > > > > > > > > >>to > > > > > > > > > >>the Spark/Direct runners, some important notes were made > > and > > > I > > > > > > think > > > > > > > > > >>they > > > > > > > > > >>are worth this thread. > > > > > > > > > >> > > > > > > > > > >>*Background*: > > > > > > > > > >>The KafkaIO waits (5 seconds) before starting to read, > and > > > (10 > > > > > > > millis) > > > > > > > > > >>between advancing the reader, which is problematic for > the > > > > Spark > > > > > > > runner > > > > > > > > > >>as > > > > > > > > > >>it might attempt to read (every microbatch) for a shorter > > > > period, > > > > > > and > > > > > > > > > >>so it > > > > > > > > > >>will never start. > > > > > > > > > >>Raghu Angadi mentioned in this conversation > > > > > > > > > >><https://github.com/apache/incubator-beam/pull/1071> > that > > > > > > originally > > > > > > > > > >>the > > > > > > > > > >>wait period was set to 10 millis for both start() and > > > > advance(), > > > > > > but > > > > > > > it > > > > > > > > > >>was > > > > > > > > > >>changed mostly due to DirectRunner issues. > > > > > > > > > >>PR#1125 < > > https://github.com/apache/incubator-beam/pull/1125> > > > > > > > > originally > > > > > > > > > >>attempted to allow runners to configure those properties > > via > > > > > > > > > >>PipelineOptions, but Dan Halperin raised some interesting > > > > points: > > > > > > > > > >> > > > > > > > > > >> - Readers should return as soon as they are able. > > > > > > > > > >> - Runners may poll advance() in a loop for a certain > > period > > > of > > > > > > time > > > > > > > if > > > > > > > > > >> it returned too fast. > > > > > > > > > >> - Runners must tolerate sources that take a long time > to > > > > start > > > > > > or > > > > > > > > > >> advance, because real systems operate that way. > > > > > > > > > >> > > > > > > > > > >>Dan suggested (and I clearly agreed) that this should be > > > > > discussed > > > > > > > > > >>here. > > > > > > > > > >> > > > > > > > > > >>BTW, Thomas Groh mentioned that the DirectRunner is OK > now > > > with > > > > > > much > > > > > > > > > >>shorter wait periods, and PR#1125 now aims to re-set the > 5 > > > > > seconds > > > > > > > > > >>wait-on-start to 10 millis. > > > > > > > > > >>I think this is a good example of Dan's points here: > Kafka > > > > > reader > > > > > > > can > > > > > > > > > >>clearly return much faster (millis not seconds) and the > > > > > > DirectRunner > > > > > > > > > >>accommodates this now by reusing it's readers. > > > > > > > > > >> > > > > > > > > > >>I Hope I didn't forget/misunderstood anything (please > > correct > > > > me > > > > > > if I > > > > > > > > > >>did). > > > > > > > > > >> > > > > > > > > > >>Thanks, > > > > > > > > > >>Amit > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
