I agree with Aljoscha about Kafka. How about having one integration test for BoundedSource and one for UnboundedSource? From an Apache perspective, it makes sense to test this end-to-end on HDFS and Kafka, respectively.
On Wed, Oct 19, 2016 at 11:34 AM Aljoscha Krettek <aljos...@apache.org> wrote:

> +Jason, looping him in directly because he might have an opinion on what
> I'm going to say.
>
> Should we maybe add integration tests that verify that all runners can
> correctly read from and write to an external system in a complete Pipeline?
> At least for Kafka, which seems to be the most used option in the
> open-source space, this would make a lot of sense, IMHO.
>
> On Tue, 18 Oct 2016 at 21:49 Robert Bradshaw <rober...@google.com.invalid>
> wrote:
>
> > Eventually we'll be able to communicate intent with the runner much
> > more directly via the ProcessContinuation object:
> >
> > https://github.com/apache/incubator-beam/blob/a0f649eaca8d8bd47d22db0ba7150fea1bf07975/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/DoFn.java#L658
> >
> > On Tue, Oct 18, 2016 at 12:44 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
> > wrote:
> > > Thanks for the update and summary.
> > >
> > > Regards
> > > JB
> > >
> > > On Oct 18, 2016, 20:47, at 20:47, Amit Sela <amitsel...@gmail.com>
> > > wrote:
> > >>I wanted to summarize here a couple of important points raised in some
> > >>PRs I was involved with, and while those PRs were about KafkaIO and
> > >>related to the Spark/Direct runners, some important notes were made and
> > >>I think they are worth this thread.
> > >>
> > >>*Background*:
> > >>The KafkaIO waits (5 seconds) before starting to read, and (10 millis)
> > >>between advancing the reader, which is problematic for the Spark runner
> > >>as it might attempt to read (every microbatch) for a shorter period, and
> > >>so it will never start.
> > >>Raghu Angadi mentioned in this conversation
> > >><https://github.com/apache/incubator-beam/pull/1071> that originally
> > >>the wait period was set to 10 millis for both start() and advance(),
> > >>but it was changed mostly due to DirectRunner issues.
> > >>PR#1125 <https://github.com/apache/incubator-beam/pull/1125>
> > >>originally attempted to allow runners to configure those properties via
> > >>PipelineOptions, but Dan Halperin raised some interesting points:
> > >>
> > >> - Readers should return as soon as they are able.
> > >> - Runners may poll advance() in a loop for a certain period of time if
> > >>   it returned too fast.
> > >> - Runners must tolerate sources that take a long time to start or
> > >>   advance, because real systems operate that way.
> > >>
> > >>Dan suggested (and I clearly agreed) that this should be discussed
> > >>here.
> > >>
> > >>BTW, Thomas Groh mentioned that the DirectRunner is OK now with much
> > >>shorter wait periods, and PR#1125 now aims to re-set the 5 seconds
> > >>wait-on-start to 10 millis.
> > >>I think this is a good example of Dan's points here: Kafka reader can
> > >>clearly return much faster (millis not seconds) and the DirectRunner
> > >>accommodates this now by reusing its readers.
> > >>
> > >>I hope I didn't forget or misunderstand anything (please correct me if
> > >>I did).
> > >>
> > >>Thanks,
> > >>Amit
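
To make Dan's three points concrete, here is a rough sketch of a runner-side polling loop. It is a plain Python simulation, not Beam code: `FakeReader`, `read_microbatch`, `ManualClock` and every parameter name are hypothetical and exist only for illustration. The reader's advance() returns immediately, and the runner decides how long to keep polling within its own time budget.

```python
from typing import Callable, Iterator, List, Optional


class FakeReader:
    """Hypothetical stand-in for an unbounded reader (not Beam's API)."""

    def __init__(self, availability: List[Optional[str]]):
        # One entry per advance() call: a record, or None if nothing is ready.
        self._calls: Iterator[Optional[str]] = iter(availability)
        self.current: Optional[str] = None

    def advance(self) -> bool:
        # Returns as soon as it can; the reader never sleeps internally.
        self.current = next(self._calls, None)
        return self.current is not None


class ManualClock:
    """Deterministic clock so the sketch runs without real sleeping."""

    def __init__(self) -> None:
        self.t = 0

    def now_ms(self) -> int:
        return self.t

    def sleep_ms(self, ms: int) -> None:
        self.t += ms


def read_microbatch(
    reader: FakeReader,
    budget_ms: int,
    poll_interval_ms: int,
    now_ms: Callable[[], int],
    sleep_ms: Callable[[int], None],
) -> List[str]:
    """Poll advance() until the budget is spent; the runner does the waiting."""
    deadline = now_ms() + budget_ms
    records: List[str] = []
    while now_ms() < deadline:
        if reader.advance():
            records.append(reader.current)
        else:
            # advance() came back empty right away, so back off briefly and retry.
            sleep_ms(poll_interval_ms)
    return records


# A source that needs two polls before its first record becomes available.
clock = ManualClock()
batch = read_microbatch(
    FakeReader([None, None, "a", "b", None, "c"]),
    budget_ms=50, poll_interval_ms=10,
    now_ms=clock.now_ms, sleep_ms=clock.sleep_ms,
)

# The same source with a budget shorter than its startup time reads nothing.
short_clock = ManualClock()
short_batch = read_microbatch(
    FakeReader([None, None, "a", "b", None, "c"]),
    budget_ms=15, poll_interval_ms=10,
    now_ms=short_clock.now_ms, sleep_ms=short_clock.sleep_ms,
)
```

With a 50 ms budget the loop collects all three records, while the 15 ms budget returns an empty batch. That mirrors the microbatch problem described above: if the source takes longer to start than the runner is willing to poll, nothing is ever read. A real runner would presumably also cap the number of records per batch; the sketch leaves that out.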