Eugene Kirpichov created BEAM-3323:
--------------------------------------

             Summary: Create a generator of finite-but-unbounded PCollection's 
for integration testing
                 Key: BEAM-3323
                 URL: https://issues.apache.org/jira/browse/BEAM-3323
             Project: Beam
          Issue Type: New Feature
          Components: sdk-java-core
            Reporter: Eugene Kirpichov
            Assignee: Kenneth Knowles


Several IOs have features that exhibit nontrivial behavior when writing 
unbounded PCollection's - e.g. WriteFiles with windowed writes; BigQueryIO. We 
need to be able to write integration tests for these features.

Currently we have two ways to generate an unbounded PCollection without reading 
from a real-world external streaming system such as pubsub or kafka:

1) TestStream, which only works in direct runner - sufficient for some tests 
but not all: definitely not sufficient for large-scale tests or for tests that 
need to interact with a real instance of the external system (e.g. BigQueryIO). 
It is also quite verbose to use.
2) GenerateSequence.from(0) without a .to(), which returns an infinite amount 
of data.

GenerateSequence.from(a).to(b) returns a finite amount of data, but returns it 
as a bounded PCollection, and doesn't report the watermark.

I think the right thing to do here, for now, is to make 
GenerateSequence.from(a).to(b) have an option (e.g. ".asUnbounded()", where it 
will return an unbounded PCollection, go through UnboundedSource (or 
potentially via SDF in runners that support it), and track the watermark 
properly (or via a configurable watermark fn).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to