[ 
https://issues.apache.org/jira/browse/BEAM-3323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16739453#comment-16739453
 ] 

Lukasz Gajowy commented on BEAM-3323:
-------------------------------------

Currently, we have both UnboundedSyntheticSource that generates (in a 
configurable way) a PCollection<KV<byte[], byte[]> of deterministic data ([link 
to source|https://github.com/apache/beam/tree/master/sdks/java/io/synthetic]). 
Maybe it could be used for the above-mentioned purposes?

CC: [~kenn], [~iemejia] 

> Create a generator of finite-but-unbounded PCollection's for integration 
> testing
> --------------------------------------------------------------------------------
>
>                 Key: BEAM-3323
>                 URL: https://issues.apache.org/jira/browse/BEAM-3323
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-core
>            Reporter: Eugene Kirpichov
>            Priority: Major
>
> Several IOs have features that exhibit nontrivial behavior when writing 
> unbounded PCollection's - e.g. WriteFiles with windowed writes; BigQueryIO. 
> We need to be able to write integration tests for these features.
> Currently we have two ways to generate an unbounded PCollection without 
> reading from a real-world external streaming system such as pubsub or kafka:
> 1) TestStream, which only works in direct runner - sufficient for some tests 
> but not all: definitely not sufficient for large-scale tests or for tests 
> that need to interact with a real instance of the external system (e.g. 
> BigQueryIO). It is also quite verbose to use.
> 2) GenerateSequence.from(0) without a .to(), which returns an infinite amount 
> of data.
> GenerateSequence.from(a).to(b) returns a finite amount of data, but returns 
> it as a bounded PCollection, and doesn't report the watermark.
> I think the right thing to do here, for now, is to make 
> GenerateSequence.from(a).to(b) have an option (e.g. ".asUnbounded()", where 
> it will return an unbounded PCollection, go through UnboundedSource (or 
> potentially via SDF in runners that support it), and track the watermark 
> properly (or via a configurable watermark fn).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to