[jira] [Commented] (BEAM-3323) Create a generator of finite-but-unbounded PCollection's for integration testing

Kenneth Knowles (JIRA) Mon, 11 Dec 2017 16:23:16 -0800

    [ https://issues.apache.org/jira/browse/BEAM-3323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16286837#comment-16286837 ]


Kenneth Knowles commented on BEAM-3323:
---------------------------------------

+1 to letting users explicitly create finite unbounded PCollections. That's 
the lowest-hanging fruit.

Beyond that, it would be more accurate to say that TestStream is only 
_supported_ by the direct runner. To the extent that implementing it in 
distributed runners is infeasible, we should open a design discussion.

TestStream makes some nondeterministic things deterministic so they are 
testable directly. The obvious alternative is to not test those details (like 
whether an element showed up before/after a watermark-driven timer) directly, 
but only test properties that should be true for all instantiations of the 
nondeterminism. That is all you can do with GenerateSequence alone. Such 
properties are great to have, but it is most useful to weight testing toward 
corner cases.
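
To make "properties that should be true for all instantiations of the 
nondeterminism" concrete, here is a rough sketch in plain Java. It is a toy 
model and not Beam code: the class OrderInsensitiveCheck and its methods are 
hypothetical names, and fixed windowing is approximated by bucketing 
timestamps. It checks that per-window sums come out the same for many 
shuffled arrival orders, which is the kind of order-independent property one 
can still assert without TestStream.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

/** Hypothetical toy model (not a Beam API) of an order-independent property check. */
public class OrderInsensitiveCheck {

    /**
     * Sums values per fixed window. Each input entry is {timestamp, value};
     * the result is keyed by the start of the window the timestamp falls in.
     */
    public static Map<Long, Long> sumPerWindow(List<long[]> timestampedValues, long windowSize) {
        Map<Long, Long> sums = new TreeMap<>();
        for (long[] tv : timestampedValues) {
            long windowStart = tv[0] - Math.floorMod(tv[0], windowSize);
            sums.merge(windowStart, tv[1], Long::sum);
        }
        return sums;
    }

    /**
     * Returns true if the per-window sums are identical across a number of
     * randomly shuffled arrival orders. This samples orderings; it does not
     * enumerate all of them.
     */
    public static boolean holdsForSampledOrders(
            List<long[]> input, long windowSize, int trials, long seed) {
        Map<Long, Long> expected = sumPerWindow(input, windowSize);
        Random random = new Random(seed);
        for (int i = 0; i < trials; i++) {
            List<long[]> shuffled = new ArrayList<>(input);
            Collections.shuffle(shuffled, random);
            if (!sumPerWindow(shuffled, windowSize).equals(expected)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Elements 0..19, each using its own value as its timestamp.
        List<long[]> input = new ArrayList<>();
        for (long i = 0; i < 20; i++) {
            input.add(new long[] {i, i});
        }
        System.out.println(sumPerWindow(input, 10));
        System.out.println(holdsForSampledOrders(input, 10, 100, 42L));
    }
}
```

What a check like this cannot express is the corner case itself, such as an 
element arriving just after a timer fired. Pinning down that kind of ordering 
is what TestStream is for.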

For tests at scale that interact with a real deployment of an IO endpoint, 
we would use our PKB (PerfKit Benchmarker) framework, right?

> Create a generator of finite-but-unbounded PCollection's for integration 
> testing
> --------------------------------------------------------------------------------
>
>                 Key: BEAM-3323
>                 URL: https://issues.apache.org/jira/browse/BEAM-3323
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-core
>            Reporter: Eugene Kirpichov
>            Assignee: Kenneth Knowles
>
> Several IOs have features that exhibit nontrivial behavior when writing 
> unbounded PCollection's - e.g. WriteFiles with windowed writes; BigQueryIO. 
> We need to be able to write integration tests for these features.
> Currently we have two ways to generate an unbounded PCollection without 
> reading from a real-world external streaming system such as pubsub or kafka:
> 1) TestStream, which only works in direct runner - sufficient for some tests 
> but not all: definitely not sufficient for large-scale tests or for tests 
> that need to interact with a real instance of the external system (e.g. 
> BigQueryIO). It is also quite verbose to use.
> 2) GenerateSequence.from(0) without a .to(), which returns an infinite amount 
> of data.
> GenerateSequence.from(a).to(b) returns a finite amount of data, but returns 
> it as a bounded PCollection, and doesn't report the watermark.
> I think the right thing to do here, for now, is to make 
> GenerateSequence.from(a).to(b) have an option (e.g. ".asUnbounded()"), where 
> it will return an unbounded PCollection, go through UnboundedSource (or 
> potentially via SDF in runners that support it), and track the watermark 
> properly (or via a configurable watermark fn).
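
As a rough illustration of the behavior the description asks for, the sketch 
below models a finite source that still reports a watermark. It is plain Java 
and not Beam code: FiniteSequenceReader, END_OF_TIME, and the method names 
are hypothetical, loosely shaped like an unbounded reader, and it assumes 
each element's value doubles as its timestamp. Once the range is exhausted, 
the watermark jumps to the maximum value, which is how a consumer could tell 
that no more data will arrive.

```java
/** Hypothetical toy model (not a Beam class): emits [from, to) and tracks a watermark. */
public class FiniteSequenceReader {

    /** Stand-in for "no more data will ever arrive". */
    public static final long END_OF_TIME = Long.MAX_VALUE;

    private final long to;
    private long next;
    private long current;

    public FiniteSequenceReader(long from, long to) {
        this.next = from;
        this.to = to;
    }

    /** Moves to the next element; returns false once the range is exhausted. */
    public boolean advance() {
        if (next >= to) {
            return false;
        }
        current = next;
        next++;
        return true;
    }

    /** The element most recently produced by a successful advance(). */
    public long getCurrent() {
        return current;
    }

    /**
     * Lower bound on the timestamps of elements still to come. Assumes the
     * element value is its timestamp, so the bound is simply the next value,
     * or END_OF_TIME when nothing remains.
     */
    public long getWatermark() {
        return next >= to ? END_OF_TIME : next;
    }

    public static void main(String[] args) {
        FiniteSequenceReader reader = new FiniteSequenceReader(0, 3);
        while (reader.advance()) {
            System.out.println(reader.getCurrent() + " watermark=" + reader.getWatermark());
        }
        System.out.println("done, watermark at end of time: "
                + (reader.getWatermark() == END_OF_TIME));
    }
}
```

A configurable watermark function, as the description suggests, would replace 
the fixed rule inside getWatermark() in this model.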



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
