[
https://issues.apache.org/jira/browse/BEAM-3323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16286837#comment-16286837
]
Kenneth Knowles commented on BEAM-3323:
---------------------------------------
+1 to allowing users to explicitly create finite unbounded PCollections. That's
the lowest-hanging fruit.
Beyond that, it would be more accurate to say that TestStream is only
_supported_ by the direct runner. To the extent that implementing it in
distributed runners is infeasible, we should open a design discussion.
TestStream makes some nondeterministic things deterministic so they are
testable directly. The obvious alternative is to not test those details (like
whether an element showed up before/after a watermark-driven timer) directly,
but only test properties that should be true for all instantiations of the
nondeterminism. That is all you can do with GenerateSequence alone. It is great
to have such properties available, but it is most useful to weight testing
towards corner cases.
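
As a rough illustration of testing only properties that hold for every
instantiation of the nondeterminism, here is a plain-Java sketch. It uses no
Beam API, and the class and method names are hypothetical. It shuffles the
arrival order of timestamped elements under several seeds and checks that the
per-window sums never change. It cannot say anything about order-dependent
details such as whether an element showed up before or after a timer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical illustration only; this does not use any Beam API.
class OrderInsensitivePropertySketch {

  // Sums values into fixed windows keyed by window start.
  // Each element is a pair {timestamp, value}.
  static Map<Long, Long> sumPerFixedWindow(List<long[]> elements, long windowSize) {
    Map<Long, Long> sums = new TreeMap<>();
    for (long[] e : elements) {
      long windowStart = Math.floorDiv(e[0], windowSize) * windowSize;
      sums.merge(windowStart, e[1], Long::sum);
    }
    return sums;
  }

  // Returns true if the per-window sums are identical for every shuffled
  // arrival order that was tried (one shuffle per seed).
  static boolean holdsForAllOrders(List<long[]> elements, long windowSize, int trials) {
    Map<Long, Long> expected = sumPerFixedWindow(elements, windowSize);
    for (int seed = 0; seed < trials; seed++) {
      List<long[]> shuffled = new ArrayList<>(elements);
      Collections.shuffle(shuffled, new Random(seed));
      if (!expected.equals(sumPerFixedWindow(shuffled, windowSize))) {
        return false;
      }
    }
    return true;
  }

  // Builds the elements from..to-1, using each value as its own timestamp.
  static List<long[]> sequence(long from, long to) {
    List<long[]> out = new ArrayList<>();
    for (long i = from; i < to; i++) {
      out.add(new long[] {i, i});
    }
    return out;
  }
}
```

For example, holdsForAllOrders(sequence(0, 100), 10, 20) exercises twenty
arrival orders of the same hundred elements.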
For tests at scale that interact with a real deployment of an IO endpoint, we
would use our PKB framework, right?
> Create a generator of finite-but-unbounded PCollection's for integration
> testing
> --------------------------------------------------------------------------------
>
> Key: BEAM-3323
> URL: https://issues.apache.org/jira/browse/BEAM-3323
> Project: Beam
> Issue Type: New Feature
> Components: sdk-java-core
> Reporter: Eugene Kirpichov
> Assignee: Kenneth Knowles
>
> Several IOs have features that exhibit nontrivial behavior when writing
> unbounded PCollections - e.g. WriteFiles with windowed writes; BigQueryIO.
> We need to be able to write integration tests for these features.
> Currently we have two ways to generate an unbounded PCollection without
> reading from a real-world external streaming system such as pubsub or kafka:
> 1) TestStream, which only works in direct runner - sufficient for some tests
> but not all: definitely not sufficient for large-scale tests or for tests
> that need to interact with a real instance of the external system (e.g.
> BigQueryIO). It is also quite verbose to use.
> 2) GenerateSequence.from(0) without a .to(), which returns an infinite amount
> of data.
> GenerateSequence.from(a).to(b) returns a finite amount of data, but returns
> it as a bounded PCollection, and doesn't report the watermark.
> I think the right thing to do here, for now, is to give
> GenerateSequence.from(a).to(b) an option (e.g. ".asUnbounded()") with which
> it will return an unbounded PCollection, go through UnboundedSource (or
> potentially via SDF in runners that support it), and track the watermark
> properly (or via a configurable watermark fn).
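
To make the proposal concrete, here is a minimal plain-Java sketch of how a
finite-but-unbounded sequence reader might behave. It only mirrors the general
shape of an unbounded reader (advance, current element, watermark) and uses no
Beam API. The class name, the END_OF_TIME constant, and the choice to use each
value as its own timestamp are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; mirrors the shape of an unbounded reader, no Beam API.
class FiniteUnboundedSequenceSketch {
  // Stand-in for an end-of-time watermark: no further elements will arrive.
  static final long END_OF_TIME = Long.MAX_VALUE;

  private final long to;      // exclusive upper bound of the sequence
  private long next;          // next value to emit
  private long current = -1;  // last emitted value; meaningless before advance()

  FiniteUnboundedSequenceSketch(long from, long to) {
    this.next = from;
    this.to = to;
  }

  // Moves to the next element; returns false once the sequence is exhausted.
  boolean advance() {
    if (next >= to) {
      return false;
    }
    current = next++;
    return true;
  }

  long getCurrent() {
    return current;
  }

  // Assumption: each value is its own event timestamp, so the watermark is
  // the next value to be emitted, and jumps to END_OF_TIME when done.
  long getWatermark() {
    return next >= to ? END_OF_TIME : next;
  }

  // Reads until the watermark reaches END_OF_TIME and returns what was read.
  static List<Long> drain(long from, long to) {
    FiniteUnboundedSequenceSketch reader = new FiniteUnboundedSequenceSketch(from, to);
    List<Long> out = new ArrayList<>();
    while (reader.getWatermark() != END_OF_TIME) {
      if (reader.advance()) {
        out.add(reader.getCurrent());
      }
    }
    return out;
  }
}
```

The point of the sketch is that the consumer stops because the watermark
reached the end of time, not because the source declared itself bounded.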