Hi,

just to clarify, judging from the snippets above: it seems that we can now
run tests that use a native source for data generation, and we can keep
using that form until Timers are supported. Once Timers are available, we
should consider switching to the Impulse + PTransform based solution
(described above) because it is more portable; the current one is dedicated
to Flink only (which is still really cool). Is this correct, or am I missing
something?
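
For concreteness, this is roughly what I picture for the Impulse + PTransform
variant - an untested sketch with made-up names, just to check my
understanding:

    import apache_beam as beam

    class GenerateTestData(beam.DoFn):
        """Expands the single empty impulse element into synthetic records."""

        def __init__(self, num_records):
            self.num_records = num_records

        def process(self, element):
            for i in range(self.num_records):
                yield b'record-%d' % i

    with beam.Pipeline() as p:
        (p
         | 'Impulse' >> beam.Impulse()
         | 'Generate' >> beam.ParDo(GenerateTestData(1000))
         | 'Consume' >> beam.Map(len))

Without Timers such a ParDo can only emit a bounded burst per bundle (or
block inside process()), which, as I understand it, is why the native Flink
source is needed for real streaming right now.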

Łukasz

On Fri, 5 Oct 2018 at 14:04, Maximilian Michels <[email protected]> wrote:

> Thanks for sharing your setup. You're right that we need timers to
> continuously ingest data to the testing pipeline.
>
> Here is the Flink source which generates the data:
>
> https://github.com/mwylde/beam/commit/09c62991773c749bc037cc2b6044896e2d34988a#diff-b2fc8d680d9c1da86ba23345f3bc83d4R42
>
> On 04.10.18 19:31, Thomas Weise wrote:
> > FYI here is an example with native generator for portable Flink runner:
> >
> > https://github.com/mwylde/beam/tree/micah_memory_leak
> >
> > > https://github.com/mwylde/beam/blob/22f7099b071e65a76110ecc5beda0636ca07e101/sdks/python/apache_beam/examples/streaming_leak.py
> > >
> > You can use it to run the portable Flink runner in streaming mode
> > continuously for testing purposes.
> >
> >
> > On Mon, Oct 1, 2018 at 9:50 AM Thomas Weise <[email protected]> wrote:
> >
> >
> >
> >     On Mon, Oct 1, 2018 at 8:29 AM Maximilian Michels <[email protected]> wrote:
> >
> >          > and then have Flink manage the parallelism for stages
> >          > downstream from that?
> >
> >         @Pablo Can you clarify what you mean by that?
> >
> >         Let me paraphrase this just to get a clear understanding. There are
> >         two approaches to test portable streaming pipelines:
> >
> >         a) Use an Impulse followed by a test PTransform which generates
> >         testing data. This is similar to how streaming sources that don't
> >         use the Read Transform work. For basic testing this should work,
> >         even without support for Timers.
> >
> >
> >     AFAIK this works for bounded sources and batch mode of the Flink
> >     runner (staged execution).
> >
> >     For streaming we need small bundles; we cannot have a Python ParDo
> >     that blocks in order to emit records periodically.
> >
> >     (With timers, the ParDo wouldn't block but instead schedule itself
> >     as needed.)
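> >
> >     To illustrate, something like this is what I have in mind - purely an
> >     untested sketch of what the stateful/timer API could look like here,
> >     once the portable runner supports it:
> >
> >         import time
> >         import apache_beam as beam
> >         from apache_beam.transforms.timeutil import TimeDomain
> >         from apache_beam.transforms.userstate import TimerSpec, on_timer
> >
> >         class PeriodicEmitter(beam.DoFn):
> >             # Emits one record per second without blocking in process().
> >             # Needs keyed input, e.g. Impulse | Map(lambda e: ('key', e)).
> >             EMIT_TIMER = TimerSpec('emit', TimeDomain.REAL_TIME)
> >
> >             def process(self, element,
> >                         timer=beam.DoFn.TimerParam(EMIT_TIMER)):
> >                 timer.set(time.time() + 1)  # arm the first firing
> >
> >             @on_timer(EMIT_TIMER)
> >             def emit(self, timer=beam.DoFn.TimerParam(EMIT_TIMER)):
> >                 yield b'tick'
> >                 timer.set(time.time() + 1)  # re-arm to keep generating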
> >
> >         b) Introduce a new URN which gets translated to a native
> >         Flink/Spark/xy testing transform.
> >
> >         We should go for a) as this will make testing easier across
> >         portable runners. We previously discussed that native transforms
> >         will be an option in Beam, but it would be preferable to leave them
> >         out of testing for now.
> >
> >         Thanks,
> >         Max
> >
> >
> >         On 28.09.18 21:14, Thomas Weise wrote:
> >          > Thanks for sharing the link, this looks very promising!
> >          >
> >          > For the synthetic source, if we need a runner-native trigger
> >          > mechanism, then it should probably just emit an empty byte array
> >          > like the impulse implementation does, and everything else could
> >          > be left to SDK-specific transforms that are downstream. We don't
> >          > have support for timers in the portable Flink runner yet. With
> >          > timers, there would be no need for a runner-native URN and it
> >          > could work just like Pablo described.
> >          >
> >          >
> >          > On Fri, Sep 28, 2018 at 3:09 AM Łukasz Gajowy
> >          > <[email protected]> wrote:
> >          >
> >          >     Hi all,
> >          >
> >          >     thank you, Thomas, for starting this discussion, and Pablo
> >          >     for sharing the ideas. FWIW, we discussed this in the context
> >          >     of the Core Beam Transform Load Tests that we are working on
> >          >     right now [1]. If generating synthetic data becomes possible
> >          >     for portable streaming pipelines, we could use it to test
> >          >     Python streaming scenarios.
> >          >
> >          >     [1] https://s.apache.org/GVMa
> >          >
> >          >     On Fri, Sep 28, 2018 at 08:18 Pablo Estrada
> >          >     <[email protected]> wrote:
> >          >
> >          >         Hi Thomas, all,
> >          >         yes, this is quite important for testing, and in fact
> >         I'd think
> >          >         it's important to streamline the insertion of native
> >         sources
> >          >         from different runners to make the current runners
> >         more usable.
> >          >         But that's another topic.
> >          >
> >          >         For generators of synthetic data, I had a couple
> >         ideas (and this
> >          >         will show my limited knowledge about Flink and
> >         Streaming, but oh
> >          >         well):
> >          >
> >          >         - Flink experts: Is it possible to add a pure-Beam
> >         generator
> >          >         that will do something like: Impulse ->
> >         ParDo(generate multiple
> >          >         elements) -> Forced "Write" to Flink (e.g. something
> >         like a
> >          >         reshuffle), and then have Flink manage the
> >         parallelism for
> >          >         stages downstream from that? (see the sketch below)
> >          >
> >          >         - If this is not possible, it may be worth writing some
> >          >         transform in Flink / other runners that can be plugged in
> >          >         by inserting a custom URN. In fact, it may be a good idea
> >          >         to streamline the insertion of native sources for each
> >          >         runner based on some sort of CustomURNTransform()?
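> >          >
> >          >         For the first idea, I'm picturing roughly the following
> >          >         (an untested, hypothetical sketch):
> >          >
> >          >             import apache_beam as beam
> >          >
> >          >             with beam.Pipeline() as p:
> >          >                 (p
> >          >                  | beam.Impulse()
> >          >                  | beam.FlatMap(lambda _: (
> >          >                        b'elem-%d' % i for i in range(1000)))
> >          >                  # Reshuffle is the forced "write": it breaks
> >          >                  # fusion, so the runner can re-parallelize the
> >          >                  # stages that come after it.
> >          >                  | beam.Reshuffle()
> >          >                  | beam.Map(len))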
> >          >
> >          >         I hope I did not butcher those explanations too
> >          >         badly...
> >          >         Best
> >          >         -P.
> >          >
> >          >         On Thu, Sep 27, 2018, 5:55 PM Thomas Weise
> >          >         <[email protected]> wrote:
> >          >
> >          >             There were a few discussions about how we can
> >          >             facilitate testing for portable streaming pipelines
> >          >             with the Flink runner. The problem is that we
> >          >             currently don't have streaming sources in the Python
> >          >             SDK.
> >          >
> >          >             One way to support testing could be a generator that
> >          >             extends the idea of Impulse to provide a Flink native
> >          >             trigger transform, optionally parameterized with an
> >          >             interval and max count.
> >          >
> >          >             Test pipelines could then follow the generator
> >         with a Map
> >          >             function that creates whatever payloads are
> >         desirable.
> >          >
> >          >             Thoughts?
> >          >
> >          >             Thanks,
> >          >             Thomas
> >          >
> >
>
