Re: Portable Flink runner: Generator source for testing

Thomas Weise Thu, 04 Oct 2018 10:32:05 -0700

FYI here is an example with native generator for portable Flink runner:

https://github.com/mwylde/beam/tree/micah_memory_leak
https://github.com/mwylde/beam/blob/22f7099b071e65a76110ecc5beda0636ca07e101/sdks/python/apache_beam/examples/streaming_leak.py


You can use it to run the portable Flink runner in streaming mode
continuously for testing purposes.


On Mon, Oct 1, 2018 at 9:50 AM Thomas Weise <[email protected]> wrote:

>
>
> On Mon, Oct 1, 2018 at 8:29 AM Maximilian Michels <[email protected]> wrote:
>
>> > and then have Flink manage the parallelism for stages downstream from
>> that?@Pablo Can you clarify what you mean by that?
>>
>> Let me paraphrase this just to get a clear understanding. There are two
>> approaches to test portable streaming pipelines:
>>
>> a) Use an Impulse followed by a test PTransform which generates testing
>> data. This is similar to how streaming sources work which don't use the
>> Read Transform. For basic testing this should work, even without support
>> for Timers.
>>
>
> AFAIK this works for bounded sources and batch mode of the Flink runner
> (staged execution).
>
> For streaming we need small bundles, we cannot have a Python ParDo block
> to emit records periodically.
>
> (With timers, the ParDo wouldn't block but instead schedule itself as
> needed.)
>
> b) Introduce a new URN which gets translated to a native Flink/Spark/xy
>> testing transform.
>>
>> We should go for a) as this will make testing easier across portable
>> runners. We previously discussed native transforms will be an option in
>> Beam, but it would be preferable to leave them out of testing for now.
>>
>> Thanks,
>> Max
>>
>>
>> On 28.09.18 21:14, Thomas Weise wrote:
>> > Thanks for sharing the link, this looks very promising!
>> >
>> > For the synthetic source, if we need a runner native trigger mechanism,
>> > then it should probably just emit an empty byte array like the impulse
>> > implementation does, and everything else could be left to SDK specific
>> > transforms that are downstream. We don't have support for timers in the
>> > portable Flink runner yet. With timers, there would not be the need for
>> > a runner native URN and it could work just like Pablo described.
>> >
>> >
>> > On Fri, Sep 28, 2018 at 3:09 AM Łukasz Gajowy <[email protected]
>> > <mailto:[email protected]>> wrote:
>> >
>> >     Hi all,
>> >
>> >     thank you, Thomas, for starting this discussion and Pablo for
>> >     sharing the ideas. FWIW adding here, we discussed this in terms of
>> >     Core Beam Transform Load Tests that we are working on right now [1].
>> >     If generating synthetic data will be possible for portable streaming
>> >     pipelines, we could use it in our work to test Python streaming
>> >     scenarios.
>> >
>> >     [1] _https://s.apache.org/GVMa_
>> >
>> >     pt., 28 wrz 2018 o 08:18 Pablo Estrada <[email protected]
>> >     <mailto:[email protected]>> napisał(a):
>> >
>> >         Hi Thomas, all,
>> >         yes, this is quite important for testing, and in fact I'd think
>> >         it's important to streamline the insertion of native sources
>> >         from different runners to make the current runners more usable.
>> >         But that's another topic.
>> >
>> >         For generators of synthetic data, I had a couple ideas (and this
>> >         will show my limited knowledge about Flink and Streaming, but oh
>> >         well):
>> >
>> >         - Flink experts: Is it possible to add a pure-Beam generator
>> >         that will do something like: Impulse -> ParDo(generate multiple
>> >         elements) -> Forced "Write" to Flink (e.g. something like a
>> >         reshuffle), and then have Flink manage the parallelism for
>> >         stages downstream from that?
>> >
>> >         - If this is not possible, it may be worth writing some
>> >         transform in Flink / other runners that can be plugged in by
>> >         inserting a custom URN. In fact, it may be a good idea to
>> >         streamline the insertion of native sources for each runner based
>> >         on some sort of CustomURNTransform() ?
>> >
>> >         I hope I did not butcher those explanations too badly...
>> >         Best
>> >         -P.
>> >
>> >         On Thu, Sep 27, 2018, 5:55 PM Thomas Weise <[email protected]
>> >         <mailto:[email protected]>> wrote:
>> >
>> >             There were a few discussions how we can facilitate testing
>> >             for portable streaming pipelines with the Flink runner. The
>> >             problem is that we currently don't have streaming sources in
>> >             the Python SDK.
>> >
>> >             One way to support testing could be a generator that extends
>> >             the idea of Impulse to provide a Flink native trigger
>> >             transform, optionally parameterized with an interval and max
>> >             count.
>> >
>> >             Test pipelines could then follow the generator with a Map
>> >             function that creates whatever payloads are desirable.
>> >
>> >             Thoughts?
>> >
>> >             Thanks,
>> >             Thomas
>> >
>>
>

Re: Portable Flink runner: Generator source for testing

Reply via email to