Re: Portable Flink runner: Generator source for testing

Thomas Weise Mon, 08 Oct 2018 09:43:41 -0700

The portable runner does not support metrics yet:
https://s.apache.org/apache-beam-portability-support-table

There is also no JIRA issue referenced in the table; it would be good to
locate or create one.

On Mon, Oct 8, 2018 at 9:11 AM Łukasz Gajowy <[email protected]> wrote:

> Does anyone know the status of metrics support in the portable Flink
> runner? I think such tests need metrics, at least to collect a run-time
> measurement that excludes cluster warm-up time, resource staging time,
> and other things that can distort the actual run time. We should
> probably use the metrics API in some other places as well (as described
> in the above-mentioned proposal).
>
> On Mon, Oct 8, 2018 at 12:12, Maximilian Michels <[email protected]> wrote:
>
>> This is correct. However, the example code is only part of Lyft's code
>> base. Until timer support is done, we would have to do something similar
>> in our code base.
>>
>> On 08.10.18 02:34, Łukasz Gajowy wrote:
>> > Hi,
>> >
>> > just to clarify, judging from the snippets above: it seems that we can
>> > now run tests that use a native source for data generation, and keep
>> > them in this form until Timers are supported. Once Timers are there,
>> > we should consider switching to the Impulse + PTransform based
>> > solution (described above) because it is more portable; the current
>> > one is specific to Flink (which is still really cool). Is this
>> > correct, or am I missing something?
>> >
>> > Łukasz
>> >
>> > On Fri, Oct 5, 2018 at 14:04, Maximilian Michels <[email protected]> wrote:
>> >
>> >     Thanks for sharing your setup. You're right that we need timers to
>> >     continuously ingest data to the testing pipeline.
>> >
>> >     Here is the Flink source which generates the data:
>> >
>> >     https://github.com/mwylde/beam/commit/09c62991773c749bc037cc2b6044896e2d34988a#diff-b2fc8d680d9c1da86ba23345f3bc83d4R42
>> >
>> >     On 04.10.18 19:31, Thomas Weise wrote:
>> >      > FYI, here is an example with a native generator for the portable
>> >      > Flink runner:
>> >      >
>> >      > https://github.com/mwylde/beam/tree/micah_memory_leak
>> >      >
>> >      > https://github.com/mwylde/beam/blob/22f7099b071e65a76110ecc5beda0636ca07e101/sdks/python/apache_beam/examples/streaming_leak.py
>> >      >
>> >      > You can use it to run the portable Flink runner in streaming mode
>> >      > continuously for testing purposes.
>> >      >
>> >      >
>> >      > On Mon, Oct 1, 2018 at 9:50 AM Thomas Weise <[email protected]> wrote:
>> >      >
>> >      >     On Mon, Oct 1, 2018 at 8:29 AM Maximilian Michels <[email protected]> wrote:
>> >      >
>> >      >          > and then have Flink manage the parallelism for stages
>> >      >          > downstream from that?
>> >      >
>> >      >         @Pablo Can you clarify what you mean by that?
>> >      >
>> >      >         Let me paraphrase this just to get a clear understanding.
>> >      >         There are two approaches to test portable streaming
>> >      >         pipelines:
>> >      >
>> >      >         a) Use an Impulse followed by a test PTransform which
>> >      >         generates testing data. This is similar to how streaming
>> >      >         sources work which don't use the Read Transform. For basic
>> >      >         testing this should work, even without support for Timers.
>> >      >
>> >      >
>> >      >     AFAIK this works for bounded sources and batch mode of the
>> >      >     Flink runner (staged execution).
>> >      >
>> >      >     For streaming we need small bundles; we cannot have a Python
>> >      >     ParDo block to emit records periodically.
>> >      >
>> >      >     (With timers, the ParDo wouldn't block but would instead
>> >      >     schedule itself as needed.)
>> >      >
>> >      >         b) Introduce a new URN which gets translated to a native
>> >      >         Flink/Spark/xy testing transform.
>> >      >
>> >      >         We should go for a) as this will make testing easier across
>> >      >         portable runners. We previously discussed that native
>> >      >         transforms will be an option in Beam, but it would be
>> >      >         preferable to leave them out of testing for now.
>> >      >
>> >      >         Thanks,
>> >      >         Max
>> >      >
>> >      >
>> >      >         On 28.09.18 21:14, Thomas Weise wrote:
>> >      >          > Thanks for sharing the link, this looks very promising!
>> >      >          >
>> >      >          > For the synthetic source, if we need a runner native
>> >      >          > trigger mechanism, then it should probably just emit an
>> >      >          > empty byte array like the impulse implementation does,
>> >      >          > and everything else could be left to SDK specific
>> >      >          > transforms that are downstream. We don't have support
>> >      >          > for timers in the portable Flink runner yet. With
>> >      >          > timers, there would not be the need for a runner native
>> >      >          > URN and it could work just like Pablo described.
>> >      >          >
>> >      >          >
>> >      >          > On Fri, Sep 28, 2018 at 3:09 AM Łukasz Gajowy <[email protected]> wrote:
>> >      >          >
>> >      >          >     Hi all,
>> >      >          >
>> >      >          >     thank you, Thomas, for starting this discussion and
>> >      >          >     Pablo for sharing the ideas. FWIW, we discussed this
>> >      >          >     in the context of the Core Beam Transform Load Tests
>> >      >          >     that we are working on right now [1]. If generating
>> >      >          >     synthetic data becomes possible for portable
>> >      >          >     streaming pipelines, we could use it in our work to
>> >      >          >     test Python streaming scenarios.
>> >      >          >
>> >      >          >     [1] https://s.apache.org/GVMa
>> >      >          >
>> >      >          >     On Fri, Sep 28, 2018 at 08:18, Pablo Estrada <[email protected]> wrote:
>> >      >          >
>> >      >          >         Hi Thomas, all,
>> >      >          >         yes, this is quite important for testing, and in
>> >      >          >         fact I'd think it's important to streamline the
>> >      >          >         insertion of native sources from different
>> >      >          >         runners to make the current runners more usable.
>> >      >          >         But that's another topic.
>> >      >          >
>> >      >          >         For generators of synthetic data, I had a couple
>> >      >          >         ideas (and this will show my limited knowledge
>> >      >          >         about Flink and Streaming, but oh well):
>> >      >          >
>> >      >          >         - Flink experts: Is it possible to add a
>> >      >          >         pure-Beam generator that will do something like:
>> >      >          >         Impulse -> ParDo(generate multiple elements) ->
>> >      >          >         Forced "Write" to Flink (e.g. something like a
>> >      >          >         reshuffle), and then have Flink manage the
>> >      >          >         parallelism for stages downstream from that?
>> >      >          >
>> >      >          >         - If this is not possible, it may be worth
>> >      >          >         writing some transform in Flink / other runners
>> >      >          >         that can be plugged in by inserting a custom URN.
>> >      >          >         In fact, it may be a good idea to streamline the
>> >      >          >         insertion of native sources for each runner based
>> >      >          >         on some sort of CustomURNTransform() ?
>> >      >          >
>> >      >          >         I hope I did not butcher those explanations too badly...
>> >      >          >         Best
>> >      >          >         -P.
>> >      >          >
>> >      >          >         On Thu, Sep 27, 2018, 5:55 PM Thomas Weise <[email protected]> wrote:
>> >      >          >
>> >      >          >             There were a few discussions about how we
>> >      >          >             can facilitate testing for portable
>> >      >          >             streaming pipelines with the Flink runner.
>> >      >          >             The problem is that we currently don't have
>> >      >          >             streaming sources in the Python SDK.
>> >      >          >
>> >      >          >             One way to support testing could be a
>> >      >          >             generator that extends the idea of Impulse
>> >      >          >             to provide a Flink native trigger transform,
>> >      >          >             optionally parameterized with an interval
>> >      >          >             and max count.
>> >      >          >
>> >      >          >             Test pipelines could then follow the
>> >      >          >             generator with a Map function that creates
>> >      >          >             whatever payloads are desirable.
>> >      >          >
>> >      >          >             Thoughts?
>> >      >          >
>> >      >          >             Thanks,
>> >      >          >             Thomas
>> >      >          >
>> >      >
>> >
>>
>
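
The generator idea from the start of the thread (a trigger that emits
empty byte arrays like Impulse, optionally parameterized with an interval
and max count, followed by a Map that creates payloads) can be sketched in
plain Python. This is only an illustration under assumptions: the names
generate_triggers, make_payload and run_test_pipeline are hypothetical,
and nothing here uses Beam or Flink APIs or reflects the linked Lyft code.

```python
import itertools
import time
from typing import Callable, Iterator, List, Optional


def generate_triggers(interval_sec: float = 0.0,
                      max_count: Optional[int] = None,
                      sleep: Callable[[float], None] = time.sleep
                      ) -> Iterator[bytes]:
    """Yield one empty byte array per tick, like a repeated Impulse.

    Hypothetical stand-in for a runner-native trigger; with
    max_count=None it never stops (unbounded, streaming-like).
    """
    ticks = itertools.count() if max_count is None else range(max_count)
    for i in ticks:
        if i > 0 and interval_sec > 0:
            sleep(interval_sec)  # wait between consecutive triggers
        yield b""


def make_payload(index: int, trigger: bytes) -> dict:
    """Downstream 'Map' step: build whatever payload the test needs."""
    return {"id": index, "size": len(trigger)}


def run_test_pipeline(max_count: int) -> List[dict]:
    """Chain the trigger and the map step, as a test pipeline would."""
    triggers = generate_triggers(interval_sec=0.0, max_count=max_count)
    return [make_payload(i, t) for i, t in enumerate(triggers)]
```

Keeping the trigger payload empty leaves all data shaping to the
downstream step, which is the split between runner-native trigger and
SDK-specific transforms suggested in the thread.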
