Re: Portable Flink runner: Generator source for testing

2018-10-10 Thread Micah Wylde
I've opened a JIRA for adding the generator source (BEAM-5707) and sent out
a very rough PR (https://github.com/apache/beam/pull/6637). Would
appreciate any feedback.

On Mon, Oct 8, 2018 at 9:43 AM, Thomas Weise  wrote:

> The portable runner does not support metrics yet: https://s.apache.org/
> apache-beam-portability-support-table
>
> There is also no JIRA referenced in the table, would be good to
> locate/create it.
>
> On Mon, Oct 8, 2018 at 9:11 AM Łukasz Gajowy 
> wrote:
>
>> Does anyone know what is the status of metrics support for Flink Portable
>> Runner? I think we need them to be used in such tests to at least collect
>> time metric that does not contain cluster warm up time, staging resources
>> time and other things that can disturb the actual run time metric. We
>> probably should use the metrics API in some other places (as described in
>> the above-mentioned proposal).
>>
>>
>>
>> pon., 8 paź 2018 o 12:12 Maximilian Michels  napisał(a):
>>
>>> This is correct. However, the example code is only part of Lyft's code
>>> base. Until timer support is done, we would have to do something similar
>>> in our code base.
>>>
>>> On 08.10.18 02:34, Łukasz Gajowy wrote:
>>> > Hi,
>>> >
>>> > just to clarify, judging from the above snippets: it seems that we are
>>> > able now to run tests that use a native source for data generation and
>>> > use them in this form until the Timers are supported. When Timers are
>>> > there, we should consider switching to the Impulse + PTransform based
>>> > solution (described above) because it's more portable - the current is
>>> > dedicated to Flink only (which still is really cool). Is this correct
>>> or
>>> > am I missing something?
>>> >
>>> > Łukasz
>>> >
>>> > pt., 5 paź 2018 o 14:04 Maximilian Michels >> > > napisał(a):
>>> >
>>> > Thanks for sharing your setup. You're right that we need timers to
>>> > continuously ingest data to the testing pipeline.
>>> >
>>> > Here is the Flink source which generates the data:
>>> > https://github.com/mwylde/beam/commit/
>>> 09c62991773c749bc037cc2b6044896e2d34988a#diff-
>>> b2fc8d680d9c1da86ba23345f3bc83d4R42
>>> >
>>> > On 04.10.18 19:31, Thomas Weise wrote:
>>> >  > FYI here is an example with native generator for portable Flink
>>> > runner:
>>> >  >
>>> >  > https://github.com/mwylde/beam/tree/micah_memory_leak
>>> >  >
>>> > https://github.com/mwylde/beam/blob/22f7099b071e65a76110ecc5beda06
>>> 36ca07e101/sdks/python/apache_beam/examples/streaming_leak.py
>>> >  >
>>> >  > You can use it to run the portable Flink runner in streaming
>>> mode
>>> >  > continuously for testing purposes.
>>> >  >
>>> >  >
>>> >  > On Mon, Oct 1, 2018 at 9:50 AM Thomas Weise >> > 
>>> >  > >> wrote:
>>> >  >
>>> >  >
>>> >  >
>>> >  > On Mon, Oct 1, 2018 at 8:29 AM Maximilian Michels
>>> > mailto:m...@apache.org>
>>> >  > >> wrote:
>>> >  >
>>> >  >  > and then have Flink manage the parallelism for stages
>>> >  > downstream from that?@Pablo Can you clarify what you
>>> mean
>>> > by that?
>>> >  >
>>> >  > Let me paraphrase this just to get a clear
>>> understanding.
>>> > There
>>> >  > are two
>>> >  > approaches to test portable streaming pipelines:
>>> >  >
>>> >  > a) Use an Impulse followed by a test PTransform which
>>> > generates
>>> >  > testing
>>> >  > data. This is similar to how streaming sources work
>>> which
>>> > don't
>>> >  > use the
>>> >  > Read Transform. For basic testing this should work, even
>>> > without
>>> >  > support
>>> >  > for Timers.
>>> >  >
>>> >  >
>>> >  > AFAIK this works for bounded sources and batch mode of the
>>> Flink
>>> >  > runner (staged execution).
>>> >  >
>>> >  > For streaming we need small bundles, we cannot have a Python
>>> > ParDo
>>> >  > block to emit records periodically.
>>> >  >
>>> >  > (With timers, the ParDo wouldn't block but instead schedule
>>> > itself
>>> >  > as needed.)
>>> >  >
>>> >  > b) Introduce a new URN which gets translated to a native
>>> >  > Flink/Spark/xy
>>> >  > testing transform.
>>> >  >
>>> >  > We should go for a) as this will make testing easier
>>> across
>>> >  > portable
>>> >  > runners. We previously discussed native transforms will
>>> be an
>>> >  > option in
>>> >  > Beam, but it would be preferable to leave them out of
>>> testing
>>> >  > for now.
>>> >  >
>>> >  > Thanks,
>>> >  > Max
>>> >  >
>>> 

Re: Portable Flink runner: Generator source for testing

2018-10-08 Thread Thomas Weise
The portable runner does not support metrics yet:
https://s.apache.org/apache-beam-portability-support-table

There is also no JIRA referenced in the table, would be good to
locate/create it.

On Mon, Oct 8, 2018 at 9:11 AM Łukasz Gajowy 
wrote:

> Does anyone know what is the status of metrics support for Flink Portable
> Runner? I think we need them to be used in such tests to at least collect
> time metric that does not contain cluster warm up time, staging resources
> time and other things that can disturb the actual run time metric. We
> probably should use the metrics API in some other places (as described in
> the above-mentioned proposal).
>
>
>
> pon., 8 paź 2018 o 12:12 Maximilian Michels  napisał(a):
>
>> This is correct. However, the example code is only part of Lyft's code
>> base. Until timer support is done, we would have to do something similar
>> in our code base.
>>
>> On 08.10.18 02:34, Łukasz Gajowy wrote:
>> > Hi,
>> >
>> > just to clarify, judging from the above snippets: it seems that we are
>> > able now to run tests that use a native source for data generation and
>> > use them in this form until the Timers are supported. When Timers are
>> > there, we should consider switching to the Impulse + PTransform based
>> > solution (described above) because it's more portable - the current is
>> > dedicated to Flink only (which still is really cool). Is this correct
>> or
>> > am I missing something?
>> >
>> > Łukasz
>> >
>> > pt., 5 paź 2018 o 14:04 Maximilian Michels > > > napisał(a):
>> >
>> > Thanks for sharing your setup. You're right that we need timers to
>> > continuously ingest data to the testing pipeline.
>> >
>> > Here is the Flink source which generates the data:
>> >
>> https://github.com/mwylde/beam/commit/09c62991773c749bc037cc2b6044896e2d34988a#diff-b2fc8d680d9c1da86ba23345f3bc83d4R42
>> >
>> > On 04.10.18 19:31, Thomas Weise wrote:
>> >  > FYI here is an example with native generator for portable Flink
>> > runner:
>> >  >
>> >  > https://github.com/mwylde/beam/tree/micah_memory_leak
>> >  >
>> >
>> https://github.com/mwylde/beam/blob/22f7099b071e65a76110ecc5beda0636ca07e101/sdks/python/apache_beam/examples/streaming_leak.py
>> >  >
>> >  > You can use it to run the portable Flink runner in streaming mode
>> >  > continuously for testing purposes.
>> >  >
>> >  >
>> >  > On Mon, Oct 1, 2018 at 9:50 AM Thomas Weise > > 
>> >  > >> wrote:
>> >  >
>> >  >
>> >  >
>> >  > On Mon, Oct 1, 2018 at 8:29 AM Maximilian Michels
>> > mailto:m...@apache.org>
>> >  > >> wrote:
>> >  >
>> >  >  > and then have Flink manage the parallelism for stages
>> >  > downstream from that?@Pablo Can you clarify what you mean
>> > by that?
>> >  >
>> >  > Let me paraphrase this just to get a clear understanding.
>> > There
>> >  > are two
>> >  > approaches to test portable streaming pipelines:
>> >  >
>> >  > a) Use an Impulse followed by a test PTransform which
>> > generates
>> >  > testing
>> >  > data. This is similar to how streaming sources work which
>> > don't
>> >  > use the
>> >  > Read Transform. For basic testing this should work, even
>> > without
>> >  > support
>> >  > for Timers.
>> >  >
>> >  >
>> >  > AFAIK this works for bounded sources and batch mode of the
>> Flink
>> >  > runner (staged execution).
>> >  >
>> >  > For streaming we need small bundles, we cannot have a Python
>> > ParDo
>> >  > block to emit records periodically.
>> >  >
>> >  > (With timers, the ParDo wouldn't block but instead schedule
>> > itself
>> >  > as needed.)
>> >  >
>> >  > b) Introduce a new URN which gets translated to a native
>> >  > Flink/Spark/xy
>> >  > testing transform.
>> >  >
>> >  > We should go for a) as this will make testing easier
>> across
>> >  > portable
>> >  > runners. We previously discussed native transforms will
>> be an
>> >  > option in
>> >  > Beam, but it would be preferable to leave them out of
>> testing
>> >  > for now.
>> >  >
>> >  > Thanks,
>> >  > Max
>> >  >
>> >  >
>> >  > On 28.09.18 21:14, Thomas Weise wrote:
>> >  >  > Thanks for sharing the link, this looks very
>> promising!
>> >  >  >
>> >  >  > For the synthetic source, if we need a runner native
>> > trigger
>> >  > mechanism,
>> >  >  > then it should probably just emit an empty byte array
>> like

Re: Portable Flink runner: Generator source for testing

2018-10-08 Thread Łukasz Gajowy
Does anyone know what is the status of metrics support for Flink Portable
Runner? I think we need them to be used in such tests to at least collect
time metric that does not contain cluster warm up time, staging resources
time and other things that can disturb the actual run time metric. We
probably should use the metrics API in some other places (as described in
the above-mentioned proposal).



pon., 8 paź 2018 o 12:12 Maximilian Michels  napisał(a):

> This is correct. However, the example code is only part of Lyft's code
> base. Until timer support is done, we would have to do something similar
> in our code base.
>
> On 08.10.18 02:34, Łukasz Gajowy wrote:
> > Hi,
> >
> > just to clarify, judging from the above snippets: it seems that we are
> > able now to run tests that use a native source for data generation and
> > use them in this form until the Timers are supported. When Timers are
> > there, we should consider switching to the Impulse + PTransform based
> > solution (described above) because it's more portable - the current is
> > dedicated to Flink only (which still is really cool). Is this correct or
> > am I missing something?
> >
> > Łukasz
> >
> > pt., 5 paź 2018 o 14:04 Maximilian Michels  > > napisał(a):
> >
> > Thanks for sharing your setup. You're right that we need timers to
> > continuously ingest data to the testing pipeline.
> >
> > Here is the Flink source which generates the data:
> >
> https://github.com/mwylde/beam/commit/09c62991773c749bc037cc2b6044896e2d34988a#diff-b2fc8d680d9c1da86ba23345f3bc83d4R42
> >
> > On 04.10.18 19:31, Thomas Weise wrote:
> >  > FYI here is an example with native generator for portable Flink
> > runner:
> >  >
> >  > https://github.com/mwylde/beam/tree/micah_memory_leak
> >  >
> >
> https://github.com/mwylde/beam/blob/22f7099b071e65a76110ecc5beda0636ca07e101/sdks/python/apache_beam/examples/streaming_leak.py
> >  >
> >  > You can use it to run the portable Flink runner in streaming mode
> >  > continuously for testing purposes.
> >  >
> >  >
> >  > On Mon, Oct 1, 2018 at 9:50 AM Thomas Weise  > 
> >  > >> wrote:
> >  >
> >  >
> >  >
> >  > On Mon, Oct 1, 2018 at 8:29 AM Maximilian Michels
> > mailto:m...@apache.org>
> >  > >> wrote:
> >  >
> >  >  > and then have Flink manage the parallelism for stages
> >  > downstream from that?@Pablo Can you clarify what you mean
> > by that?
> >  >
> >  > Let me paraphrase this just to get a clear understanding.
> > There
> >  > are two
> >  > approaches to test portable streaming pipelines:
> >  >
> >  > a) Use an Impulse followed by a test PTransform which
> > generates
> >  > testing
> >  > data. This is similar to how streaming sources work which
> > don't
> >  > use the
> >  > Read Transform. For basic testing this should work, even
> > without
> >  > support
> >  > for Timers.
> >  >
> >  >
> >  > AFAIK this works for bounded sources and batch mode of the
> Flink
> >  > runner (staged execution).
> >  >
> >  > For streaming we need small bundles, we cannot have a Python
> > ParDo
> >  > block to emit records periodically.
> >  >
> >  > (With timers, the ParDo wouldn't block but instead schedule
> > itself
> >  > as needed.)
> >  >
> >  > b) Introduce a new URN which gets translated to a native
> >  > Flink/Spark/xy
> >  > testing transform.
> >  >
> >  > We should go for a) as this will make testing easier
> across
> >  > portable
> >  > runners. We previously discussed native transforms will
> be an
> >  > option in
> >  > Beam, but it would be preferable to leave them out of
> testing
> >  > for now.
> >  >
> >  > Thanks,
> >  > Max
> >  >
> >  >
> >  > On 28.09.18 21:14, Thomas Weise wrote:
> >  >  > Thanks for sharing the link, this looks very promising!
> >  >  >
> >  >  > For the synthetic source, if we need a runner native
> > trigger
> >  > mechanism,
> >  >  > then it should probably just emit an empty byte array
> like
> >  > the impulse
> >  >  > implementation does, and everything else could be left
> > to SDK
> >  > specific
> >  >  > transforms that are downstream. We don't have support
> for
> >  > timers in the
> >  >  > portable Flink runner yet. With timers, there would
> not be
> >  > the need 

Re: Portable Flink runner: Generator source for testing

2018-10-08 Thread Maximilian Michels
This is correct. However, the example code is only part of Lyft's code 
base. Until timer support is done, we would have to do something similar 
in our code base.


On 08.10.18 02:34, Łukasz Gajowy wrote:

Hi,

just to clarify, judging from the above snippets: it seems that we are 
able now to run tests that use a native source for data generation and 
use them in this form until the Timers are supported. When Timers are 
there, we should consider switching to the Impulse + PTransform based 
solution (described above) because it's more portable - the current is 
dedicated to Flink only (which still is really cool). Is this correct or 
am I missing something?


Łukasz

pt., 5 paź 2018 o 14:04 Maximilian Michels > napisał(a):


Thanks for sharing your setup. You're right that we need timers to
continuously ingest data to the testing pipeline.

Here is the Flink source which generates the data:

https://github.com/mwylde/beam/commit/09c62991773c749bc037cc2b6044896e2d34988a#diff-b2fc8d680d9c1da86ba23345f3bc83d4R42

On 04.10.18 19:31, Thomas Weise wrote:
 > FYI here is an example with native generator for portable Flink
runner:
 >
 > https://github.com/mwylde/beam/tree/micah_memory_leak
 >

https://github.com/mwylde/beam/blob/22f7099b071e65a76110ecc5beda0636ca07e101/sdks/python/apache_beam/examples/streaming_leak.py
 >
 > You can use it to run the portable Flink runner in streaming mode
 > continuously for testing purposes.
 >
 >
 > On Mon, Oct 1, 2018 at 9:50 AM Thomas Weise mailto:t...@apache.org>
 > >> wrote:
 >
 >
 >
 >     On Mon, Oct 1, 2018 at 8:29 AM Maximilian Michels
mailto:m...@apache.org>
 >     >> wrote:
 >
 >          > and then have Flink manage the parallelism for stages
 >         downstream from that?@Pablo Can you clarify what you mean
by that?
 >
 >         Let me paraphrase this just to get a clear understanding.
There
 >         are two
 >         approaches to test portable streaming pipelines:
 >
 >         a) Use an Impulse followed by a test PTransform which
generates
 >         testing
 >         data. This is similar to how streaming sources work which
don't
 >         use the
 >         Read Transform. For basic testing this should work, even
without
 >         support
 >         for Timers.
 >
 >
 >     AFAIK this works for bounded sources and batch mode of the Flink
 >     runner (staged execution).
 >
 >     For streaming we need small bundles, we cannot have a Python
ParDo
 >     block to emit records periodically.
 >
 >     (With timers, the ParDo wouldn't block but instead schedule
itself
 >     as needed.)
 >
 >         b) Introduce a new URN which gets translated to a native
 >         Flink/Spark/xy
 >         testing transform.
 >
 >         We should go for a) as this will make testing easier across
 >         portable
 >         runners. We previously discussed native transforms will be an
 >         option in
 >         Beam, but it would be preferable to leave them out of testing
 >         for now.
 >
 >         Thanks,
 >         Max
 >
 >
 >         On 28.09.18 21:14, Thomas Weise wrote:
 >          > Thanks for sharing the link, this looks very promising!
 >          >
 >          > For the synthetic source, if we need a runner native
trigger
 >         mechanism,
 >          > then it should probably just emit an empty byte array like
 >         the impulse
 >          > implementation does, and everything else could be left
to SDK
 >         specific
 >          > transforms that are downstream. We don't have support for
 >         timers in the
 >          > portable Flink runner yet. With timers, there would not be
 >         the need for
 >          > a runner native URN and it could work just like Pablo
described.
 >          >
 >          >
 >          > On Fri, Sep 28, 2018 at 3:09 AM Łukasz Gajowy
 >         mailto:lukasz.gaj...@gmail.com>
>
 >          > 
 >                   >
 >          >     Hi all,
 >          >
 >          >     thank you, Thomas, for starting this discussion
and Pablo for
 >          >     sharing the ideas. FWIW adding here, we discussed
this in
 >         terms of
 >          >     Core Beam Transform Load Tests that we are working on
 >         right now [1].
 >          >     If 

Re: Portable Flink runner: Generator source for testing

2018-10-07 Thread Łukasz Gajowy
Hi,

just to clarify, judging from the above snippets: it seems that we are able
now to run tests that use a native source for data generation and use them
in this form until the Timers are supported. When Timers are there, we
should consider switching to the Impulse + PTransform based solution
(described above) because it's more portable - the current is dedicated to
Flink only (which still is really cool). Is this correct or am I missing
something?

Łukasz

pt., 5 paź 2018 o 14:04 Maximilian Michels  napisał(a):

> Thanks for sharing your setup. You're right that we need timers to
> continuously ingest data to the testing pipeline.
>
> Here is the Flink source which generates the data:
>
> https://github.com/mwylde/beam/commit/09c62991773c749bc037cc2b6044896e2d34988a#diff-b2fc8d680d9c1da86ba23345f3bc83d4R42
>
> On 04.10.18 19:31, Thomas Weise wrote:
> > FYI here is an example with native generator for portable Flink runner:
> >
> > https://github.com/mwylde/beam/tree/micah_memory_leak
> >
> https://github.com/mwylde/beam/blob/22f7099b071e65a76110ecc5beda0636ca07e101/sdks/python/apache_beam/examples/streaming_leak.py
> >
> > You can use it to run the portable Flink runner in streaming mode
> > continuously for testing purposes.
> >
> >
> > On Mon, Oct 1, 2018 at 9:50 AM Thomas Weise  > > wrote:
> >
> >
> >
> > On Mon, Oct 1, 2018 at 8:29 AM Maximilian Michels  > > wrote:
> >
> >  > and then have Flink manage the parallelism for stages
> > downstream from that?@Pablo Can you clarify what you mean by
> that?
> >
> > Let me paraphrase this just to get a clear understanding. There
> > are two
> > approaches to test portable streaming pipelines:
> >
> > a) Use an Impulse followed by a test PTransform which generates
> > testing
> > data. This is similar to how streaming sources work which don't
> > use the
> > Read Transform. For basic testing this should work, even without
> > support
> > for Timers.
> >
> >
> > AFAIK this works for bounded sources and batch mode of the Flink
> > runner (staged execution).
> >
> > For streaming we need small bundles, we cannot have a Python ParDo
> > block to emit records periodically.
> >
> > (With timers, the ParDo wouldn't block but instead schedule itself
> > as needed.)
> >
> > b) Introduce a new URN which gets translated to a native
> > Flink/Spark/xy
> > testing transform.
> >
> > We should go for a) as this will make testing easier across
> > portable
> > runners. We previously discussed native transforms will be an
> > option in
> > Beam, but it would be preferable to leave them out of testing
> > for now.
> >
> > Thanks,
> > Max
> >
> >
> > On 28.09.18 21:14, Thomas Weise wrote:
> >  > Thanks for sharing the link, this looks very promising!
> >  >
> >  > For the synthetic source, if we need a runner native trigger
> > mechanism,
> >  > then it should probably just emit an empty byte array like
> > the impulse
> >  > implementation does, and everything else could be left to SDK
> > specific
> >  > transforms that are downstream. We don't have support for
> > timers in the
> >  > portable Flink runner yet. With timers, there would not be
> > the need for
> >  > a runner native URN and it could work just like Pablo
> described.
> >  >
> >  >
> >  > On Fri, Sep 28, 2018 at 3:09 AM Łukasz Gajowy
> > mailto:lukasz.gaj...@gmail.com>
> >  >  > >> wrote:
> >  >
> >  > Hi all,
> >  >
> >  > thank you, Thomas, for starting this discussion and Pablo
> for
> >  > sharing the ideas. FWIW adding here, we discussed this in
> > terms of
> >  > Core Beam Transform Load Tests that we are working on
> > right now [1].
> >  > If generating synthetic data will be possible for
> > portable streaming
> >  > pipelines, we could use it in our work to test Python
> > streaming
> >  > scenarios.
> >  >
> >  > [1] _https://s.apache.org/GVMa_
> >  >
> >  > pt., 28 wrz 2018 o 08:18 Pablo Estrada
> > mailto:pabl...@google.com>
> >  > >>
> > napisał(a):
> >  >
> >  > Hi Thomas, all,
> >  > yes, this is quite important for testing, and in fact
> > I'd think
> >  > it's important to streamline the insertion of native
> > sources
> >  > from different runners to make the current 

Re: Portable Flink runner: Generator source for testing

2018-10-05 Thread Maximilian Michels
Thanks for sharing your setup. You're right that we need timers to 
continuously ingest data to the testing pipeline.


Here is the Flink source which generates the data:
https://github.com/mwylde/beam/commit/09c62991773c749bc037cc2b6044896e2d34988a#diff-b2fc8d680d9c1da86ba23345f3bc83d4R42

On 04.10.18 19:31, Thomas Weise wrote:

FYI here is an example with native generator for portable Flink runner:

https://github.com/mwylde/beam/tree/micah_memory_leak
https://github.com/mwylde/beam/blob/22f7099b071e65a76110ecc5beda0636ca07e101/sdks/python/apache_beam/examples/streaming_leak.py

You can use it to run the portable Flink runner in streaming mode 
continuously for testing purposes.



On Mon, Oct 1, 2018 at 9:50 AM Thomas Weise > wrote:




On Mon, Oct 1, 2018 at 8:29 AM Maximilian Michels mailto:m...@apache.org>> wrote:

 > and then have Flink manage the parallelism for stages
downstream from that?@Pablo Can you clarify what you mean by that?

Let me paraphrase this just to get a clear understanding. There
are two
approaches to test portable streaming pipelines:

a) Use an Impulse followed by a test PTransform which generates
testing
data. This is similar to how streaming sources work which don't
use the
Read Transform. For basic testing this should work, even without
support
for Timers.


AFAIK this works for bounded sources and batch mode of the Flink
runner (staged execution).

For streaming we need small bundles, we cannot have a Python ParDo
block to emit records periodically.

(With timers, the ParDo wouldn't block but instead schedule itself
as needed.)

b) Introduce a new URN which gets translated to a native
Flink/Spark/xy
testing transform.

We should go for a) as this will make testing easier across
portable
runners. We previously discussed native transforms will be an
option in
Beam, but it would be preferable to leave them out of testing
for now.

Thanks,
Max


On 28.09.18 21:14, Thomas Weise wrote:
 > Thanks for sharing the link, this looks very promising!
 >
 > For the synthetic source, if we need a runner native trigger
mechanism,
 > then it should probably just emit an empty byte array like
the impulse
 > implementation does, and everything else could be left to SDK
specific
 > transforms that are downstream. We don't have support for
timers in the
 > portable Flink runner yet. With timers, there would not be
the need for
 > a runner native URN and it could work just like Pablo described.
 >
 >
 > On Fri, Sep 28, 2018 at 3:09 AM Łukasz Gajowy
mailto:lukasz.gaj...@gmail.com>
 > >> wrote:
 >
 >     Hi all,
 >
 >     thank you, Thomas, for starting this discussion and Pablo for
 >     sharing the ideas. FWIW adding here, we discussed this in
terms of
 >     Core Beam Transform Load Tests that we are working on
right now [1].
 >     If generating synthetic data will be possible for
portable streaming
 >     pipelines, we could use it in our work to test Python
streaming
 >     scenarios.
 >
 >     [1] _https://s.apache.org/GVMa_
 >
 >     pt., 28 wrz 2018 o 08:18 Pablo Estrada
mailto:pabl...@google.com>
 >     >>
napisał(a):
 >
 >         Hi Thomas, all,
 >         yes, this is quite important for testing, and in fact
I'd think
 >         it's important to streamline the insertion of native
sources
 >         from different runners to make the current runners
more usable.
 >         But that's another topic.
 >
 >         For generators of synthetic data, I had a couple
ideas (and this
 >         will show my limited knowledge about Flink and
Streaming, but oh
 >         well):
 >
 >         - Flink experts: Is it possible to add a pure-Beam
generator
 >         that will do something like: Impulse ->
ParDo(generate multiple
 >         elements) -> Forced "Write" to Flink (e.g. something
like a
 >         reshuffle), and then have Flink manage the
parallelism for
 >         stages downstream from that?
 >
 >         - If this is not possible, it may be worth writing some
 >         transform in Flink / other runners that can be
plugged in by
 >         inserting a custom URN. In fact, it may be a 

Re: Portable Flink runner: Generator source for testing

2018-10-04 Thread Thomas Weise
FYI here is an example with native generator for portable Flink runner:

https://github.com/mwylde/beam/tree/micah_memory_leak
https://github.com/mwylde/beam/blob/22f7099b071e65a76110ecc5beda0636ca07e101/sdks/python/apache_beam/examples/streaming_leak.py

You can use it to run the portable Flink runner in streaming mode
continuously for testing purposes.


On Mon, Oct 1, 2018 at 9:50 AM Thomas Weise  wrote:

>
>
> On Mon, Oct 1, 2018 at 8:29 AM Maximilian Michels  wrote:
>
>> > and then have Flink manage the parallelism for stages downstream from
>> that?@Pablo Can you clarify what you mean by that?
>>
>> Let me paraphrase this just to get a clear understanding. There are two
>> approaches to test portable streaming pipelines:
>>
>> a) Use an Impulse followed by a test PTransform which generates testing
>> data. This is similar to how streaming sources work which don't use the
>> Read Transform. For basic testing this should work, even without support
>> for Timers.
>>
>
> AFAIK this works for bounded sources and batch mode of the Flink runner
> (staged execution).
>
> For streaming we need small bundles, we cannot have a Python ParDo block
> to emit records periodically.
>
> (With timers, the ParDo wouldn't block but instead schedule itself as
> needed.)
>
> b) Introduce a new URN which gets translated to a native Flink/Spark/xy
>> testing transform.
>>
>> We should go for a) as this will make testing easier across portable
>> runners. We previously discussed native transforms will be an option in
>> Beam, but it would be preferable to leave them out of testing for now.
>>
>> Thanks,
>> Max
>>
>>
>> On 28.09.18 21:14, Thomas Weise wrote:
>> > Thanks for sharing the link, this looks very promising!
>> >
>> > For the synthetic source, if we need a runner native trigger mechanism,
>> > then it should probably just emit an empty byte array like the impulse
>> > implementation does, and everything else could be left to SDK specific
>> > transforms that are downstream. We don't have support for timers in the
>> > portable Flink runner yet. With timers, there would not be the need for
>> > a runner native URN and it could work just like Pablo described.
>> >
>> >
>> > On Fri, Sep 28, 2018 at 3:09 AM Łukasz Gajowy > > > wrote:
>> >
>> > Hi all,
>> >
>> > thank you, Thomas, for starting this discussion and Pablo for
>> > sharing the ideas. FWIW adding here, we discussed this in terms of
>> > Core Beam Transform Load Tests that we are working on right now [1].
>> > If generating synthetic data will be possible for portable streaming
>> > pipelines, we could use it in our work to test Python streaming
>> > scenarios.
>> >
>> > [1] _https://s.apache.org/GVMa_
>> >
>> > pt., 28 wrz 2018 o 08:18 Pablo Estrada > > > napisał(a):
>> >
>> > Hi Thomas, all,
>> > yes, this is quite important for testing, and in fact I'd think
>> > it's important to streamline the insertion of native sources
>> > from different runners to make the current runners more usable.
>> > But that's another topic.
>> >
>> > For generators of synthetic data, I had a couple ideas (and this
>> > will show my limited knowledge about Flink and Streaming, but oh
>> > well):
>> >
>> > - Flink experts: Is it possible to add a pure-Beam generator
>> > that will do something like: Impulse -> ParDo(generate multiple
>> > elements) -> Forced "Write" to Flink (e.g. something like a
>> > reshuffle), and then have Flink manage the parallelism for
>> > stages downstream from that?
>> >
>> > - If this is not possible, it may be worth writing some
>> > transform in Flink / other runners that can be plugged in by
>> > inserting a custom URN. In fact, it may be a good idea to
>> > streamline the insertion of native sources for each runner based
>> > on some sort of CustomURNTransform() ?
>> >
>> > I hope I did not butcher those explanations too badly...
>> > Best
>> > -P.
>> >
>> > On Thu, Sep 27, 2018, 5:55 PM Thomas Weise > > > wrote:
>> >
>> > There were a few discussions how we can facilitate testing
>> > for portable streaming pipelines with the Flink runner. The
>> > problem is that we currently don't have streaming sources in
>> > the Python SDK.
>> >
>> > One way to support testing could be a generator that extends
>> > the idea of Impulse to provide a Flink native trigger
>> > transform, optionally parameterized with an interval and max
>> > count.
>> >
>> > Test pipelines could then follow the generator with a Map
>> > function that creates whatever payloads are desirable.
>> >
>> > Thoughts?
>> >
>> > Thanks,
>> >   

Re: Portable Flink runner: Generator source for testing

2018-10-01 Thread Thomas Weise
On Mon, Oct 1, 2018 at 8:29 AM Maximilian Michels  wrote:

> > and then have Flink manage the parallelism for stages downstream from
> that?@Pablo Can you clarify what you mean by that?
>
> Let me paraphrase this just to get a clear understanding. There are two
> approaches to test portable streaming pipelines:
>
> a) Use an Impulse followed by a test PTransform which generates testing
> data. This is similar to how streaming sources work which don't use the
> Read Transform. For basic testing this should work, even without support
> for Timers.
>

AFAIK this works for bounded sources and batch mode of the Flink runner
(staged execution).

For streaming we need small bundles, we cannot have a Python ParDo block to
emit records periodically.

(With timers, the ParDo wouldn't block but instead schedule itself as
needed.)

b) Introduce a new URN which gets translated to a native Flink/Spark/xy
> testing transform.
>
> We should go for a) as this will make testing easier across portable
> runners. We previously discussed native transforms will be an option in
> Beam, but it would be preferable to leave them out of testing for now.
>
> Thanks,
> Max
>
>
> On 28.09.18 21:14, Thomas Weise wrote:
> > Thanks for sharing the link, this looks very promising!
> >
> > For the synthetic source, if we need a runner native trigger mechanism,
> > then it should probably just emit an empty byte array like the impulse
> > implementation does, and everything else could be left to SDK specific
> > transforms that are downstream. We don't have support for timers in the
> > portable Flink runner yet. With timers, there would not be the need for
> > a runner native URN and it could work just like Pablo described.
> >
> >
> > On Fri, Sep 28, 2018 at 3:09 AM Łukasz Gajowy  > > wrote:
> >
> > Hi all,
> >
> > thank you, Thomas, for starting this discussion and Pablo for
> > sharing the ideas. FWIW adding here, we discussed this in terms of
> > Core Beam Transform Load Tests that we are working on right now [1].
> > If generating synthetic data will be possible for portable streaming
> > pipelines, we could use it in our work to test Python streaming
> > scenarios.
> >
> > [1] _https://s.apache.org/GVMa_
> >
> > pt., 28 wrz 2018 o 08:18 Pablo Estrada  > > napisał(a):
> >
> > Hi Thomas, all,
> > yes, this is quite important for testing, and in fact I'd think
> > it's important to streamline the insertion of native sources
> > from different runners to make the current runners more usable.
> > But that's another topic.
> >
> > For generators of synthetic data, I had a couple ideas (and this
> > will show my limited knowledge about Flink and Streaming, but oh
> > well):
> >
> > - Flink experts: Is it possible to add a pure-Beam generator
> > that will do something like: Impulse -> ParDo(generate multiple
> > elements) -> Forced "Write" to Flink (e.g. something like a
> > reshuffle), and then have Flink manage the parallelism for
> > stages downstream from that?
> >
> > - If this is not possible, it may be worth writing some
> > transform in Flink / other runners that can be plugged in by
> > inserting a custom URN. In fact, it may be a good idea to
> > streamline the insertion of native sources for each runner based
> > on some sort of CustomURNTransform() ?
> >
> > I hope I did not butcher those explanations too badly...
> > Best
> > -P.
> >
> > On Thu, Sep 27, 2018, 5:55 PM Thomas Weise  > > wrote:
> >
> > There were a few discussions how we can facilitate testing
> > for portable streaming pipelines with the Flink runner. The
> > problem is that we currently don't have streaming sources in
> > the Python SDK.
> >
> > One way to support testing could be a generator that extends
> > the idea of Impulse to provide a Flink native trigger
> > transform, optionally parameterized with an interval and max
> > count.
> >
> > Test pipelines could then follow the generator with a Map
> > function that creates whatever payloads are desirable.
> >
> > Thoughts?
> >
> > Thanks,
> > Thomas
> >
>


Re: Portable Flink runner: Generator source for testing

2018-10-01 Thread Maximilian Michels

and then have Flink manage the parallelism for stages downstream from 
that?@Pablo Can you clarify what you mean by that?


Let me paraphrase this just to get a clear understanding. There are two 
approaches to test portable streaming pipelines:


a) Use an Impulse followed by a test PTransform which generates testing 
data. This is similar to how streaming sources work which don't use the 
Read Transform. For basic testing this should work, even without support 
for Timers.


b) Introduce a new URN which gets translated to a native Flink/Spark/xy 
testing transform.


We should go for a) as this will make testing easier across portable 
runners. We previously discussed native transforms will be an option in 
Beam, but it would be preferable to leave them out of testing for now.


Thanks,
Max


On 28.09.18 21:14, Thomas Weise wrote:

Thanks for sharing the link, this looks very promising!

For the synthetic source, if we need a runner native trigger mechanism, 
then it should probably just emit an empty byte array like the impulse 
implementation does, and everything else could be left to SDK specific 
transforms that are downstream. We don't have support for timers in the 
portable Flink runner yet. With timers, there would not be the need for 
a runner native URN and it could work just like Pablo described.



On Fri, Sep 28, 2018 at 3:09 AM Łukasz Gajowy > wrote:


Hi all,

thank you, Thomas, for starting this discussion and Pablo for
sharing the ideas. FWIW adding here, we discussed this in terms of
Core Beam Transform Load Tests that we are working on right now [1].
If generating synthetic data will be possible for portable streaming
pipelines, we could use it in our work to test Python streaming
scenarios.

[1] _https://s.apache.org/GVMa_

pt., 28 wrz 2018 o 08:18 Pablo Estrada mailto:pabl...@google.com>> napisał(a):

Hi Thomas, all,
yes, this is quite important for testing, and in fact I'd think
it's important to streamline the insertion of native sources
from different runners to make the current runners more usable.
But that's another topic.

For generators of synthetic data, I had a couple ideas (and this
will show my limited knowledge about Flink and Streaming, but oh
well):

- Flink experts: Is it possible to add a pure-Beam generator
that will do something like: Impulse -> ParDo(generate multiple
elements) -> Forced "Write" to Flink (e.g. something like a
reshuffle), and then have Flink manage the parallelism for
stages downstream from that?

- If this is not possible, it may be worth writing some
transform in Flink / other runners that can be plugged in by
inserting a custom URN. In fact, it may be a good idea to
streamline the insertion of native sources for each runner based
on some sort of CustomURNTransform() ?

I hope I did not butcher those explanations too badly...
Best
-P.

On Thu, Sep 27, 2018, 5:55 PM Thomas Weise mailto:t...@apache.org>> wrote:

There were a few discussions how we can facilitate testing
for portable streaming pipelines with the Flink runner. The
problem is that we currently don't have streaming sources in
the Python SDK.

One way to support testing could be a generator that extends
the idea of Impulse to provide a Flink native trigger
transform, optionally parameterized with an interval and max
count.

Test pipelines could then follow the generator with a Map
function that creates whatever payloads are desirable.

Thoughts?

Thanks,
Thomas



Re: Portable Flink runner: Generator source for testing

2018-09-28 Thread Łukasz Gajowy
Hi all,

thank you, Thomas, for starting this discussion and Pablo for sharing the
ideas. FWIW adding here, we discussed this in terms of Core Beam Transform
Load Tests that we are working on right now [1]. If generating synthetic
data will be possible for portable streaming pipelines, we could use it in
our work to test Python streaming scenarios.

[1] *https://s.apache.org/GVMa *

pt., 28 wrz 2018 o 08:18 Pablo Estrada  napisał(a):

> Hi Thomas, all,
> yes, this is quite important for testing, and in fact I'd think it's
> important to streamline the insertion of native sources from different
> runners to make the current runners more usable. But that's another topic.
>
> For generators of synthetic data, I had a couple ideas (and this will show
> my limited knowledge about Flink and Streaming, but oh well):
>
> - Flink experts: Is it possible to add a pure-Beam generator that will do
> something like: Impulse -> ParDo(generate multiple elements) -> Forced
> "Write" to Flink (e.g. something like a reshuffle), and then have Flink
> manage the parallelism for stages downstream from that?
>
> - If this is not possible, it may be worth writing some transform in Flink
> / other runners that can be plugged in by inserting a custom URN. In fact,
> it may be a good idea to streamline the insertion of native sources for
> each runner based on some sort of CustomURNTransform() ?
>
> I hope I did not butcher those explanations too badly...
> Best
> -P.
>
> On Thu, Sep 27, 2018, 5:55 PM Thomas Weise  wrote:
>
>> There were a few discussions how we can facilitate testing for portable
>> streaming pipelines with the Flink runner. The problem is that we currently
>> don't have streaming sources in the Python SDK.
>>
>> One way to support testing could be a generator that extends the idea of
>> Impulse to provide a Flink native trigger transform, optionally
>> parameterized with an interval and max count.
>>
>> Test pipelines could then follow the generator with a Map function that
>> creates whatever payloads are desirable.
>>
>> Thoughts?
>>
>> Thanks,
>> Thomas
>>
>>


Re: Portable Flink runner: Generator source for testing

2018-09-28 Thread Pablo Estrada
Hi Thomas, all,
yes, this is quite important for testing, and in fact I'd think it's
important to streamline the insertion of native sources from different
runners to make the current runners more usable. But that's another topic.

For generators of synthetic data, I had a couple ideas (and this will show
my limited knowledge about Flink and Streaming, but oh well):

- Flink experts: Is it possible to add a pure-Beam generator that will do
something like: Impulse -> ParDo(generate multiple elements) -> Forced
"Write" to Flink (e.g. something like a reshuffle), and then have Flink
manage the parallelism for stages downstream from that?

- If this is not possible, it may be worth writing some transform in Flink
/ other runners that can be plugged in by inserting a custom URN. In fact,
it may be a good idea to streamline the insertion of native sources for
each runner based on some sort of CustomURNTransform() ?

I hope I did not butcher those explanations too badly...
Best
-P.

On Thu, Sep 27, 2018, 5:55 PM Thomas Weise  wrote:

> There were a few discussions how we can facilitate testing for portable
> streaming pipelines with the Flink runner. The problem is that we currently
> don't have streaming sources in the Python SDK.
>
> One way to support testing could be a generator that extends the idea of
> Impulse to provide a Flink native trigger transform, optionally
> parameterized with an interval and max count.
>
> Test pipelines could then follow the generator with a Map function that
> creates whatever payloads are desirable.
>
> Thoughts?
>
> Thanks,
> Thomas
>
>


Portable Flink runner: Generator source for testing

2018-09-27 Thread Thomas Weise
There were a few discussions how we can facilitate testing for portable
streaming pipelines with the Flink runner. The problem is that we currently
don't have streaming sources in the Python SDK.

One way to support testing could be a generator that extends the idea of
Impulse to provide a Flink native trigger transform, optionally
parameterized with an interval and max count.

Test pipelines could then follow the generator with a Map function that
creates whatever payloads are desirable.

Thoughts?

Thanks,
Thomas