Contributing to Beam

2019-05-03 Thread Shehzaad Nakhoda
Hello

I’m hoping to work with Reuven Lax (Google) on some enhancements and
existing issues.

I would appreciate the ability to create and assign tickets to myself.

My JIRA ID is shehzaadn.

Thanks in advance!
-- 

*Shehzaad Nakhoda*
Chief Technology Officer

shehz...@venturedive.com
US/Whatsapp: +1 650 208 5107, PK: +92 308 2654179
Skype: shehzaad.nakhoda


Re: Better naming for runner specific options

2019-05-03 Thread Reza Rokni
Great point, Lukasz: worker machine type could be relevant to multiple runners.

Perhaps, for parameters relevant to multiple runners, the doc
could be rephrased to reflect the potential multiple uses. For example,
change the help information to start with a generic reference ("worker type
on the runner") followed by the runner-specific behavior expected for RunnerA,
RunnerB, etc...

But I do worry that, without a prefix, even generic options could cause
confusion. For example, if the use of --network is substantially different
between RunnerA and RunnerB, then the user will only discover this by
reading the help. It also means that a pipeline which is expected to
work both on-premise on RunnerA and in the cloud on RunnerB could fail
because the formats of the values passed to --network differ.
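To make the concern concrete, here is a minimal sketch using plain Python argparse (not actual Beam PipelineOptions code; the flag names are hypothetical). Instead of both runners competing for a single global --network flag, a runner prefix gives each its own option, documented independently:

```python
import argparse

# Hypothetical sketch: each runner registers a prefixed variant of the
# flag instead of sharing one global --network option, so the help text
# can document each runner's expected format separately.
parser = argparse.ArgumentParser()
parser.add_argument("--runner_a_network",
                    help="RunnerA: on-premise network name, e.g. 'default'")
parser.add_argument("--runner_b_network",
                    help="RunnerB: full cloud resource path, "
                         "e.g. 'projects/p/networks/n'")

args = parser.parse_args([
    "--runner_a_network", "default",
    "--runner_b_network", "projects/p/networks/n",
])
assert args.runner_a_network == "default"
assert args.runner_b_network == "projects/p/networks/n"
```

With prefixed names, a pipeline that targets both runners can carry both values at once without the two interpretations of "network" colliding.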

Cheers

Reza

*From: *Kenneth Knowles 
*Date: *Sat, 4 May 2019 at 03:54
*To: *dev

Even though they are in classes named for specific runners, they are not
> namespaced. All PipelineOptions exist in a global namespace, so their names
> need to be very precise.
>
> It is a good point that even though there may be multiple uses for "machine
> type", they are probably not going to both happen at the same time.
>
> If it becomes an issue, another thing we could do would be to add
> namespacing support so options have less spooky action, or at least have a
> way to resolve it when it happens by accident.
>
> Kenn
>
> On Fri, May 3, 2019 at 10:43 AM Chamikara Jayalath 
> wrote:
>
>> Also, we do have runner specific options classes where truly runner
>> specific options can go.
>>
>>
>> https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/options/DataflowPipelineOptions.java
>>
>> https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineOptions.java
>>
>> On Fri, May 3, 2019 at 9:50 AM Ahmet Altay  wrote:
>>
>>> I agree, that is a good point.
>>>
>>> *From: *Lukasz Cwik 
>>> *Date: *Fri, May 3, 2019 at 9:37 AM
>>> *To: *dev
>>>
>>> The concept of a machine type isn't necessarily limited to Dataflow. If
 it made sense for a runner, they could use AWS/Azure machine types as well.

 On Fri, May 3, 2019 at 9:32 AM Ahmet Altay  wrote:

> This idea was discussed in a PR a few months ago, and a JIRA was filed
> as a follow-up [1]. IMO, it makes sense to use a namespace prefix. The
> primary issue here is that such a change will very likely be backward
> incompatible and would be hard to make before the next major version.
>
> [1] https://issues.apache.org/jira/browse/BEAM-6531
>
> *From: *Reza Rokni 
> *Date: *Thu, May 2, 2019 at 8:00 PM
> *To: * 
>
> Hi,
>>
>> Was reading this SO question:
>>
>>
>> https://stackoverflow.com/questions/53833171/googlecloudoptions-doesnt-have-all-options-that-pipeline-options-has
>>
>> And noticed that in
>>
>>
>> https://beam.apache.org/releases/pydoc/2.12.0/_modules/apache_beam/options/pipeline_options.html#WorkerOptions
>>
>> The option is called --worker_machine_type.
>>
>> I wonder if runner-specific options should have the runner in the
>> prefix? Something like --dataflow_worker_machine_type?
>>
>> Cheers
>> Reza
>>
>> --
>>
>> This email may be confidential and privileged. If you received this
>> communication by mistake, please don't forward it to anyone else, please
>> erase all copies and attachments, and please let me know that it has gone
>> to the wrong person.
>>
>> The above terms reflect a potential business arrangement, are
>> provided solely as a basis for further discussion, and are not intended 
>> to
>> be and do not constitute a legally binding obligation. No legally binding
>> obligations will be created, implied, or inferred until an agreement in
>> final form is executed in writing by all parties involved.
>>
>



Re: [ANNOUNCE] New committer announcement: Udi Meiri

2019-05-03 Thread Heejong Lee
Congratulations!

On Fri, May 3, 2019 at 3:53 PM Reza Rokni  wrote:

> Congratulations !
>
> *From: *Reuven Lax 
> *Date: *Sat, 4 May 2019, 06:42
> *To: *dev
>
> Thank you!
>>
>> On Fri, May 3, 2019 at 3:15 PM Ankur Goenka  wrote:
>>
>>> Congratulations Udi!
>>>
>>> On Fri, May 3, 2019 at 3:00 PM Connell O'Callaghan 
>>> wrote:
>>>
 Well done Udi!!! Congratulations and thank you for your
 contributions!!!

 Kenn thank you for sharing!!!

 On Fri, May 3, 2019 at 2:49 PM Yifan Zou  wrote:

> Thanks Udi and congratulations!
>
> On Fri, May 3, 2019 at 2:47 PM Robin Qiu  wrote:
>
>> Congratulations Udi!!!
>>
>> *From: *Ruoyun Huang 
>> *Date: *Fri, May 3, 2019 at 2:39 PM
>> *To: * 
>>
>> Congratulations Udi!
>>>
>>> On Fri, May 3, 2019 at 2:30 PM Ahmet Altay  wrote:
>>>
 Congratulations, Udi!

 *From: *Kyle Weaver 
 *Date: *Fri, May 3, 2019 at 2:11 PM
 *To: * 

 Congratulations Udi! I look forward to sending you all my reviews
> for
> the next month (just kidding :)
>
> Kyle Weaver | Software Engineer | github.com/ibzib |
> kcwea...@google.com | +1650203
>
> On Fri, May 3, 2019 at 1:52 PM Charles Chen 
> wrote:
> >
> > Thank you Udi!
> >
> > On Fri, May 3, 2019, 1:51 PM Aizhamal Nurmamat kyzy <
> aizha...@google.com> wrote:
> >>
> >> Congratulations, Udi! Thank you for all your contributions!!!
> >>
> >> From: Pablo Estrada 
> >> Date: Fri, May 3, 2019 at 1:45 PM
> >> To: dev
> >>
> >>> Thanks Udi and congrats!
> >>>
> >>> On Fri, May 3, 2019 at 1:44 PM Kenneth Knowles <
> k...@apache.org> wrote:
> 
>  Hi all,
> 
>  Please join me and the rest of the Beam PMC in welcoming a
> new committer: Udi Meiri.
> 
>  Udi has been contributing to Beam since late 2017, starting
> with HDFS support in the Python SDK and continuing with a ton of 
> Python
> work. I also will highlight his work on community-building 
> infrastructure,
> including documentation, experiments with ways to find reviewers for 
> pull
> requests, Gradle build work, and analyzing and reducing build times.
> 
>  In consideration of Udi's contributions, the Beam PMC trusts
> Udi with the responsibilities of a Beam committer [1].
> 
>  Thank you, Udi, for your contributions.
> 
>  Kenn
> 
>  [1]
> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>

>>>
>>> --
>>> 
>>> Ruoyun  Huang
>>>
>>>


Kotlin iterator error

2019-05-03 Thread Ankur Goenka
Hi,

A Beam user on Stack Overflow has posted an issue they hit while using the Kotlin SDK:
https://stackoverflow.com/questions/55908999/kotlin-iterable-not-supported-in-apache-beam/55911859#55911859
I am not very familiar with Kotlin, so could someone please take a look?

Thanks,
Ankur


Re: [ANNOUNCE] New committer announcement: Udi Meiri

2019-05-03 Thread Reza Rokni
Congratulations !



Re: [ANNOUNCE] New committer announcement: Udi Meiri

2019-05-03 Thread Rui Wang
Congrats! Thank you for your contributions!

-Rui


Re: [ANNOUNCE] New committer announcement: Udi Meiri

2019-05-03 Thread Chamikara Jayalath
Congrats Udi!


Re: [ANNOUNCE] New committer announcement: Udi Meiri

2019-05-03 Thread Reuven Lax
Thank you!



Re: [ANNOUNCE] New committer announcement: Udi Meiri

2019-05-03 Thread Ankur Goenka
Congratulations Udi!



Re: [ANNOUNCE] New committer announcement: Udi Meiri

2019-05-03 Thread Connell O'Callaghan
Well done Udi!!! Congratulations and thank you for your contributions!!!

Kenn thank you for sharing!!!



Re: [ANNOUNCE] New committer announcement: Udi Meiri

2019-05-03 Thread Yifan Zou
Thanks Udi and congratulations!



Re: [ANNOUNCE] New committer announcement: Udi Meiri

2019-05-03 Thread Robin Qiu
Congratulations Udi!!!



Re: [ANNOUNCE] New committer announcement: Udi Meiri

2019-05-03 Thread Ruoyun Huang
Congratulations Udi!


-- 

Ruoyun  Huang


Re: [DISCUSS][SQL] Providing support for DISTINCT aggregations

2019-05-03 Thread Rui Wang
A compromise solution would be using SELECT DISTINCT or GROUP BY to
deduplicate before applying aggregations. It's two shuffles and works on
non-floating-point columns. The good thing is that no code change is needed,
but the downsides are that users need to write a more complicated query and
floating-point data is not supported.


-Rui
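As a plain-Python illustration of that rewrite (not Beam code; the rows are invented), SELECT k, COUNT(DISTINCT v) FROM t GROUP BY k becomes a COUNT over a SELECT DISTINCT k, v subquery: the first shuffle deduplicates on the composite key, the second groups and aggregates.

```python
from collections import defaultdict

# Invented input rows of (k, v); duplicates are deliberate.
rows = [("a", 1), ("a", 1), ("a", 2), ("b", 3), ("b", 3)]

# First "shuffle": SELECT DISTINCT k, v -- deduplicate on (k, v).
distinct_rows = set(rows)

# Second "shuffle": GROUP BY k and count the deduplicated values,
# which is equivalent to COUNT(DISTINCT v) per key.
counts = defaultdict(int)
for k, _v in distinct_rows:
    counts[k] += 1

assert dict(counts) == {"a": 2, "b": 1}
```

This only works because (k, v) can serve as a grouping key, which is exactly why floating-point value columns fall outside this approach.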

On Fri, May 3, 2019 at 1:23 PM Rui Wang  wrote:

> Fair point. BeamSQL lacks proper benchmarks to test the performance
> and scalability of implementations.
>
>
> -Rui
>
> On Fri, May 3, 2019 at 12:56 PM Reuven Lax  wrote:
>
>> Back to the original point: I'm very skeptical of adding something that
>> does not scale at all. In our experience, users get far more upset with an
>> advertised feature that doesn't work for them (e.g. their workers OOM) than
>> with a missing feature.
>>
>> Reuven
>>
>> On Fri, May 3, 2019 at 12:41 PM Kenneth Knowles  wrote:
>>
>>> All good points. My version of the two shuffle approach does not work at
>>> all.
>>>
>>> On Fri, May 3, 2019 at 11:38 AM Brian Hulette 
>>> wrote:
>>>
 Rui's point about FLOAT/DOUBLE columns is interesting as well. We
 couldn't support distinct aggregations on floating-point columns with the
 two-shuffle approach, but we could with the CombineFn approach. I'm not
 sure whether that's a good thing; it seems like an anti-pattern to do a
 distinct aggregation on floating-point numbers, but I suppose the spec
 allows it.

>>>
>>> I can't find the Jira, but grouping on doubles has been discussed at
>>> some length before. Many DBMSs do not provide this, so it is not generally
>>> expected by SQL users. That is good, because mathematically it is
>>> questionable - floating point is usually used as a stand-in for real
>>> numbers, where computing equality is not generally possible. So any code
>>> that actually depends on equality of floating points is likely susceptible
>>> to rounding errors, other quirks of floating point, and also is probably
>>> misguided because the underlying thing that floats are approximating
>>> already cannot be checked for equality.
>>>
>>> Kenn
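Kenn's point about equality on doubles can be seen in a generic one-line Python example (not Beam-specific):

```python
# Two mathematically equal expressions compare unequal under IEEE 754
# doubles, so rows that "should" fall in the same group may not.
x = 0.1 + 0.2
y = 0.3
assert x != y                 # x is 0.30000000000000004
assert abs(x - y) < 1e-9      # they differ only by rounding error
```

Any grouping or deduplication keyed on such values inherits this fragility, which is why many DBMSs simply refuse to group on floating-point columns.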
>>>
>>>

 Brian


 On Fri, May 3, 2019 at 10:52 AM Rui Wang  wrote:

> To clarify what I said "So two shuffle approach will lead to two
> different implementation for tables with and without FLOAT/DOUBLE 
> column.":
>
> Basically, I wanted to say that the two-shuffle approach will be an
> implementation for some cases, and it will co-exist with the CombineFn
> approach. In the future, when we start cost-based optimization in
> BeamSQL, the CBO is supposed to compare the different plans.
>
> -Rui
>
> On Fri, May 3, 2019 at 10:40 AM Rui Wang  wrote:
>
>>
>>> As to the distinct aggregations: At the least, these queries should
>>> be rejected, not evaluated incorrectly.
>>>
>>
>> Yes. The least is not to support it and to throw a clear message saying
>> so (the current implementation ignores DISTINCT and executes all
>> aggregations as ALL).
>>
>>
>>> The term "stateful CombineFn" is not one I would use, as the nature
>>> of state is linearity and the nature of CombineFn is parallelism. So I
>>> don't totally understand this proposal. If I replace stateful CombineFn
>>> with stateful DoFn with one combining state per column, then I think I
>>> understand. FWIW on a runner with scalable SetState or MapState it will 
>>> not
>>> be any risk at all.
>>>
>> I see. "Stateful" is indeed misleading. In this thread, it was all
>> about using a simple CombineFn to achieve DISTINCT aggregation with massive
>> parallelism.
>>
>> But if you go the two shuffle route, you don't have to separate the
>>> aggregations and re-join them. You just have to incur the cost of the 
>>> GBK +
>>> DISTINCT for all columns, and just drop the secondary key for the second
>>> shuffle, no?
>>>
>> The two-shuffle approach cannot be the unified approach because it
>> requires building a key of group_by_key + table_row to deduplicate, but
>> table_row might contain floating-point numbers, which cannot be used as a
>> key in GBK. So the two-shuffle approach would lead to two different
>> implementations for tables with and without FLOAT/DOUBLE columns.
>>
>> CombineFn is the unified approach for distinct and non-distinct
>> aggregations: each aggregation call will be a CombineFn.
>>
>>
>> -Rui
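To illustrate the shape of that unified approach, here is a minimal plain-Python sketch (the function names mirror a CombineFn's lifecycle but are illustrative, not the actual Beam API): the accumulator is a set, merging is set union, and output extraction applies the aggregation to the deduplicated values.

```python
# Sketch of a set-accumulating combiner computing SUM(DISTINCT v).
def create_accumulator():
    return set()

def add_input(acc, value):
    acc.add(value)          # duplicates collapse in the set
    return acc

def merge_accumulators(accs):
    merged = set()
    for a in accs:          # merging parallel partial results is set union
        merged |= a
    return merged

def extract_output(acc):
    return sum(acc)         # aggregate over the deduplicated values

# Simulate two parallel bundles whose partial accumulators get merged.
a1 = add_input(add_input(create_accumulator(), 1), 2)
a2 = add_input(add_input(create_accumulator(), 2), 3)
result = extract_output(merge_accumulators([a1, a2]))
assert result == 6  # sum of distinct values {1, 2, 3}
```

The scalability concern raised earlier is visible here too: the accumulator grows with the number of distinct values, which is exactly what SetState/MapState support on a runner would mitigate.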
>>
>>
>>
>>> Kenn
>>>
>>> On Thu, May 2, 2019 at 5:11 PM Ahmet Altay  wrote:
>>>


 On Thu, May 2, 2019 at 2:18 PM Rui Wang  wrote:

> Brian's first proposal is challenging also partially because in
> BeamSQL there is no good practice to deal with complex SQL plans. 
> Ideally
> we need enough rules and SQL plan node in Beam to construct
> easy-to-transform plans for different 

Re: [ANNOUNCE] New committer announcement: Udi Meiri

2019-05-03 Thread Ahmet Altay
Congratulations, Udi!



Re: [ANNOUNCE] New committer announcement: Udi Meiri

2019-05-03 Thread Kyle Weaver
Congratulations Udi! I look forward to sending you all my reviews for
the next month (just kidding :)

Kyle Weaver | Software Engineer | github.com/ibzib |
kcwea...@google.com | +1650203

On Fri, May 3, 2019 at 1:52 PM Charles Chen  wrote:
>
> Thank you Udi!
>
> On Fri, May 3, 2019, 1:51 PM Aizhamal Nurmamat kyzy  
> wrote:
>>
>> Congratulations, Udi! Thank you for all your contributions!!!
>>
>> From: Pablo Estrada 
>> Date: Fri, May 3, 2019 at 1:45 PM
>> To: dev
>>
>>> Thanks Udi and congrats!
>>>
>>> On Fri, May 3, 2019 at 1:44 PM Kenneth Knowles  wrote:

 Hi all,

 Please join me and the rest of the Beam PMC in welcoming a new committer: 
 Udi Meiri.

 Udi has been contributing to Beam since late 2017, starting with HDFS 
 support in the Python SDK and continuing with a ton of Python work. I also 
 will highlight his work on community-building infrastructure, including 
 documentation, experiments with ways to find reviewers for pull requests, 
 gradle build work, analyzing and reducing build times.

 In consideration of Udi's contributions, the Beam PMC trusts Udi with the 
 responsibilities of a Beam committer [1].

 Thank you, Udi, for your contributions.

 Kenn

 [1] 
 https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer


Re: [ANNOUNCE] New committer announcement: Udi Meiri

2019-05-03 Thread Charles Chen
Thank you Udi!

On Fri, May 3, 2019, 1:51 PM Aizhamal Nurmamat kyzy 
wrote:

> Congratulations, Udi! Thank you for all your contributions!!!
>
> *From: *Pablo Estrada 
> *Date: *Fri, May 3, 2019 at 1:45 PM
> *To: *dev
>
> Thanks Udi and congrats!
>>
>> On Fri, May 3, 2019 at 1:44 PM Kenneth Knowles  wrote:
>>
>>> Hi all,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming a new committer:
>>> Udi Meiri.
>>>
>>> Udi has been contributing to Beam since late 2017, starting with HDFS
>>> support in the Python SDK and continuing with a ton of Python work. I also
>>> will highlight his work on community-building infrastructure, including
>>> documentation, experiments with ways to find reviewers for pull requests,
>>> gradle build work, analyzing and reducing build times.
>>>
>>> In consideration of Udi's contributions, the Beam PMC trusts Udi with
>>> the responsibilities of a Beam committer [1].
>>>
>>> Thank you, Udi, for your contributions.
>>>
>>> Kenn
>>>
>>> [1] https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>>
>>


Re: [ANNOUNCE] New committer announcement: Udi Meiri

2019-05-03 Thread Aizhamal Nurmamat kyzy
Congratulations, Udi! Thank you for all your contributions!!!

*From: *Pablo Estrada 
*Date: *Fri, May 3, 2019 at 1:45 PM
*To: *dev

Thanks Udi and congrats!
>
> On Fri, May 3, 2019 at 1:44 PM Kenneth Knowles  wrote:
>
>> Hi all,
>>
>> Please join me and the rest of the Beam PMC in welcoming a new committer:
>> Udi Meiri.
>>
>> Udi has been contributing to Beam since late 2017, starting with HDFS
>> support in the Python SDK and continuing with a ton of Python work. I also
>> will highlight his work on community-building infrastructure, including
>> documentation, experiments with ways to find reviewers for pull requests,
>> gradle build work, analyzing and reducing build times.
>>
>> In consideration of Udi's contributions, the Beam PMC trusts Udi with the
>> responsibilities of a Beam committer [1].
>>
>> Thank you, Udi, for your contributions.
>>
>> Kenn
>>
>> [1] https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>
>


Re: [ANNOUNCE] New committer announcement: Udi Meiri

2019-05-03 Thread Pablo Estrada
Thanks Udi and congrats!

On Fri, May 3, 2019 at 1:44 PM Kenneth Knowles  wrote:

> Hi all,
>
> Please join me and the rest of the Beam PMC in welcoming a new committer:
> Udi Meiri.
>
> Udi has been contributing to Beam since late 2017, starting with HDFS
> support in the Python SDK and continuing with a ton of Python work. I also
> will highlight his work on community-building infrastructure, including
> documentation, experiments with ways to find reviewers for pull requests,
> gradle build work, analyzing and reducing build times.
>
> In consideration of Udi's contributions, the Beam PMC trusts Udi with the
> responsibilities of a Beam committer [1].
>
> Thank you, Udi, for your contributions.
>
> Kenn
>
> [1] https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>


[ANNOUNCE] New committer announcement: Udi Meiri

2019-05-03 Thread Kenneth Knowles
Hi all,

Please join me and the rest of the Beam PMC in welcoming a new committer:
Udi Meiri.

Udi has been contributing to Beam since late 2017, starting with HDFS
support in the Python SDK and continuing with a ton of Python work. I also
will highlight his work on community-building infrastructure, including
documentation, experiments with ways to find reviewers for pull requests,
gradle build work, analyzing and reducing build times.

In consideration of Udi's contributions, the Beam PMC trusts Udi with the
responsibilities of a Beam committer [1].

Thank you, Udi, for your contributions.

Kenn

[1] https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer


Re: [DISCUSS][SQL] Providing support for DISTINCT aggregations

2019-05-03 Thread Rui Wang
Fair point. BeamSQL lacks proper benchmarks to test the performance and
scalability of implementations.


-Rui

On Fri, May 3, 2019 at 12:56 PM Reuven Lax  wrote:

> Back to the original point: I'm very skeptical of adding something that
> does not scale at all. In our experience, users get far more upset with an
> advertised feature that doesn't work for them (e.g. their workers OOM) than
> with a missing feature.
>
> Reuven
>
> On Fri, May 3, 2019 at 12:41 PM Kenneth Knowles  wrote:
>
>> All good points. My version of the two shuffle approach does not work at
>> all.
>>
>> On Fri, May 3, 2019 at 11:38 AM Brian Hulette 
>> wrote:
>>
>>> Rui's point about FLOAT/DOUBLE columns is interesting as well. We
>>> couldn't support distinct aggregations on floating point columns with the
>>> two-shuffle approach, but we could with the CombineFn approach. I'm not
>>> sure if that's a good thing or not, it seems like an anti-pattern to do a
>>> distinct aggregation on floating point numbers but I suppose the spec
>>> allows it.
>>>
>>
>> I can't find the Jira, but grouping on doubles has been discussed at some
>> length before. Many DBMSs do not provide this, so it is not generally
>> expected by SQL users. That is good, because mathematically it is
>> questionable - floating point is usually used as a stand-in for real
>> numbers, where computing equality is not generally possible. So any code
>> that actually depends on equality of floating points is likely susceptible
>> to rounding errors, other quirks of floating point, and also is probably
>> misguided because the underlying thing that floats are approximating
>> already cannot be checked for equality.
>>
>> Kenn
>>
>>
>>>
>>> Brian
>>>
>>>
>>> On Fri, May 3, 2019 at 10:52 AM Rui Wang  wrote:
>>>
 To clarify what I said "So the two-shuffle approach will lead to two
 different implementations for tables with and without FLOAT/DOUBLE columns.":

 Basically I wanted to say that the two-shuffle approach will be an
 implementation for some cases, and it will co-exist with the CombineFn
 approach. In the future, when we start cost-based optimization in
 BeamSQL, CBO is supposed to compare different plans.

 -Rui

 On Fri, May 3, 2019 at 10:40 AM Rui Wang  wrote:

>
>> As to the distinct aggregations: At the least, these queries should
>> be rejected, not evaluated incorrectly.
>>
>
> Yes. The least we can do is not support it and throw a clear message
> saying so. (The current implementation ignores DISTINCT and executes all
> aggregations as ALL.)
>
>
>> The term "stateful CombineFn" is not one I would use, as the nature
>> of state is linearity and the nature of CombineFn is parallelism. So I
>> don't totally understand this proposal. If I replace stateful CombineFn
>> with stateful DoFn with one combining state per column, then I think I
>> understand. FWIW on a runner with scalable SetState or MapState it will 
>> not
>> be any risk at all.
>>
>> I see. "Stateful" is indeed misleading. In this thread, it was all
> about using simple CombineFn to achieve DISTINCT aggregation with massive
> parallelism.
>
> But if you go the two shuffle route, you don't have to separate the
>> aggregations and re-join them. You just have to incur the cost of the 
>> GBK +
>> DISTINCT for all columns, and just drop the secondary key for the second
>> shuffle, no?
>>
>> The two-shuffle approach cannot be the unified approach because it
> requires building a key of group_by_key + table_row to deduplicate, but
> table_row might contain floating-point numbers, which cannot be used as a
> key in GBK. So the two-shuffle approach will lead to two different
> implementations for tables with and without FLOAT/DOUBLE columns.
>
> CombineFn is the unified approach for distinct and non-distinct
> aggregation: each aggregation call will be a CombineFn.
>
>
> -Rui
>
>
>
>> Kenn
>>
>> On Thu, May 2, 2019 at 5:11 PM Ahmet Altay  wrote:
>>
>>>
>>>
>>> On Thu, May 2, 2019 at 2:18 PM Rui Wang  wrote:
>>>
 Brian's first proposal is also challenging, partially because in
 BeamSQL there is no good practice for dealing with complex SQL plans.
 Ideally we need enough rules and SQL plan nodes in Beam to construct
 easy-to-transform plans for different cases. I was in a similar situation
 before when I needed to separate the logical plans of "JOIN ON a OR b" and
 "JOIN ON a AND b", because their implementations are too different to fit
 into the same JoinRelNode. Similarly, when both distinct and non-distinct
 aggregations are mixed, a single AggregationRelNode can hardly encapsulate
 the complex logic.

 We will need a detailed plan to 

Re: [DISCUSS][SQL] Providing support for DISTINCT aggregations

2019-05-03 Thread Reuven Lax
Back to the original point: I'm very skeptical of adding something that
does not scale at all. In our experience, users get far more upset with an
advertised feature that doesn't work for them (e.g. their workers OOM) than
with a missing feature.

Reuven

On Fri, May 3, 2019 at 12:41 PM Kenneth Knowles  wrote:

> All good points. My version of the two shuffle approach does not work at
> all.
>
> On Fri, May 3, 2019 at 11:38 AM Brian Hulette  wrote:
>
>> Rui's point about FLOAT/DOUBLE columns is interesting as well. We
>> couldn't support distinct aggregations on floating point columns with the
>> two-shuffle approach, but we could with the CombineFn approach. I'm not
>> sure if that's a good thing or not, it seems like an anti-pattern to do a
>> distinct aggregation on floating point numbers but I suppose the spec
>> allows it.
>>
>
> I can't find the Jira, but grouping on doubles has been discussed at some
> length before. Many DBMSs do not provide this, so it is not generally
> expected by SQL users. That is good, because mathematically it is
> questionable - floating point is usually used as a stand-in for real
> numbers, where computing equality is not generally possible. So any code
> that actually depends on equality of floating points is likely susceptible
> to rounding errors, other quirks of floating point, and also is probably
> misguided because the underlying thing that floats are approximating
> already cannot be checked for equality.
>
> Kenn
>
>
>>
>> Brian
>>
>>
>> On Fri, May 3, 2019 at 10:52 AM Rui Wang  wrote:
>>
>>> To clarify what I said "So the two-shuffle approach will lead to two
>>> different implementations for tables with and without FLOAT/DOUBLE columns.":
>>>
>>> Basically I wanted to say that the two-shuffle approach will be an
>>> implementation for some cases, and it will co-exist with the CombineFn
>>> approach. In the future, when we start cost-based optimization in
>>> BeamSQL, CBO is supposed to compare different plans.
>>>
>>> -Rui
>>>
>>> On Fri, May 3, 2019 at 10:40 AM Rui Wang  wrote:
>>>

> As to the distinct aggregations: At the least, these queries should be
> rejected, not evaluated incorrectly.
>

 Yes. The least we can do is not support it and throw a clear message
 saying so. (The current implementation ignores DISTINCT and executes all
 aggregations as ALL.)


> The term "stateful CombineFn" is not one I would use, as the nature of
> state is linearity and the nature of CombineFn is parallelism. So I don't
> totally understand this proposal. If I replace stateful CombineFn with
> stateful DoFn with one combining state per column, then I think I
> understand. FWIW on a runner with scalable SetState or MapState it will 
> not
> be any risk at all.
>
> I see. "Stateful" is indeed misleading. In this thread, it was all
 about using simple CombineFn to achieve DISTINCT aggregation with massive
 parallelism.

 But if you go the two shuffle route, you don't have to separate the
> aggregations and re-join them. You just have to incur the cost of the GBK 
> +
> DISTINCT for all columns, and just drop the secondary key for the second
> shuffle, no?
>
> The two-shuffle approach cannot be the unified approach because it
 requires building a key of group_by_key + table_row to deduplicate, but
 table_row might contain floating-point numbers, which cannot be used as a key
 in GBK. So the two-shuffle approach will lead to two different implementations
 for tables with and without FLOAT/DOUBLE columns.

 CombineFn is the unified approach for distinct and non-distinct
 aggregation: each aggregation call will be a CombineFn.


 -Rui



> Kenn
>
> On Thu, May 2, 2019 at 5:11 PM Ahmet Altay  wrote:
>
>>
>>
>> On Thu, May 2, 2019 at 2:18 PM Rui Wang  wrote:
>>
>>> Brian's first proposal is also challenging, partially because in
>>> BeamSQL there is no good practice for dealing with complex SQL plans.
>>> Ideally we need enough rules and SQL plan nodes in Beam to construct
>>> easy-to-transform plans for different cases. I was in a similar situation
>>> before when I needed to separate the logical plans of "JOIN ON a OR b" and
>>> "JOIN ON a AND b", because their implementations are too different to fit
>>> into the same JoinRelNode. Similarly, when both distinct and non-distinct
>>> aggregations are mixed, a single AggregationRelNode can hardly encapsulate
>>> the complex logic.
>>>
>>> We will need a detailed plan to rethink the RelNodes and Rules in
>>> BeamSQL, which is out of scope for supporting DISTINCT.
>>>
>>> I would favor the second proposal because BeamSQL uses Beam schema
>>> and row. Schema itself uses Java primitives for most of its types (int,
>>> long, float, etc.). It limits the size of each 

Re: Better naming for runner specific options

2019-05-03 Thread Kenneth Knowles
Even though they are in classes named for specific runners, they are not
namespaced. All PipelineOptions exist in a global namespace so they need to
be careful to be very precise.

It is a good point that even though there may be multiple uses for "machine
type", they are probably not both going to happen at the same time.

If it becomes an issue, another thing we could do would be to add
namespacing support so options have less spooky action, or at least have a
way to resolve it when it happens by accident.

Kenn

On Fri, May 3, 2019 at 10:43 AM Chamikara Jayalath 
wrote:

> Also, we do have runner specific options classes where truly runner
> specific options can go.
>
>
> https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/options/DataflowPipelineOptions.java
>
> https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineOptions.java
>
> On Fri, May 3, 2019 at 9:50 AM Ahmet Altay  wrote:
>
>> I agree, that is a good point.
>>
>> *From: *Lukasz Cwik 
>> *Date: *Fri, May 3, 2019 at 9:37 AM
>> *To: *dev
>>
>> The concept of a machine type isn't necessarily limited to Dataflow. If
>>> it made sense for a runner, they could use AWS/Azure machine types as well.
>>>
>>> On Fri, May 3, 2019 at 9:32 AM Ahmet Altay  wrote:
>>>
 This idea was discussed in a PR a few months ago, and a JIRA was filed as
 a follow-up [1]. IMO, it makes sense to use a namespace prefix. The primary
 issue here is that, such a change will very likely be a backward
 incompatible change and would be hard to do before the next major version.

 [1] https://issues.apache.org/jira/browse/BEAM-6531

 *From: *Reza Rokni 
 *Date: *Thu, May 2, 2019 at 8:00 PM
 *To: * 

 Hi,
>
> Was reading this SO question:
>
>
> https://stackoverflow.com/questions/53833171/googlecloudoptions-doesnt-have-all-options-that-pipeline-options-has
>
> And noticed that in
>
>
> https://beam.apache.org/releases/pydoc/2.12.0/_modules/apache_beam/options/pipeline_options.html#WorkerOptions
>
> The option is called --worker_machine_type.
>
> I wonder if runner specific options should have the runner in the
> prefix? Something like --dataflow_worker_machine_type?
>
> Cheers
> Reza
>
> --
>
> This email may be confidential and privileged. If you received this
> communication by mistake, please don't forward it to anyone else, please
> erase all copies and attachments, and please let me know that it has gone
> to the wrong person.
>
> The above terms reflect a potential business arrangement, are provided
> solely as a basis for further discussion, and are not intended to be and 
> do
> not constitute a legally binding obligation. No legally binding 
> obligations
> will be created, implied, or inferred until an agreement in final form is
> executed in writing by all parties involved.
>
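[Editor's note: the collision Kenn describes — all PipelineOptions sharing one global namespace — and the prefix idea from this thread can be illustrated with a small stdlib sketch. This is hypothetical resolution logic, not Beam's actual options parser; the runner names and `resolve` helper are made up for illustration.]

```python
def resolve(args, runner):
    """Sketch of prefix namespacing: '--dataflow_network=x' applies only
    when running on Dataflow; un-prefixed options apply everywhere."""
    known_runners = ("dataflow", "flink", "spark")
    resolved = {}
    for arg in args:
        name, _, value = arg.lstrip("-").partition("=")
        prefix, _, rest = name.partition("_")
        if prefix in known_runners:
            if prefix == runner:
                resolved[rest] = value  # runner-specific and it matches
        else:
            resolved[name] = value      # generic option, applies to all
    return resolved

args = ["--network=shared", "--dataflow_network=gcp-net",
        "--flink_network=yarn-net"]
print(resolve(args, "dataflow"))  # {'network': 'gcp-net'}
```

With prefixes, the same pipeline invocation can carry different `--network` values for an on-premise run on one runner and a cloud run on another, which is the portability concern Reza raised.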



Re: [DISCUSS] Should File based IOs implement readAll() or just readFiles()

2019-05-03 Thread Ismaël Mejía
For info, both the AvroIO ReadAll/ParseAll and TextIO ReadAll deprecations
were merged into master today and will be part of 2.13.0.

For those working in other SDKs (Python, Go): please take care not to
implement such transforms (or deprecate them too if already done)
to keep the API ideas coherent.

On Wed, Feb 6, 2019 at 11:27 AM Jean-Baptiste Onofré  wrote:
>
> +1
>
> Thanks for that Ismaël.
>
> Regards
> JB
>
> On 06/02/2019 11:24, Ismaël Mejía wrote:
> > Since it seems we have consensus on deprecating both transforms I created
> >
> > BEAM-6605 Deprecate TextIO.readAll() and TextIO.ReadAll transform
> > BEAM-6606 Deprecate AvroIO.readAll() and AvroIO.ReadAll transform
> >
> > Thanks everyone.
> >
> > On Fri, Feb 1, 2019 at 7:03 PM Chamikara Jayalath  
> > wrote:
> >>
> >> Python SDK doesn't have FileIO yet so let's keep ReadAllFromFoo transforms 
> >> currently available for various file types around till we have that.
> >>
> >> Thanks,
> >> Cham
> >>
> >> On Fri, Feb 1, 2019 at 7:41 AM Jean-Baptiste Onofré  
> >> wrote:
> >>>
> >>> Hi,
> >>>
> >>> readFiles() should be used IMHO. We should remove readAll() to avoid
> >>> confusion.
> >>>
> >>> Regards
> >>> JB
> >>>
> >>> On 30/01/2019 17:25, Ismaël Mejía wrote:
>  Hello,
> 
>  A ‘recent’ pattern of use in Beam is to have in file based IOs a
>  `readAll()` implementation that basically matches a `PCollection` of
>  file patterns and reads them, e.g. `TextIO`, `AvroIO`. `ReadAll` is
>  implemented by an expand function that matches files with FileIO and
>  then reads them using a format-specific `ReadFiles` transform, e.g.
>  TextIO.ReadFiles, AvroIO.ReadFiles. So in the end `ReadAll` in the
>  Java implementation is just a user-friendly API to hide FileIO.match
>  + ReadFiles.
> 
>  Most recent IOs do NOT implement ReadAll to encourage the more
>  composable approach of File + ReadFiles, e.g. XmlIO and ParquetIO.
> 
>  Implementing ReadAll as a wrapper is relatively easy and is definitely
>  user friendly, but it has an issue: it may be error-prone and it adds
>  more code to maintain (mostly ‘repeated’ code). However, `readAll` is a
>  more abstract pattern that applies not only to file-based IOs, so it
>  makes sense, for example, in other transforms that map a `PCollection`
>  of read requests and is the basis for SDF composable style APIs like
>  the recent `HBaseIO.readAll()`.
> 
>  So the question is should we:
> 
>  [1] Implement `readAll` in all file based IOs to be user friendly and
>  assume the (minor) maintenance cost
> 
>  or
> 
>  [2] Deprecate `readAll` from file based IOs and encourage users to use
>  FileIO + `readFiles` (less maintenance and encourage composition).
> 
>  I just checked quickly in the python code base but I did not find if
>  the File match + ReadFiles pattern applies, but it would be nice to
>  see what the python guys think on this too.
> 
>  This discussion comes from a recent slack conversation with Łukasz
>  Gajowy, and we wanted to settle into one approach to make the IO
>  signatures consistent, so any opinions/preferences?
> 
> >>>
> >>> --
> >>> Jean-Baptiste Onofré
> >>> jbono...@apache.org
> >>> http://blog.nanthrax.net
> >>> Talend - http://www.talend.com
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
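[Editor's note: the composition the thread settles on — `readAll` being nothing more than "match patterns" piped into "read matched files" — can be sketched in plain Python with stdlib stand-ins. `match`, `read_files`, and `read_all` below are hypothetical names using `glob` in place of FileIO.match; this is a conceptual illustration, not Beam code.]

```python
import glob
import os
import tempfile

def match(patterns):
    """Stand-in for FileIO.match(): expand patterns to concrete files."""
    for pattern in patterns:
        yield from sorted(glob.glob(pattern))

def read_files(paths):
    """Stand-in for a format-specific ReadFiles: read each matched file."""
    for path in paths:
        with open(path) as f:
            yield from f.read().splitlines()

def read_all(patterns):
    """The deprecated readAll() is just the composition of the two."""
    return read_files(match(patterns))

# Demo: two small files matched by one pattern.
with tempfile.TemporaryDirectory() as d:
    for name, text in [("a.txt", "one\ntwo"), ("b.txt", "three")]:
        with open(os.path.join(d, name), "w") as f:
            f.write(text)
    print(list(read_all([os.path.join(d, "*.txt")])))
    # ['one', 'two', 'three']
```

Deprecating `read_all` costs users one extra line while removing the duplicated wrapper from every file-based IO — the trade-off discussed above.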


Re: [DISCUSS][SQL] Providing support for DISTINCT aggregations

2019-05-03 Thread Kenneth Knowles
All good points. My version of the two shuffle approach does not work at
all.

On Fri, May 3, 2019 at 11:38 AM Brian Hulette  wrote:

> Rui's point about FLOAT/DOUBLE columns is interesting as well. We couldn't
> support distinct aggregations on floating point columns with the
> two-shuffle approach, but we could with the CombineFn approach. I'm not
> sure if that's a good thing or not, it seems like an anti-pattern to do a
> distinct aggregation on floating point numbers but I suppose the spec
> allows it.
>

I can't find the Jira, but grouping on doubles has been discussed at some
length before. Many DBMSs do not provide this, so it is not generally
expected by SQL users. That is good, because mathematically it is
questionable - floating point is usually used as a stand-in for real
numbers, where computing equality is not generally possible. So any code
that actually depends on equality of floating points is likely susceptible
to rounding errors, other quirks of floating point, and also is probably
misguided because the underlying thing that floats are approximating
already cannot be checked for equality.

Kenn
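[Editor's note: Kenn's point about floats being a stand-in for real numbers is easy to demonstrate. Two ways of computing "the same" value produce different float keys, so grouping on the column silently splits the group:]

```python
# 0.1 + 0.2 and 0.3 denote the same real number but are distinct doubles,
# so a GROUP BY on this column produces two groups instead of one.
a = 0.1 + 0.2
b = 0.3
print(a == b)  # False
print(a)       # 0.30000000000000004

groups = {}
for key in (a, b):
    groups[key] = groups.get(key, 0) + 1
print(len(groups))  # 2, not the 1 a user might expect
```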


>
> Brian
>
>
> On Fri, May 3, 2019 at 10:52 AM Rui Wang  wrote:
>
>> To clarify what I said "So the two-shuffle approach will lead to two
>> different implementations for tables with and without FLOAT/DOUBLE columns.":
>>
>> Basically I wanted to say that the two-shuffle approach will be an
>> implementation for some cases, and it will co-exist with the CombineFn
>> approach. In the future, when we start cost-based optimization in
>> BeamSQL, CBO is supposed to compare different plans.
>>
>> -Rui
>>
>> On Fri, May 3, 2019 at 10:40 AM Rui Wang  wrote:
>>
>>>
 As to the distinct aggregations: At the least, these queries should be
 rejected, not evaluated incorrectly.

>>>
>>> Yes. The least we can do is not support it and throw a clear message
>>> saying so. (The current implementation ignores DISTINCT and executes all
>>> aggregations as ALL.)
>>>
>>>
 The term "stateful CombineFn" is not one I would use, as the nature of
 state is linearity and the nature of CombineFn is parallelism. So I don't
 totally understand this proposal. If I replace stateful CombineFn with
 stateful DoFn with one combining state per column, then I think I
 understand. FWIW on a runner with scalable SetState or MapState it will not
 be any risk at all.

 I see. "Stateful" is indeed misleading. In this thread, it was all
>>> about using simple CombineFn to achieve DISTINCT aggregation with massive
>>> parallelism.
>>>
>>> But if you go the two shuffle route, you don't have to separate the
 aggregations and re-join them. You just have to incur the cost of the GBK +
 DISTINCT for all columns, and just drop the secondary key for the second
 shuffle, no?

 The two-shuffle approach cannot be the unified approach because it requires
>>> building a key of group_by_key + table_row to deduplicate, but table_row
>>> might contain floating-point numbers, which cannot be used as a key in GBK.
>>> So the two-shuffle approach will lead to two different implementations for
>>> tables with and without FLOAT/DOUBLE columns.
>>>
>>> CombineFn is the unified approach for distinct and non-distinct
>>> aggregation: each aggregation call will be a CombineFn.
>>>
>>>
>>> -Rui
>>>
>>>
>>>
 Kenn

 On Thu, May 2, 2019 at 5:11 PM Ahmet Altay  wrote:

>
>
> On Thu, May 2, 2019 at 2:18 PM Rui Wang  wrote:
>
>> Brian's first proposal is also challenging, partially because in
>> BeamSQL there is no good practice for dealing with complex SQL plans.
>> Ideally we need enough rules and SQL plan nodes in Beam to construct
>> easy-to-transform plans for different cases. I was in a similar situation
>> before when I needed to separate the logical plans of "JOIN ON a OR b" and
>> "JOIN ON a AND b", because their implementations are too different to fit
>> into the same JoinRelNode. Similarly, when both distinct and non-distinct
>> aggregations are mixed, a single AggregationRelNode can hardly encapsulate
>> the complex logic.
>>
>> We will need a detailed plan to rethink the RelNodes and Rules in
>> BeamSQL, which is out of scope for supporting DISTINCT.
>>
>> I would favor the second proposal because BeamSQL uses Beam schema and
>> row. Schema itself uses Java primitives for most of its types (int, long,
>> float, etc.). It limits the size of each element. Considering per key and
>> per window combine, there is a good chance that stateful combine works 
>> for
>> some (if not most) cases.
>>
>> Could we use @Experimental to tag stateful combine for supporting
>> DISTINCT in aggregation so that we could have a chance to test it by 
>> users?
>>
>>
>> -Rui
>>
>>
>>
>> On Thu, May 2, 2019 at 1:16 PM Brian Hulette 
>> wrote:
>>
>>> Ahmet -

Re: kafka client interoperability

2019-05-03 Thread Moorhead,Richard
We attempted a downgrade to beam-sdks-java-io-kafka 2.9 while using 2.10 for
the rest and ran into issues. I still see calls to ConsumerSpEL throughout
ProducerRecordCoder, and I am beginning to think this is a bug.


From: Juan Carlos Garcia 
Sent: Thursday, May 2, 2019 11:10 PM
To: u...@beam.apache.org
Cc: dev
Subject: Re: kafka client interoperability

Downgrade only the KafkaIO module to the version that works for you (also
excluding any transitive dependencies of it); that works for us.

JC.

Lukasz Cwik mailto:lc...@google.com>> wrote on Thu., May 2,
2019, 20:05:
+dev

On Thu, May 2, 2019 at 10:34 AM Moorhead,Richard 
mailto:richard.moorhe...@cerner.com>> wrote:
In Beam 2.9.0, this check was made:

https://github.com/apache/beam/blob/2ba00576e3a708bb961a3c64a2241d9ab32ab5b3/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaRecordCoder.java#L132

However this logic was removed in 2.10+ in the newer ProducerRecordCoder class:

https://github.com/apache/beam/blob/release-2.10.0/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/ProducerRecordCoder.java#L137


We are attempting to use Beam 2.10 with kafka 0.10.2.1; this is advertised as 
supported here:
https://beam.apache.org/releases/javadoc/2.10.0/org/apache/beam/sdk/io/kafka/KafkaIO.html

However we are experiencing issues with the `headers` method call mentioned 
above. Is there a way around this?




CONFIDENTIALITY NOTICE This message and any included attachments are from 
Cerner Corporation and are intended only for the addressee. The information 
contained in this message is confidential and may constitute inside or 
non-public information under international, federal, or state securities laws. 
Unauthorized forwarding, printing, copying, distribution, or use of such 
information is strictly prohibited and may be unlawful. If you are not the 
addressee, please promptly delete this message and notify the sender of the 
delivery error by e-mail or you may call Cerner's corporate offices in Kansas 
City, Missouri, U.S.A at (+1) (816)221-1024.
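[Editor's note: the 2.9 coder guarded header (de)serialization behind a ConsumerSpEL capability check, because Kafka clients before 0.11 have no record headers; 2.10's ProducerRecordCoder accesses `headers()` unconditionally. The kind of guard that was removed can be sketched generically in Python — `encode_record` and `client_supports_headers` are hypothetical names; this is not KafkaIO's actual code.]

```python
def encode_record(record, client_supports_headers):
    """Sketch of a version guard: only touch record['headers'] when the
    underlying client actually supports headers (pre-0.11 Kafka clients
    do not), so old record types cannot blow up on the access."""
    encoded = {"key": record["key"], "value": record["value"]}
    if client_supports_headers:
        # Safe: this client version exposes headers.
        encoded["headers"] = record.get("headers", [])
    # On an old client the field is never read at all.
    return encoded

old = encode_record({"key": "k", "value": "v"},
                    client_supports_headers=False)
new = encode_record({"key": "k", "value": "v", "headers": [("h", b"1")]},
                    client_supports_headers=True)
print("headers" in old, "headers" in new)  # False True
```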


Re: [DISCUSS][SQL] Providing support for DISTINCT aggregations

2019-05-03 Thread Rui Wang
To clarify what I said "So the two-shuffle approach will lead to two different
implementations for tables with and without FLOAT/DOUBLE columns.":

Basically I wanted to say that the two-shuffle approach will be an
implementation for some cases, and it will co-exist with the CombineFn
approach. In the future, when we start cost-based optimization in
BeamSQL, CBO is supposed to compare different plans.

-Rui
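[Editor's note: the CombineFn approach Rui describes — each DISTINCT aggregation keeping the set of values seen so far in its accumulator — can be sketched in plain Python. The method names mirror the create/add/merge/extract contract of a Beam CombineFn, but `CountDistinctFn` is an illustrative class, not BeamSQL code; the growing accumulator is exactly the scalability concern raised in this thread.]

```python
class CountDistinctFn:
    """Sketch of a COUNT(DISTINCT ...) CombineFn: the accumulator is the
    set of values seen so far, so its size grows with the number of
    distinct inputs rather than staying constant."""

    def create_accumulator(self):
        return set()

    def add_input(self, acc, value):
        acc.add(value)
        return acc

    def merge_accumulators(self, accs):
        # Merging unions the seen-value sets from parallel workers.
        merged = set()
        for acc in accs:
            merged |= acc
        return merged

    def extract_output(self, acc):
        return len(acc)

fn = CountDistinctFn()
a1 = fn.add_input(fn.add_input(fn.create_accumulator(), 1), 2)
a2 = fn.add_input(fn.create_accumulator(), 2)
print(fn.extract_output(fn.merge_accumulators([a1, a2])))  # 2
```

Unlike the two-shuffle route, nothing here requires the values to be usable as GBK keys, which is why this works uniformly for FLOAT/DOUBLE columns as well.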

On Fri, May 3, 2019 at 10:40 AM Rui Wang  wrote:

>
>> As to the distinct aggregations: At the least, these queries should be
>> rejected, not evaluated incorrectly.
>>
>
> Yes. The least we can do is not support it and throw a clear message
> saying so. (The current implementation ignores DISTINCT and executes all
> aggregations as ALL.)
>
>
>> The term "stateful CombineFn" is not one I would use, as the nature of
>> state is linearity and the nature of CombineFn is parallelism. So I don't
>> totally understand this proposal. If I replace stateful CombineFn with
>> stateful DoFn with one combining state per column, then I think I
>> understand. FWIW on a runner with scalable SetState or MapState it will not
>> be any risk at all.
>>
>> I see. "Stateful" is indeed misleading. In this thread, it was all about
> using simple CombineFn to achieve DISTINCT aggregation with massive
> parallelism.
>
> But if you go the two shuffle route, you don't have to separate the
>> aggregations and re-join them. You just have to incur the cost of the GBK +
>> DISTINCT for all columns, and just drop the secondary key for the second
>> shuffle, no?
>>
>> The two-shuffle approach cannot be the unified approach because it requires
> building a key of group_by_key + table_row to deduplicate, but table_row
> might contain floating-point numbers, which cannot be used as a key in GBK.
> So the two-shuffle approach will lead to two different implementations for
> tables with and without FLOAT/DOUBLE columns.
>
> CombineFn is the unified approach for distinct and non-distinct
> aggregation: each aggregation call will be a CombineFn.
>
>
> -Rui
>
>
>
>> Kenn
>>
>> On Thu, May 2, 2019 at 5:11 PM Ahmet Altay  wrote:
>>
>>>
>>>
>>> On Thu, May 2, 2019 at 2:18 PM Rui Wang  wrote:
>>>
 Brian's first proposal is also challenging, partially because in BeamSQL
 there is no good practice for dealing with complex SQL plans. Ideally we need
 enough rules and SQL plan nodes in Beam to construct easy-to-transform plans
 for different cases. I was in a similar situation before when I needed to
 separate the logical plans of "JOIN ON a OR b" and "JOIN ON a AND b",
 because their implementations are too different to fit into the same
 JoinRelNode. Similarly, when both distinct and non-distinct aggregations
 are mixed, a single AggregationRelNode can hardly encapsulate the complex
 logic.

 We will need a detailed plan to rethink the RelNodes and Rules in
 BeamSQL, which is out of scope for supporting DISTINCT.

 I would favor the second proposal because BeamSQL uses Beam schema and
 row. Schema itself uses Java primitives for most of its types (int, long,
 float, etc.). It limits the size of each element. Considering per key and
 per window combine, there is a good chance that stateful combine works for
 some (if not most) cases.

 Could we use @Experimental to tag the stateful combine supporting DISTINCT
 aggregations, so that users have a chance to test it?


 -Rui



 On Thu, May 2, 2019 at 1:16 PM Brian Hulette 
 wrote:

> Ahmet -
> I think it would only require observing each key's partition of the
> input independently, and the size of the state would only be proportional
> to the number of distinct elements, not the entire input. Note the 
> pipeline
> would be a GBK with a key based on the GROUP BY, followed by a
> Combined.GroupedValue with a (possibly very stateful) CombineFn.
>

>>> Got it. The number of distinct elements could be proportional to the entire
>>> input; however, if this is a reasonable requirement from a product
>>> perspective, that is fine. Rui's suggestion of using an experimental tag is
>>> also a good idea. I suppose that will give us the ability to change the
>>> implementation if it becomes necessary.
>>>
>>>

> Luke -
> Here's a little background on why I think (1) is harder (it may also
> just be that it looks daunting to me as someone who's not that familiar
> with the code).
>
> An aggregation node can have multiple aggregations. So, for example,
> the query `SELECT k, SUM(x), COUNT(DISTINCT y), AVG(DISTINCT z) FROM ...`
> would yield a logical plan that has a single aggregation node with three
> different aggregations. We then take that node and build up a CombineFn
> that is a composite of all of the aggregations we need to make a combining
> PTransform [1]. To implement (1) we would need to distinguish between all

Re: Better naming for runner specific options

2019-05-03 Thread Chamikara Jayalath
Also, we do have runner-specific options classes where truly runner-specific
options can go:

https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/options/DataflowPipelineOptions.java
https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineOptions.java

On Fri, May 3, 2019 at 9:50 AM Ahmet Altay  wrote:

> I agree, that is a good point.
>
> *From: *Lukasz Cwik 
> *Date: *Fri, May 3, 2019 at 9:37 AM
> *To: *dev
>
> The concept of a machine type isn't necessarily limited to Dataflow. If it
>> made sense for a runner, they could use AWS/Azure machine types as well.
>>
>> On Fri, May 3, 2019 at 9:32 AM Ahmet Altay  wrote:
>>
>>> This idea was discussed in a PR a few months ago, and JIRA was filed as
>>> a follow up [1]. IMO, it makes sense to use a namespace prefix. The primary
>>> issue here is that, such a change will very likely be a backward
>>> incompatible change and would be hard to do before the next major version.
>>>
>>> [1] https://issues.apache.org/jira/browse/BEAM-6531
>>>
>>> *From: *Reza Rokni 
>>> *Date: *Thu, May 2, 2019 at 8:00 PM
>>> *To: * 
>>>
>>> Hi,

 Was reading this SO question:


 https://stackoverflow.com/questions/53833171/googlecloudoptions-doesnt-have-all-options-that-pipeline-options-has

 And noticed that in


 https://beam.apache.org/releases/pydoc/2.12.0/_modules/apache_beam/options/pipeline_options.html#WorkerOptions

 The option is called --worker_machine_type.

 I wonder if runner specific options should have the runner in the
 prefix? Something like --dataflow_worker_machine_type?

 Cheers
 Reza

 --

 This email may be confidential and privileged. If you received this
 communication by mistake, please don't forward it to anyone else, please
 erase all copies and attachments, and please let me know that it has gone
 to the wrong person.

 The above terms reflect a potential business arrangement, are provided
 solely as a basis for further discussion, and are not intended to be and do
 not constitute a legally binding obligation. No legally binding obligations
 will be created, implied, or inferred until an agreement in final form is
 executed in writing by all parties involved.

>>>


Re: [DISCUSS][SQL] Providing support for DISTINCT aggregations

2019-05-03 Thread Rui Wang
>
>
> As to the distinct aggregations: At the least, these queries should be
> rejected, not evaluated incorrectly.
>

Yes. The minimum is to not support it and to throw a clear error message
saying so (the current implementation ignores DISTINCT and executes all
aggregations as ALL).


> The term "stateful CombineFn" is not one I would use, as the nature of
> state is linearity and the nature of CombineFn is parallelism. So I don't
> totally understand this proposal. If I replace stateful CombineFn with
> stateful DoFn with one combining state per column, then I think I
> understand. FWIW on a runner with scalable SetState or MapState it will not
> be any risk at all.
>
> I see. "Stateful" is indeed misleading. In this thread, it was all about
using simple CombineFn to achieve DISTINCT aggregation with massive
parallelism.

But if you go the two shuffle route, you don't have to separate the
> aggregations and re-join them. You just have to incur the cost of the GBK +
> DISTINCT for all columns, and just drop the secondary key for the second
> shuffle, no?
>
> The two-shuffle approach cannot be the unified approach because it requires
building a key of group_by_key + table_row to deduplicate, but table_row might
contain floating-point numbers, which cannot be used as keys in a GBK. So the
two-shuffle approach would lead to two different implementations for tables
with and without FLOAT/DOUBLE columns.
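The floating-point problem is concrete: primitive and boxed double equality in Java disagree on NaN and on the signed zeros, so rows containing doubles make unreliable grouping keys. A quick plain-Java illustration (not Beam code):

```java
// Why rows with doubles make unreliable GBK keys: primitive and boxed
// floating-point equality disagree on NaN and on the signed zeros.
public class FloatKeyPitfalls {
    public static void main(String[] args) {
        double nan = Double.NaN;
        System.out.println(nan == nan);                        // false: NaN != NaN as primitives
        System.out.println(Double.valueOf(nan).equals(nan));   // true: boxed keys say NaN == NaN

        System.out.println(0.0 == -0.0);                       // true as primitives
        System.out.println(Double.valueOf(0.0).equals(-0.0));  // false as boxed keys
    }
}
```

Whichever equality a runner picks for key grouping, one of these pairs behaves surprisingly.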

The CombineFn is the unified approach for distinct and non-distinct
aggregations: each aggregation call will be a CombineFn.


-Rui
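To make the CombineFn route concrete, a COUNT(DISTINCT) combiner can be sketched in plain Java (hypothetical names; the real Beam version would extend Combine.CombineFn and register a coder for the accumulator). The accumulator is simply the set of distinct values seen:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of a COUNT(DISTINCT) combiner. The accumulator is the set of
// distinct values seen so far, so per-key-and-window state grows with
// the number of distinct elements, not with the total input size.
public class DistinctCountFn {
    public Set<Object> createAccumulator() {
        return new HashSet<>();
    }

    public Set<Object> addInput(Set<Object> accumulator, Object input) {
        accumulator.add(input);
        return accumulator;
    }

    public Set<Object> mergeAccumulators(Iterable<Set<Object>> accumulators) {
        Set<Object> merged = new HashSet<>();
        for (Set<Object> accumulator : accumulators) {
            merged.addAll(accumulator);
        }
        return merged;
    }

    public long extractOutput(Set<Object> accumulator) {
        return accumulator.size();
    }

    public static void main(String[] args) {
        DistinctCountFn fn = new DistinctCountFn();
        Set<Object> shard1 = fn.createAccumulator();
        fn.addInput(shard1, "x");
        fn.addInput(shard1, "y");
        fn.addInput(shard1, "x"); // duplicate, absorbed by the set
        Set<Object> shard2 = fn.createAccumulator();
        fn.addInput(shard2, "y");
        fn.addInput(shard2, "z");
        // prints 3
        System.out.println(fn.extractOutput(fn.mergeAccumulators(Arrays.asList(shard1, shard2))));
    }
}
```

This is also why the state-size concern above is about distinct elements only: duplicates are absorbed at addInput time.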



> Kenn
>
> On Thu, May 2, 2019 at 5:11 PM Ahmet Altay  wrote:
>
>>
>>
>> On Thu, May 2, 2019 at 2:18 PM Rui Wang  wrote:
>>
>>> Brian's first proposal is also challenging partly because BeamSQL has no
>>> good practice for dealing with complex SQL plans. Ideally we need enough
>>> rules and SQL plan nodes in Beam to construct easy-to-transform plans for
>>> different cases. I had a similar situation before when I needed to separate
>>> the logical plans of "JOIN ON a OR b" and "JOIN ON a AND b", because their
>>> implementations are too different to fit into the same JoinRelNode. We seem
>>> to be in a similar situation here: when distinct and non-distinct
>>> aggregations are mixed, a single AggregationRelNode is hard-pressed to
>>> encapsulate such complex logic.
>>>
>>> We will need a detailed plan to rethink the RelNodes and Rules in
>>> BeamSQL, which is out of scope for supporting DISTINCT.
>>>
>>> I would favor the second proposal because BeamSQL uses Beam schemas and
>>> rows. The schema itself uses Java primitives for most of its types (int,
>>> long, float, etc.), which limits the size of each element. Considering the
>>> combine is per key and per window, there is a good chance that stateful
>>> combine works for some (if not most) cases.
>>>
>>> Could we use @Experimental to tag the stateful combine supporting DISTINCT
>>> aggregations, so that users have a chance to test it?
>>>
>>>
>>> -Rui
>>>
>>>
>>>
>>> On Thu, May 2, 2019 at 1:16 PM Brian Hulette 
>>> wrote:
>>>
 Ahmet -
 I think it would only require observing each key's partition of the
 input independently, and the size of the state would only be proportional
 to the number of distinct elements, not the entire input. Note the pipeline
 would be a GBK with a key based on the GROUP BY, followed by a
 Combined.GroupedValue with a (possibly very stateful) CombineFn.

>>>
>> Got it. The number of distinct elements could be proportional to the entire
>> input; however, if this is a reasonable requirement from a product
>> perspective, that is fine. Rui's suggestion of using an experimental tag is
>> also a good idea. I suppose that will give us the ability to change the
>> implementation if it becomes necessary.
>>
>>
>>>
 Luke -
 Here's a little background on why I think (1) is harder (it may also
 just be that it looks daunting to me as someone who's not that familiar
 with the code).

 An aggregation node can have multiple aggregations. So, for example,
 the query `SELECT k, SUM(x), COUNT(DISTINCT y), AVG(DISTINCT z) FROM ...`
 would yield a logical plan that has a single aggregation node with three
 different aggregations. We then take that node and build up a CombineFn
 that is a composite of all of the aggregations we need to make a combining
 PTransform [1]. To implement (1) we would need to distinguish between all
 the DISTINCT and non-DISTINCT aggregations, and come up with a way to unify
 the 2-GBK DISTINCT pipeline and the 1-GBK non-DISTINCT pipeline.

 That's certainly not unsolvable, but approach (2) is much simpler - it
 just requires implementing some variations on the CombineFn's that already
 exist [2] and re-using the existing logic for converting an aggregation
 node to a combining PTransform.


 Hopefully that makes sense, let me know if I need to clarify further :)
 Brian

 [1]
 

Re: kafka client interoperability

2019-05-03 Thread Juan Carlos Garcia
Downgrade only the KafkaIO module to the version that works for you (also
excluding any of its transitive dependencies); that works for us.

JC.

Lukasz Cwik wrote on Thu., 2 May 2019, 20:05:

> +dev 
>
> On Thu, May 2, 2019 at 10:34 AM Moorhead,Richard <
> richard.moorhe...@cerner.com> wrote:
>
>> In Beam 2.9.0, this check was made:
>>
>>
>> https://github.com/apache/beam/blob/2ba00576e3a708bb961a3c64a2241d9ab32ab5b3/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaRecordCoder.java#L132
>>
>> However this logic was removed in 2.10+ in the newer ProducerRecordCoder
>> class:
>>
>>
>> https://github.com/apache/beam/blob/release-2.10.0/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/ProducerRecordCoder.java#L137
>>
>>
>> We are attempting to use Beam 2.10 with kafka 0.10.2.1; this is
>> advertised as supported here:
>>
>> https://beam.apache.org/releases/javadoc/2.10.0/org/apache/beam/sdk/io/kafka/KafkaIO.html
>>
>> However we are experiencing issues with the `headers` method call
>> mentioned above. Is there a way around this?
>>
>>
>>
>>
>> CONFIDENTIALITY NOTICE This message and any included attachments are from
>> Cerner Corporation and are intended only for the addressee. The information
>> contained in this message is confidential and may constitute inside or
>> non-public information under international, federal, or state securities
>> laws. Unauthorized forwarding, printing, copying, distribution, or use of
>> such information is strictly prohibited and may be unlawful. If you are not
>> the addressee, please promptly delete this message and notify the sender of
>> the delivery error by e-mail or you may call Cerner's corporate offices in
>> Kansas City, Missouri, U.S.A at (+1) (816)221-1024.
>>
>


Re: Better naming for runner specific options

2019-05-03 Thread Ahmet Altay
I agree, that is a good point.

*From: *Lukasz Cwik 
*Date: *Fri, May 3, 2019 at 9:37 AM
*To: *dev

The concept of a machine type isn't necessarily limited to Dataflow. If it
> made sense for a runner, they could use AWS/Azure machine types as well.
>
> On Fri, May 3, 2019 at 9:32 AM Ahmet Altay  wrote:
>
>> This idea was discussed in a PR a few months ago, and JIRA was filed as a
>> follow up [1]. IMO, it makes sense to use a namespace prefix. The primary
>> issue here is that, such a change will very likely be a backward
>> incompatible change and would be hard to do before the next major version.
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-6531
>>
>> *From: *Reza Rokni 
>> *Date: *Thu, May 2, 2019 at 8:00 PM
>> *To: * 
>>
>> Hi,
>>>
>>> Was reading this SO question:
>>>
>>>
>>> https://stackoverflow.com/questions/53833171/googlecloudoptions-doesnt-have-all-options-that-pipeline-options-has
>>>
>>> And noticed that in
>>>
>>>
>>> https://beam.apache.org/releases/pydoc/2.12.0/_modules/apache_beam/options/pipeline_options.html#WorkerOptions
>>>
>>> The option is called --worker_machine_type.
>>>
>>> I wonder if runner specific options should have the runner in the
>>> prefix? Something like --dataflow_worker_machine_type?
>>>
>>> Cheers
>>> Reza
>>>
>>>
>>


Re: Better naming for runner specific options

2019-05-03 Thread Lukasz Cwik
The concept of a machine type isn't necessarily limited to Dataflow. If it
made sense for a runner, they could use AWS/Azure machine types as well.

On Fri, May 3, 2019 at 9:32 AM Ahmet Altay  wrote:

> This idea was discussed in a PR a few months ago, and JIRA was filed as a
> follow up [1]. IMO, it makes sense to use a namespace prefix. The primary
> issue here is that, such a change will very likely be a backward
> incompatible change and would be hard to do before the next major version.
>
> [1] https://issues.apache.org/jira/browse/BEAM-6531
>
> *From: *Reza Rokni 
> *Date: *Thu, May 2, 2019 at 8:00 PM
> *To: * 
>
> Hi,
>>
>> Was reading this SO question:
>>
>>
>> https://stackoverflow.com/questions/53833171/googlecloudoptions-doesnt-have-all-options-that-pipeline-options-has
>>
>> And noticed that in
>>
>>
>> https://beam.apache.org/releases/pydoc/2.12.0/_modules/apache_beam/options/pipeline_options.html#WorkerOptions
>>
>> The option is called --worker_machine_type.
>>
>> I wonder if runner specific options should have the runner in the prefix?
>> Something like --dataflow_worker_machine_type?
>>
>> Cheers
>> Reza
>>
>>
>


Re: Better naming for runner specific options

2019-05-03 Thread Ahmet Altay
This idea was discussed in a PR a few months ago, and JIRA was filed as a
follow up [1]. IMO, it makes sense to use a namespace prefix. The primary
issue here is that, such a change will very likely be a backward
incompatible change and would be hard to do before the next major version.

[1] https://issues.apache.org/jira/browse/BEAM-6531
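If a prefix convention were adopted, the resolution logic could be a thin shim over flag parsing. A plain-Java sketch (hypothetical names — this is not Beam's actual PipelineOptions machinery): a runner-prefixed flag such as --dataflow_worker_machine_type wins over the generic --worker_machine_type.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical resolution of runner-prefixed options: a flag namespaced
// with the runner's name wins over the generic flag of the same name.
public class PrefixedOptions {
    public static String resolve(Map<String, String> flags, String runner, String option) {
        String prefixed = runner.toLowerCase() + "_" + option;
        return flags.getOrDefault(prefixed, flags.get(option));
    }

    public static void main(String[] args) {
        Map<String, String> flags = new HashMap<>();
        flags.put("worker_machine_type", "generic-4cpu");
        flags.put("dataflow_worker_machine_type", "n1-standard-4");
        System.out.println(resolve(flags, "Dataflow", "worker_machine_type")); // n1-standard-4
        System.out.println(resolve(flags, "Flink", "worker_machine_type"));    // generic-4cpu
    }
}
```

The backward-compatibility concern remains: existing pipelines pass the unprefixed names, so both spellings would have to be accepted for at least one major version.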

*From: *Reza Rokni 
*Date: *Thu, May 2, 2019 at 8:00 PM
*To: * 

Hi,
>
> Was reading this SO question:
>
>
> https://stackoverflow.com/questions/53833171/googlecloudoptions-doesnt-have-all-options-that-pipeline-options-has
>
> And noticed that in
>
>
> https://beam.apache.org/releases/pydoc/2.12.0/_modules/apache_beam/options/pipeline_options.html#WorkerOptions
>
> The option is called --worker_machine_type.
>
> I wonder if runner specific options should have the runner in the prefix?
> Something like --dataflow_worker_machine_type?
>
> Cheers
> Reza
>
>


Re: :beam-sdks-java-io-hadoop-input-format:test is extremely flaky

2019-05-03 Thread Alexey Romanenko
FYI: In the end, the module “hadoop-input-format” was removed in favour of 
using “hadoop-format” instead.

> On 29 Apr 2019, at 15:50, Jean-Baptiste Onofré  wrote:
> 
> Agree, +1
> 
> Regards
> JB
> 
> On 29/04/2019 15:30, Ismaël Mejía wrote:
>> +1 to remove it on this release, this is a maintenance pain for no real 
>> reason.
>> 
>> On Mon, Apr 29, 2019 at 3:06 PM Alexey Romanenko
>>  wrote:
>>> 
>>> Despite the fact that it got much better after fixing an issue with port 
>>> allocation (thanks to Etienne!) for the embedded Cassandra cluster (it’s 
>>> used in hadoop-input-format and was the main cause of the flakiness), I’m 
>>> 100% for removing this module since it has already been deprecated for the 
>>> last several releases.
>>> 
>>> PS: Just an observation from digging into PreCommit job results: 
>>> “org.apache.beam.runners.dataflow.worker.fn.BeamFnControlServiceTest.testClientConnecting”
>>>  has been failing quite often lately. Is anyone working on this?
>>> 
 On 29 Apr 2019, at 14:19, Maximilian Michels  wrote:
 
 I don't know what going on with it but I agree it's annoying.
 
 Came across https://jira.apache.org/jira/browse/BEAM-6247, maybe it is 
 time to remove this module for the next release?
 
 -Max
 
 On 26.04.19 20:10, Reuven Lax wrote:
> I find I usually have to rerun Presubmit multiple times to get a green 
> run, and this test is one of the biggest culprits (though it's not the 
> only culprit). Does anyone know what's going on with it?
> Reuven
>>> 
> 
> -- 
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com



Re: kafka client interoperability

2019-05-03 Thread Alexey Romanenko
Oops, I see that Richard already created a Jira about that, so I closed mine as 
a duplicate. 

> On 3 May 2019, at 15:58, Alexey Romanenko  wrote:
> 
> Thank you for reporting this. 
> 
> Seems like it’s a bug there (since ProducerRecord from kafka-clients:0.10.2.1 
> doesn’t support headers), so I created a Jira for that:
> https://issues.apache.org/jira/browse/BEAM-7217 
> 
> 
> Unfortunately, I can’t reproduce it on my machine. Could you add your pom 
> file and example of your pipeline into jira? 
> 
> As a workaround, I’d suggest to try to use kafka-clients with version >= 
> 0.11.0.2 (if it’s possible).
> 
> 
>> On 3 May 2019, at 14:12, Moorhead, Richard <richard.moorhe...@cerner.com> wrote:
>> 
>> We attempted a downgrade to beam-sdks-java-io-kafka 2.9 while using 2.10 for 
>> the rest and ran into issues. I still see checks to the ConsumerSpel 
>> throughout ProducerRecordCoder and I am beginning to think this is a bug.
>> 
>> From: Juan Carlos Garcia <jcgarc...@gmail.com>
>> Sent: Thursday, May 2, 2019 11:10 PM
>> To: u...@beam.apache.org 
>> Cc: dev
>> Subject: Re: kafka client interoperability
>>  
>> Downgrade only the KafkaIO module to the version that works for you (also 
>> excluding any transient dependency of it) that works for us.
>> 
>> JC. 
>> 
>> Lukasz Cwik <lc...@google.com> wrote on Thu., 2 May 2019, 20:05:
>> +dev  
>> 
>> On Thu, May 2, 2019 at 10:34 AM Moorhead, Richard
>> <richard.moorhe...@cerner.com> wrote:
>> In Beam 2.9.0, this check was made:
>> 
>> https://github.com/apache/beam/blob/2ba00576e3a708bb961a3c64a2241d9ab32ab5b3/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaRecordCoder.java#L132
>>  
>> 
>> 
>> However this logic was removed in 2.10+ in the newer ProducerRecordCoder 
>> class:
>> 
>> https://github.com/apache/beam/blob/release-2.10.0/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/ProducerRecordCoder.java#L137
>>  
>> 
>> 
>> 
>> We are attempting to use Beam 2.10 with kafka 0.10.2.1; this is advertised 
>> as supported here:
>> https://beam.apache.org/releases/javadoc/2.10.0/org/apache/beam/sdk/io/kafka/KafkaIO.html
>>  
>> 
>> 
>> However we are experiencing issues with the `headers` method call mentioned 
>> above. Is there a way around this?
>> 
>> 
> 



Re: kafka client interoperability

2019-05-03 Thread Alexey Romanenko
Thank you for reporting this. 

Seems like it’s a bug there (since ProducerRecord from kafka-clients:0.10.2.1 
doesn’t support headers), so I created a Jira for that:
https://issues.apache.org/jira/browse/BEAM-7217 


Unfortunately, I can’t reproduce it on my machine. Could you add your pom file 
and example of your pipeline into jira? 

As a workaround, I’d suggest trying kafka-clients version >= 0.11.0.2 (if 
that’s possible).
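One defensive pattern for this kind of client-version skew — roughly the spirit of what the removed KafkaRecordCoder check and Beam's ConsumerSpEL do — is to probe for the method at runtime and only call it when present. A generic sketch of the probe, using JDK classes since kafka-clients isn't assumed on the classpath:

```java
import java.lang.reflect.Method;

// Runtime capability probe: look a method up once and gate any calls on
// whether it exists, instead of assuming the dependency's version.
public class CapabilityCheck {
    public static boolean hasMethod(Class<?> clazz, String methodName) {
        for (Method method : clazz.getMethods()) {
            if (method.getName().equals(methodName)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // ProducerRecord#headers() exists only in newer kafka-clients;
        // JDK classes stand in here just to show the mechanism.
        System.out.println(hasMethod(java.util.ArrayList.class, "removeIf")); // true
        System.out.println(hasMethod(java.util.ArrayList.class, "headers"));  // false
    }
}
```

In ProducerRecordCoder the result of such a probe would gate the headers serialization path, restoring the 2.9.0 behaviour for old clients.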


> On 3 May 2019, at 14:12, Moorhead,Richard  
> wrote:
> 
> We attempted a downgrade to beam-sdks-java-io-kafka 2.9 while using 2.10 for 
> the rest and ran into issues. I still see checks to the ConsumerSpel 
> throughout ProducerRecordCoder and I am beginning to think this is a bug.
> 
> From: Juan Carlos Garcia 
> Sent: Thursday, May 2, 2019 11:10 PM
> To: u...@beam.apache.org
> Cc: dev
> Subject: Re: kafka client interoperability
>  
> Downgrade only the KafkaIO module to the version that works for you (also 
> excluding any transient dependency of it) that works for us.
> 
> JC. 
> 
> Lukasz Cwik <lc...@google.com> wrote on Thu., 2 May 2019, 20:05:
> +dev  
> 
> On Thu, May 2, 2019 at 10:34 AM Moorhead, Richard
> <richard.moorhe...@cerner.com> wrote:
> In Beam 2.9.0, this check was made:
> 
> https://github.com/apache/beam/blob/2ba00576e3a708bb961a3c64a2241d9ab32ab5b3/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaRecordCoder.java#L132
>  
> 
> 
> However this logic was removed in 2.10+ in the newer ProducerRecordCoder 
> class:
> 
> https://github.com/apache/beam/blob/release-2.10.0/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/ProducerRecordCoder.java#L137
>  
> 
> 
> 
> We are attempting to use Beam 2.10 with kafka 0.10.2.1; this is advertised as 
> supported here:
> https://beam.apache.org/releases/javadoc/2.10.0/org/apache/beam/sdk/io/kafka/KafkaIO.html
>  
> 
> 
> However we are experiencing issues with the `headers` method call mentioned 
> above. Is there a way around this?
> 
> 



Re: [DISCUSS] Performance of Beam compare to "Bare Runner"

2019-05-03 Thread Maximilian Michels
Misread your post. You're saying that Kryo is more efficient than a 
roundtrip obj->bytes->obj_copy. Still, most types use Flink's 
serializers, which also do the above roundtrip. So I'm not sure this 
performance advantage holds true for other Flink jobs.


On 02.05.19 20:01, Maximilian Michels wrote:
> I am not sure what you are referring to here. What do you mean, Kryo is 
> simply slower ... Beam Kryo or Flink Kryo?


Flink uses Kryo as a fallback serializer when its own type serialization 
system can't analyze the type. I'm just guessing here that this could be 
slower.


On 02.05.19 16:51, Jozef Vilcek wrote:



On Thu, May 2, 2019 at 3:41 PM Maximilian Michels > wrote:


    Thanks for the JIRA issues Jozef!

 > So the feature in Flink is operator chaining and Flink per
    default initiate copy of input elements. In case of Beam coders copy
    seems to be more noticable than native Flink.

    Copying between chained operators can be turned off in the
    FlinkPipelineOptions (if you know what you're doing).


Yes, I know that it can be instructed to reuse objects (if you are 
referring to this). I am just not sure I want to open this door in 
general :)
But it is interesting to learn that, with portability, this will be 
turned on by default. Quite an important finding imho.
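For context on what the per-element copy costs: a coder/serializer roundtrip is essentially a deep clone of every element. A plain-Java sketch using Java serialization (Beam would go through a Coder, and Flink through its TypeSerializer, but the shape of the work is the same):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;

// A deep copy via a serialization roundtrip: roughly the per-element
// work that object reuse avoids, at the price of downstream mutations
// becoming visible upstream once reuse is enabled.
public class RoundtripCopy {
    @SuppressWarnings("unchecked")
    public static <T extends Serializable> T copy(T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        ArrayList<String> original = new ArrayList<>();
        original.add("a");
        ArrayList<String> clone = copy(original);
        clone.add("b");
        System.out.println(original); // [a] -- the copy is independent
    }
}
```

This allocation plus encode/decode on every hop between chained operators is exactly what object reuse skips.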


    Beam coders should
    not be slower than Flink's. They are simply wrapped. It seems Kryo is
    simply slower, which we could fix by providing more type hints to
    Flink.



I am not sure what are you referring to here. What do you mean Kryo is 
simply slower ... Beam Kryo or Flink Kryo or?


    -Max

    On 02.05.19 13:15, Robert Bradshaw wrote:
 > Thanks for filing those.
 >
 > As for how not doing a copy is "safe," it's not really. Beam 
simply

 > asserts that you MUST NOT mutate your inputs (and direct runners,
 > which are used during testing, do perform extra copies and 
checks to

 > catch violations of this requirement).
 >
 > On Thu, May 2, 2019 at 1:02 PM Jozef Vilcek
    <jozo.vil...@gmail.com> wrote:
 >>
 >> I have created
 >> https://issues.apache.org/jira/browse/BEAM-7204
 >> https://issues.apache.org/jira/browse/BEAM-7206
 >>
 >> to track these topics further
 >>
 >> On Wed, May 1, 2019 at 1:24 PM Jozef Vilcek
    <jozo.vil...@gmail.com> wrote:
 >>>
 >>>
 >>>
 >>> On Tue, Apr 30, 2019 at 5:42 PM Kenneth Knowles
    <k...@apache.org> wrote:
 
 
 
  On Tue, Apr 30, 2019, 07:05 Reuven Lax <re...@google.com> wrote:
 >
 > In that case, Robert's point is quite valid. The old Flink
    runner I believe had no knowledge of fusion, which was known to make
    it extremely slow. A lot of work went into making the portable
    runner fusion aware, so we don't need to round trip through coders
    on every ParDo.
 
 
  The old Flink runner got fusion for free, since Flink does it.
    The new fusion in portability is because fusing the runner side of
    portability steps does not achieve real fusion
 >>>
 >>>
 >>> Aha, I see. So the feature in Flink is operator chaining and
    Flink per default initiate copy of input elements. In case of Beam
    coders copy seems to be more noticable than native Flink.
 >>> So do I get it right that in portable runner scenario, you do
    similar chaining via this "fusion of stages"? Curious here... how is
    it different from chaining so runner can be sure that not doing copy
    is "safe" with respect to user defined functions and their behaviour
    over inputs?
 >>>
 >
 >
 > Reuven
 >
 > On Tue, Apr 30, 2019 at 6:58 AM Jozef Vilcek
    <jozo.vil...@gmail.com> wrote:
 >>
 >> It was not a portable Flink runner.
 >>
 >> Thanks all for the thoughts, I will create JIRAs, as
    suggested, with my findings and send them out
 >>
 >> On Tue, Apr 30, 2019 at 11:34 AM Reuven Lax
    <re...@google.com> wrote:
 >>>
 >>> Jozef did you use the portable Flink runner or the old one?
 >>>
 >>> Reuven
 >>>
 >>> On Tue, Apr 30, 2019 at 1:03 AM Robert Bradshaw
    <rober...@google.com> wrote:
 
  Thanks for starting this investigation. As mentioned, most
    of the work
  to date has been on feature parity, not performance
    parity, but we're
  at the point that the latter should be tackled as well.
    Even if there
  is a slight overhead (and there's talk about integrating
    more deeply
  with the Flume DAG that could elide even that) I'd expect
    it should be
  nowhere near the 3x that you're seeing. Aside from the
    timer issue,
  sounds like the cloning via coders is is a huge drag that
    needs to be
 

Re: [DISCUSS] Performance of Beam compare to "Bare Runner"

2019-05-03 Thread Robert Bradshaw
On Fri, May 3, 2019 at 9:29 AM Viliam Durina  wrote:
>
> > you MUST NOT mutate your inputs
> I think it's enough to not mutate the inputs after you emit them. From this 
> follows that when you receive an input, the upstream vertex will not try to 
> mutate it in parallel. This is what Hazelcast Jet expects. We have no option 
> to automatically clone objects after each step.

There's also the case of sibling fusion. E.g. if your graph looks like

   ---> B
 /
A
 \
   ---> C

which all gets fused together, then both B and C are applied to each
output of A, which means it is not safe for B and C to mutate their
inputs lest its sibling (whichever is applied second) see this
mutation.
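To make the hazard concrete, here is a small plain-Java sketch (hypothetical names, not Beam code) of what fused siblings observe when one of them mutates the shared input:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// With sibling fusion, B and C are both handed the very same element
// instance produced by A (no defensive copy). If B mutates it, C sees
// the mutation -- hence Beam's rule that DoFns must not mutate inputs.
public class SiblingFusionHazard {
    public static List<String> deliverFused(
            List<Consumer<List<String>>> fusedSiblings, List<String> element) {
        for (Consumer<List<String>> sibling : fusedSiblings) {
            sibling.accept(element); // same instance each time, no copy
        }
        return element;
    }

    public static void main(String[] args) {
        List<String> element = new ArrayList<>(Arrays.asList("from-A"));
        Consumer<List<String>> b = e -> e.add("mutated-by-B"); // violates the contract
        Consumer<List<String>> c = e -> System.out.println("C sees: " + e);
        deliverFused(Arrays.asList(b, c), element);
    }
}
```

A runner that does insert copies between siblings masks the bug at the cost of the serialization roundtrip discussed elsewhere in this thread.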

> On Thu, 2 May 2019 at 20:01, Maximilian Michels  wrote:
>>
>> > I am not sure what are you referring to here. What do you mean Kryo is 
>> > simply slower ... Beam Kryo or Flink Kryo or?
>>
>> Flink uses Kryo as a fallback serializer when its own type serialization
>> system can't analyze the type. I'm just guessing here that this could be
>> slower.
>>
>> On 02.05.19 16:51, Jozef Vilcek wrote:
>> >
>> >
>> > On Thu, May 2, 2019 at 3:41 PM Maximilian Michels wrote:
>> >
>> > Thanks for the JIRA issues Jozef!
>> >
>> >  > So the feature in Flink is operator chaining and Flink per
>> > default initiate copy of input elements. In case of Beam coders copy
>> > seems to be more noticable than native Flink.
>> >
>> > Copying between chained operators can be turned off in the
>> > FlinkPipelineOptions (if you know what you're doing).
>> >
>> >
>> > Yes, I know that it can be instructed to reuse objects (if you are
>> > referring to this). I am just not sure I want to open this door in
>> > general :)
>> > But it is interesting to learn that, with portability, this will be
>> > turned on by default. Quite an important finding imho.
>> >
>> > Beam coders should
>> > not be slower than Flink's. They are simply wrapped. It seems Kryo is
>> > simply slower, which we could fix by providing more type hints to Flink.
>> >
>> >
>> > I am not sure what are you referring to here. What do you mean Kryo is
>> > simply slower ... Beam Kryo or Flink Kryo or?
>> >
>> > -Max
>> >
>> > On 02.05.19 13:15, Robert Bradshaw wrote:
>> >  > Thanks for filing those.
>> >  >
>> >  > As for how not doing a copy is "safe," it's not really. Beam simply
>> >  > asserts that you MUST NOT mutate your inputs (and direct runners,
>> >  > which are used during testing, do perform extra copies and checks to
>> >  > catch violations of this requirement).
>> >  >
>> >  > On Thu, May 2, 2019 at 1:02 PM Jozef Vilcek
>> > mailto:jozo.vil...@gmail.com>> wrote:
>> >  >>
>> >  >> I have created
>> >  >> https://issues.apache.org/jira/browse/BEAM-7204
>> >  >> https://issues.apache.org/jira/browse/BEAM-7206
>> >  >>
>> >  >> to track these topics further
>> >  >>
>> >  >> On Wed, May 1, 2019 at 1:24 PM Jozef Vilcek
>> > mailto:jozo.vil...@gmail.com>> wrote:
>> >  >>>
>> >  >>>
>> >  >>>
>> >  >>> On Tue, Apr 30, 2019 at 5:42 PM Kenneth Knowles
>> > mailto:k...@apache.org>> wrote:
>> >  
>> >  
>> >  
>> >   On Tue, Apr 30, 2019, 07:05 Reuven Lax wrote:
>> >  >
>> >  > In that case, Robert's point is quite valid. The old Flink
>> > runner I believe had no knowledge of fusion, which was known to make
>> > it extremely slow. A lot of work went into making the portable
>> > runner fusion aware, so we don't need to round trip through coders
>> > on every ParDo.
>> >  
>> >  
>> >   The old Flink runner got fusion for free, since Flink does it.
>> > The new fusion in portability is because fusing the runner side of
>> > portability steps does not achieve real fusion
>> >  >>>
>> >  >>>
>> >  >>> Aha, I see. So the feature in Flink is operator chaining, and
>> > Flink by default makes a copy of input elements. In the case of Beam,
>> > the coder copy seems to be more noticeable than in native Flink.
>> >  >>> So do I get it right that in the portable runner scenario, you do
>> > similar chaining via this "fusion of stages"? Curious here... how is
>> > it different from chaining, so that the runner can be sure that not
>> > doing a copy is "safe" with respect to user-defined functions and their
>> > behaviour over inputs?
>> >  >>>
>> >  >
>> >  >
>> >  > Reuven
>> >  >
>> >  > On Tue, Apr 30, 2019 at 6:58 AM Jozef Vilcek
>> > mailto:jozo.vil...@gmail.com>> wrote:
>> >  >>
>> >  >> It was not a portable Flink runner.
>> >  >>
>> >  >> Thanks all for the thoughts, I will create JIRAs, as
>> > suggested, with my findings and send them out
>> >  >>
>> >  >> On Tue, Apr 30, 2019 at 11:34 

Re: [DISCUSS] Performance of Beam compare to "Bare Runner"

2019-05-03 Thread Viliam Durina
> you MUST NOT mutate your inputs
I think it's enough to not mutate the inputs after you emit them. From this
it follows that when you receive an input, the upstream vertex will not try
to mutate it in parallel. This is what Hazelcast Jet expects. We have no
option to automatically clone objects after each step.
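To make the mutation rule concrete — a minimal sketch in plain Java, not code from Beam or any runner. It simulates a runner that passes elements downstream by reference instead of copying: mutating an element after emitting it changes what the downstream step observes, which is exactly why a runner that skips copies must forbid such mutation.

```java
import java.util.ArrayList;
import java.util.List;

public class MutationHazard {
    // Simulates a stage that emits an element and then mutates it,
    // under a runner that passes references without a defensive copy.
    static String observedWithoutCopy() {
        List<StringBuilder> downstream = new ArrayList<>();
        StringBuilder element = new StringBuilder("original");
        downstream.add(element);     // "emit" by reference -- no copy made
        element.append("-mutated");  // violates "don't mutate after emit"
        return downstream.get(0).toString();
    }

    // Same scenario, but the runner copies on emit (as Flink does by
    // default between chained operators), isolating the two stages.
    static String observedWithCopy() {
        List<StringBuilder> downstream = new ArrayList<>();
        StringBuilder element = new StringBuilder("original");
        downstream.add(new StringBuilder(element)); // defensive copy
        element.append("-mutated");
        return downstream.get(0).toString();
    }

    public static void main(String[] args) {
        System.out.println(observedWithoutCopy()); // prints "original-mutated"
        System.out.println(observedWithCopy());    // prints "original"
    }
}
```

The copy buys isolation at the cost of an extra (de)serialization or clone per element, which is the overhead being discussed in this thread.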

Viliam

On Thu, 2 May 2019 at 20:01, Maximilian Michels  wrote:

> > I am not sure what you are referring to here. What do you mean Kryo is
> simply slower ... Beam Kryo or Flink Kryo or?
>
> Flink uses Kryo as a fallback serializer when its own type serialization
> system can't analyze the type. I'm just guessing here that this could be
> slower.
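As background, a hedged sketch of the type-hint fix mentioned above: in plain Flink, the Kryo fallback for a lambda whose result type is erased can be avoided by giving the type system an explicit hint via `returns(...)`. This assumes Flink's DataStream API as documented (`Types.TUPLE`, `Tuple2.of`); verify against the Flink version in use.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TypeHintExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> words = env.fromElements("beam", "flink");

        // Due to erasure, Flink cannot analyze the lambda's Tuple2 result
        // type on its own. The explicit hint lets it use its own tuple
        // serializer instead of falling back to a generic (Kryo) one.
        DataStream<Tuple2<String, Integer>> counts = words
            .map(w -> Tuple2.of(w, 1))
            .returns(Types.TUPLE(Types.STRING, Types.INT));

        counts.print();
        env.execute("type-hint-example");
    }
}
```

Calling `env.getConfig().disableGenericTypes()` during development makes accidental Kryo fallbacks fail fast instead of silently costing performance.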
>
> On 02.05.19 16:51, Jozef Vilcek wrote:
> >
> >
> > On Thu, May 2, 2019 at 3:41 PM Maximilian Michels wrote:
> >
> > Thanks for the JIRA issues Jozef!
> >
> >  > So the feature in Flink is operator chaining, and Flink by
> > default makes a copy of input elements. In the case of Beam, the coder
> > copy seems to be more noticeable than in native Flink.
> >
> > Copying between chained operators can be turned off in the
> > FlinkPipelineOptions (if you know what you're doing).
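A hedged sketch of the switch being referred to — assuming it is the `objectReuse` flag on the Beam Flink runner's `FlinkPipelineOptions` (verify the option name against the runner version you use):

```java
import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class ObjectReuseExample {
    public static void main(String[] args) {
        FlinkPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).as(FlinkPipelineOptions.class);
        options.setRunner(FlinkRunner.class);
        // Skip defensive copies between chained operators. Only safe if no
        // user function mutates elements after emitting (or after receiving)
        // them -- "if you know what you're doing", as Max says.
        options.setObjectReuse(true);
        Pipeline pipeline = Pipeline.create(options);
        // ... build and run the pipeline ...
    }
}
```

The same flag should be settable from the command line as `--objectReuse=true`.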
> >
> >
> > Yes, I know that it can be instructed to reuse objects (if you are
> > referring to this). I am just not sure I want to open this door in
> > general :)
> > But it is interesting to learn that with portability, this will be
> > turned on by default. Quite an important finding, imho.
> >
> > Beam coders should
> > not be slower than Flink's. They are simply wrapped. It seems Kryo is
> > simply slower, which we could fix by providing more type hints to
> Flink.
> >
> >
> > I am not sure what you are referring to here. What do you mean Kryo is
> > simply slower ... Beam Kryo or Flink Kryo or?
> >
> > -Max
> >
> > On 02.05.19 13:15, Robert Bradshaw wrote:
> >  > Thanks for filing those.
> >  >
> >  > As for how not doing a copy is "safe," it's not really. Beam
> simply
> >  > asserts that you MUST NOT mutate your inputs (and direct runners,
> >  > which are used during testing, do perform extra copies and checks
> to
> >  > catch violations of this requirement).
> >  >
> >  > On Thu, May 2, 2019 at 1:02 PM Jozef Vilcek
> > mailto:jozo.vil...@gmail.com>> wrote:
> >  >>
> >  >> I have created
> >  >> https://issues.apache.org/jira/browse/BEAM-7204
> >  >> https://issues.apache.org/jira/browse/BEAM-7206
> >  >>
> >  >> to track these topics further
> >  >>
> >  >> On Wed, May 1, 2019 at 1:24 PM Jozef Vilcek
> > mailto:jozo.vil...@gmail.com>> wrote:
> >  >>>
> >  >>>
> >  >>>
> >  >>> On Tue, Apr 30, 2019 at 5:42 PM Kenneth Knowles
> > mailto:k...@apache.org>> wrote:
> >  
> >  
> >  
> >   On Tue, Apr 30, 2019, 07:05 Reuven Lax wrote:
> >  >
> >  > In that case, Robert's point is quite valid. The old Flink
> > runner I believe had no knowledge of fusion, which was known to make
> > it extremely slow. A lot of work went into making the portable
> > runner fusion aware, so we don't need to round trip through coders
> > on every ParDo.
> >  
> >  
> >   The old Flink runner got fusion for free, since Flink does it.
> > The new fusion in portability is because fusing the runner side of
> > portability steps does not achieve real fusion
> >  >>>
> >  >>>
> >  >>> Aha, I see. So the feature in Flink is operator chaining, and
> > Flink by default makes a copy of input elements. In the case of Beam,
> > the coder copy seems to be more noticeable than in native Flink.
> >  >>> So do I get it right that in the portable runner scenario, you do
> > similar chaining via this "fusion of stages"? Curious here... how is
> > it different from chaining, so that the runner can be sure that not
> > doing a copy is "safe" with respect to user-defined functions and their
> > behaviour over inputs?
> >  >>>
> >  >
> >  >
> >  > Reuven
> >  >
> >  > On Tue, Apr 30, 2019 at 6:58 AM Jozef Vilcek
> > mailto:jozo.vil...@gmail.com>> wrote:
> >  >>
> >  >> It was not a portable Flink runner.
> >  >>
> >  >> Thanks all for the thoughts, I will create JIRAs, as
> > suggested, with my findings and send them out
> >  >>
> >  >> On Tue, Apr 30, 2019 at 11:34 AM Reuven Lax
> > mailto:re...@google.com>> wrote:
> >  >>>
> >  >>> Jozef did you use the portable Flink runner or the old one?
> >  >>>
> >  >>> Reuven
> >  >>>
> >  >>> On Tue, Apr 30, 2019 at 1:03 AM Robert Bradshaw
> > mailto:rober...@google.com>> wrote:
> >  
> >   Thanks for starting this investigation. As mentioned, most
> > of the work
> >   to date has been on feature