Re: Contributing Beam Kata (Java & Python)

2019-05-15 Thread Henry Suryawirawan
Hi Austin,

Yes, I agree with you that we need to come up with a publishing process.
We should publish to Stepik only if it has been merged to master.
I can add more authors & instructors to the courses on Stepik so that more
people can update the courses.
I uploaded the initial version to Stepik in order to get the course ID
(which is also committed to the repo) and also to easily show it to other
people who would like to give it a try.
The last time I ran a training using it, there were setup issues and people
had difficulty following the instructions.

Another thing that we should think about is how to integrate the Kata
projects into the CI build process.
The Java Kata is based on Gradle, while for the Python one I'm
currently checking with the JetBrains team whether there is a good way of
executing the tests, because the current way the tests are written is
non-standard.
Can anyone advise on how best to integrate the Kata projects into the
CI?


*From: *Austin Bennett 
*Date: *Wed, May 15, 2019 at 11:49 PM
*To: *dev

Stepik: should we be thinking about release versions for what gets uploaded
> there?  If the point of merging was to encourage additional contributions,
> then should we also have a mechanism for publishing the updates there?  In
> that case, uploading to Stepik should be part of publishing releases?
>
> On Tue, May 14, 2019 at 10:51 PM hsuryawira...@google.com <
> hsuryawira...@google.com> wrote:
>
>> Thanks for merging it Reuven!
>>
>> Quick question: would it be useful if we write a blog post on the Kata so
>> that we can build more awareness and get people to try it out?
>> I've also uploaded the course to Stepik which has seamless integration
>> within the IDE for people to easily start the course.
>>
>> On 2019/05/14 20:35:07, Reuven Lax  wrote:
>> > Merged
>> >
>> > *From: *Reza Rokni 
>> > *Date: *Tue, May 14, 2019 at 1:29 PM
>> > *To: * 
>> > *Cc: *Lars Francke
>> >
>> > +1 :-)
>> > >
>> > > *From: *Lukasz Cwik 
>> > > *Date: *Wed, 15 May 2019 at 04:29
>> > > *To: *dev
>> > > *Cc: *Lars Francke
>> > >
>> > > +1
>> > >>
>> > >> *From: *Pablo Estrada 
>> > >> *Date: *Tue, May 14, 2019 at 1:27 PM
>> > >> *To: *dev
>> > >> *Cc: *Lars Francke
>> > >>
>> > >> +1 on merging.
>> > >>>
>> > >>> *From: *Reuven Lax 
>> > >>> *Date: *Tue, May 14, 2019 at 1:23 PM
>> > >>> *To: *dev
>> > >>> *Cc: *Lars Francke
>> > >>>
>> > >>> I've been playing around with this the past day or two, and it's
>> >  great! I'm inclined to merge this PR (if nobody objects) so that
>> others in
>> >  the community can contribute more training katas.
>> > 
>> >  Reuven
>> > 
>> >  *From: *Ismaël Mejía 
>> >  *Date: *Tue, Apr 23, 2019 at 6:43 AM
>> >  *To: *Lars Francke
>> >  *Cc: * 
>> > 
>> >  Thanks for answering Lars,
>> > >
>> > > The 'interesting' part is that the tutorial has a full IDE
>> integrated
>> > > experience based on the Jetbrains edu platform [1]. So maybe
>> > > interesting to see if it could make sense to have projects like
>> this
>> > > in the new trainings incubator project or if they became too
>> platform
>> > > constrained.
>> > >
>> > > This contribution is valuable for Beam but the community may
>> decide
>> > > that it makes sense for it to live at some moment at the trainings
>> > > project. I suppose also Henry could be interested in taking a
>> look at
>> > > this [2].
>> > >
>> > > [1] https://www.jetbrains.com/education/
>> > > [2] https://incubator.apache.org/clutch/training.html
>> > >
>> > > On Tue, Apr 23, 2019 at 3:00 PM Lars Francke <
>> lars.fran...@gmail.com>
>> > > wrote:
>> > > >
>> > > > Thanks Ismaël.
>> > > >
>> > > > I must admit I'm a tad confused. What has JetBrains got to do
>> with
>> > > this?
>> > > > This looks pretty cool and specific to Beam though, or is this
>> more
>> > > generic?
>> > > > But yeah something along those lines could be interesting for
>> > > hands-on type things in training.
>> > > >
>> > > > On Fri, Apr 19, 2019 at 12:10 PM Ismaël Mejía <
>> ieme...@gmail.com>
>> > > wrote:
>> > > >>
>> > > >> +lars.fran...@gmail.com who is in the Apache training project
>> and
>> > > may
>> > > >> be interested in this one or at least the JetBrains like
>> approach.
>> > > >>
>> > > >> On Fri, Apr 19, 2019 at 12:01 PM Ismaël Mejía <
>> ieme...@gmail.com>
>> > > wrote:
>> > > >> >
>> > > >> > This looks great, nice for bringing this to the project
>> Henry!
>> > > >> >
>> > > >> > On Fri, Apr 19, 2019 at 10:53 AM hsuryawira...@google.com
>> > > >> >  wrote:
>> > > >> > >
>> > > >> > > Thanks Altay.
>> > > >> > > I'll create it under "learning/" first as this is not
>> exactly
>> > > an example.
>> > > >> > > Please do let me know if it's not the right place.
>> > > >> > >
>> > > >> > > On 2019/04/18 22:49:47, Ahmet 

Re: Contributing Beam Kata (Java & Python)

2019-05-15 Thread Henry Suryawirawan
Sure, let me try to write one.
I've created a JIRA for it: https://issues.apache.org/jira/browse/BEAM-7332




*From: *Pablo Estrada 
*Date: *Thu, May 16, 2019 at 2:20 AM
*To: *dev

I think a blog post is a great idea. Would you be able to write one?
> See here for an example of a PR adding one:
> https://github.com/apache/beam/pull/8391
>
> Best
> -P.
>
> *From: *hsuryawira...@google.com 
> *Date: *Wed, May 15, 2019, 12:51 AM
> *To: * 
>
> Thanks for merging it Reuven!
>>
>> Quick question: would it be useful if we write a blog post on the Kata so
>> that we can build more awareness and get people to try it out?
>> I've also uploaded the course to Stepik which has seamless integration
>> within the IDE for people to easily start the course.
>>
>> On 2019/05/14 20:35:07, Reuven Lax  wrote:
>> > Merged
>> >
>> > *From: *Reza Rokni 
>> > *Date: *Tue, May 14, 2019 at 1:29 PM
>> > *To: * 
>> > *Cc: *Lars Francke
>> >
>> > +1 :-)
>> > >
>> > > *From: *Lukasz Cwik 
>> > > *Date: *Wed, 15 May 2019 at 04:29
>> > > *To: *dev
>> > > *Cc: *Lars Francke
>> > >
>> > > +1
>> > >>
>> > >> *From: *Pablo Estrada 
>> > >> *Date: *Tue, May 14, 2019 at 1:27 PM
>> > >> *To: *dev
>> > >> *Cc: *Lars Francke
>> > >>
>> > >> +1 on merging.
>> > >>>
>> > >>> *From: *Reuven Lax 
>> > >>> *Date: *Tue, May 14, 2019 at 1:23 PM
>> > >>> *To: *dev
>> > >>> *Cc: *Lars Francke
>> > >>>
>> > >>> I've been playing around with this the past day or two, and it's
>> >  great! I'm inclined to merge this PR (if nobody objects) so that
>> others in
>> >  the community can contribute more training katas.
>> > 
>> >  Reuven
>> > 
>> >  *From: *Ismaël Mejía 
>> >  *Date: *Tue, Apr 23, 2019 at 6:43 AM
>> >  *To: *Lars Francke
>> >  *Cc: * 
>> > 
>> >  Thanks for answering Lars,
>> > >
>> > > The 'interesting' part is that the tutorial has a full IDE
>> integrated
>> > > experience based on the Jetbrains edu platform [1]. So maybe
>> > > interesting to see if it could make sense to have projects like
>> this
>> > > in the new trainings incubator project or if they became too
>> platform
>> > > constrained.
>> > >
>> > > This contribution is valuable for Beam but the community may
>> decide
>> > > that it makes sense for it to live at some moment at the trainings
>> > > project. I suppose also Henry could be interested in taking a
>> look at
>> > > this [2].
>> > >
>> > > [1] https://www.jetbrains.com/education/
>> > > [2] https://incubator.apache.org/clutch/training.html
>> > >
>> > > On Tue, Apr 23, 2019 at 3:00 PM Lars Francke <
>> lars.fran...@gmail.com>
>> > > wrote:
>> > > >
>> > > > Thanks Ismaël.
>> > > >
>> > > > I must admit I'm a tad confused. What has JetBrains got to do
>> with
>> > > this?
>> > > > This looks pretty cool and specific to Beam though, or is this
>> more
>> > > generic?
>> > > > But yeah something along those lines could be interesting for
>> > > hands-on type things in training.
>> > > >
>> > > > On Fri, Apr 19, 2019 at 12:10 PM Ismaël Mejía <
>> ieme...@gmail.com>
>> > > wrote:
>> > > >>
>> > > >> +lars.fran...@gmail.com who is in the Apache training project
>> and
>> > > may
>> > > >> be interested in this one or at least the JetBrains like
>> approach.
>> > > >>
>> > > >> On Fri, Apr 19, 2019 at 12:01 PM Ismaël Mejía <
>> ieme...@gmail.com>
>> > > wrote:
>> > > >> >
>> > > >> > This looks great, nice for bringing this to the project
>> Henry!
>> > > >> >
>> > > >> > On Fri, Apr 19, 2019 at 10:53 AM hsuryawira...@google.com
>> > > >> >  wrote:
>> > > >> > >
>> > > >> > > Thanks Altay.
>> > > >> > > I'll create it under "learning/" first as this is not
>> exactly
>> > > an example.
>> > > >> > > Please do let me know if it's not the right place.
>> > > >> > >
>> > > >> > > On 2019/04/18 22:49:47, Ahmet Altay 
>> wrote:
>> > > >> > > > This looks great.
>> > > >> > > >
>> > > >> > > > +David Cavazos  was working on
>> > > interactive colab based
>> > > >> > > > examples (https://github.com/apache/beam/pull/7679)
>> perhaps
>> > > we can have a
>> > > >> > > > shared place for these two similar things.
>> > > >> > > >
>> > > >> > >
>> > >
>> > 
>> > >

Re: [ANNOUNCE] New PMC Member: Pablo Estrada

2019-05-15 Thread Rose Nguyen
Congrats, Pablo!!

On Wed, May 15, 2019 at 4:43 PM Heejong Lee  wrote:

> Congratulations!
>
> On Wed, May 15, 2019 at 12:24 PM Niklas Hansson <
> niklas.sven.hans...@gmail.com> wrote:
>
>> Congratulations Pablo :)
>>
>> Den ons 15 maj 2019 kl 21:21 skrev Ruoyun Huang :
>>
>>> Congratulations, Pablo!
>>>
>>> *From: *Charles Chen 
>>> *Date: *Wed, May 15, 2019 at 11:04 AM
>>> *To: *dev
>>>
>>> Congrats Pablo and thank you for your contributions!

 On Wed, May 15, 2019, 10:53 AM Valentyn Tymofieiev 
 wrote:

> Congrats, Pablo!
>
> On Wed, May 15, 2019 at 10:41 AM Yifan Zou 
> wrote:
>
>> Congratulations, Pablo!
>>
>> *From: *Maximilian Michels 
>> *Date: *Wed, May 15, 2019 at 2:06 AM
>> *To: * 
>>
>> Congrats Pablo! Thank you for your help to grow the Beam community!
>>>
>>> On 15.05.19 10:33, Tim Robertson wrote:
>>> > Congratulations Pablo
>>> >
>>> > On Wed, May 15, 2019 at 10:22 AM Ismaël Mejía >> > > wrote:
>>> >
>>> > Congrats Pablo, well deserved, nice to see your work
>>> recognized!
>>> >
>>> > On Wed, May 15, 2019 at 9:59 AM Pei HE >> > > wrote:
>>> >  >
>>> >  > Congrats, Pablo!
>>> >  >
>>> >  > On Tue, May 14, 2019 at 11:41 PM Tanay Tummalapalli
>>> >  > mailto:ttanay.apa...@gmail.com>>
>>> wrote:
>>> >  > >
>>> >  > > Congratulations Pablo!
>>> >  > >
>>> >  > > On Wed, May 15, 2019, 12:08 Michael Luckey <
>>> adude3...@gmail.com
>>> > > wrote:
>>> >  > >>
>>> >  > >> Congrats, Pablo!
>>> >  > >>
>>> >  > >> On Wed, May 15, 2019 at 8:21 AM Connell O'Callaghan
>>> > mailto:conne...@google.com>> wrote:
>>> >  > >>>
>>> >  > >>> Awesome well done Pablo!!!
>>> >  > >>>
>>> >  > >>> Kenn thank you for sharing this great news with us!!!
>>> >  > >>>
>>> >  > >>> On Tue, May 14, 2019 at 11:01 PM Ahmet Altay
>>> > mailto:al...@google.com>> wrote:
>>> >  > 
>>> >  >  Congratulations!
>>> >  > 
>>> >  >  On Tue, May 14, 2019 at 9:11 PM Robert Burke
>>> > mailto:rob...@frantil.com>> wrote:
>>> >  > >
>>> >  > > Woohoo! Well deserved.
>>> >  > >
>>> >  > > On Tue, May 14, 2019, 8:34 PM Reuven Lax <
>>> re...@google.com
>>> > > wrote:
>>> >  > >>
>>> >  > >> Congratulations!
>>> >  > >>
>>> >  > >> From: Mikhail Gryzykhin >> > >
>>> >  > >> Date: Tue, May 14, 2019 at 8:32 PM
>>> >  > >> To: mailto:dev@beam.apache.org
>>> >>
>>> >  > >>
>>> >  > >>> Congratulations Pablo!
>>> >  > >>>
>>> >  > >>> On Tue, May 14, 2019, 20:25 Kenneth Knowles
>>> > mailto:k...@apache.org>> wrote:
>>> >  > 
>>> >  >  Hi all,
>>> >  > 
>>> >  >  Please join me and the rest of the Beam PMC in
>>> welcoming
>>> > Pablo Estrada to join the PMC.
>>> >  > 
>>> >  >  Pablo first picked up BEAM-722 in October of 2016
>>> and
>>> > has been a steady part of the Beam community since then. In
>>> addition
>>> > to technical work on Beam Python & Java & runners, I would
>>> highlight
>>> > how Pablo grows Beam's community by helping users, working on
>>> GSoC,
>>> > giving talks at Beam Summits and other OSS conferences
>>> including
>>> > Flink Forward, and holding training workshops. I cannot do
>>> justice
>>> > to Pablo's contributions in a single paragraph.
>>> >  > 
>>> >  >  Thanks Pablo, for being a part of Beam.
>>> >  > 
>>> >  >  Kenn
>>> >
>>>
>>
>>>
>>> --
>>> 
>>> Ruoyun  Huang
>>>
>>>

-- 
Rose Thị Nguyễn


Re: [ANNOUNCE] New PMC Member: Pablo Estrada

2019-05-15 Thread Heejong Lee
Congratulations!

On Wed, May 15, 2019 at 12:24 PM Niklas Hansson <
niklas.sven.hans...@gmail.com> wrote:

> Congratulations Pablo :)
>
> Den ons 15 maj 2019 kl 21:21 skrev Ruoyun Huang :
>
>> Congratulations, Pablo!
>>
>> *From: *Charles Chen 
>> *Date: *Wed, May 15, 2019 at 11:04 AM
>> *To: *dev
>>
>> Congrats Pablo and thank you for your contributions!
>>>
>>> On Wed, May 15, 2019, 10:53 AM Valentyn Tymofieiev 
>>> wrote:
>>>
 Congrats, Pablo!

 On Wed, May 15, 2019 at 10:41 AM Yifan Zou  wrote:

> Congratulations, Pablo!
>
> *From: *Maximilian Michels 
> *Date: *Wed, May 15, 2019 at 2:06 AM
> *To: * 
>
> Congrats Pablo! Thank you for your help to grow the Beam community!
>>
>> On 15.05.19 10:33, Tim Robertson wrote:
>> > Congratulations Pablo
>> >
>> > On Wed, May 15, 2019 at 10:22 AM Ismaël Mejía > > > wrote:
>> >
>> > Congrats Pablo, well deserved, nice to see your work recognized!
>> >
>> > On Wed, May 15, 2019 at 9:59 AM Pei HE > > > wrote:
>> >  >
>> >  > Congrats, Pablo!
>> >  >
>> >  > On Tue, May 14, 2019 at 11:41 PM Tanay Tummalapalli
>> >  > mailto:ttanay.apa...@gmail.com>>
>> wrote:
>> >  > >
>> >  > > Congratulations Pablo!
>> >  > >
>> >  > > On Wed, May 15, 2019, 12:08 Michael Luckey <
>> adude3...@gmail.com
>> > > wrote:
>> >  > >>
>> >  > >> Congrats, Pablo!
>> >  > >>
>> >  > >> On Wed, May 15, 2019 at 8:21 AM Connell O'Callaghan
>> > mailto:conne...@google.com>> wrote:
>> >  > >>>
>> >  > >>> Awesome well done Pablo!!!
>> >  > >>>
>> >  > >>> Kenn thank you for sharing this great news with us!!!
>> >  > >>>
>> >  > >>> On Tue, May 14, 2019 at 11:01 PM Ahmet Altay
>> > mailto:al...@google.com>> wrote:
>> >  > 
>> >  >  Congratulations!
>> >  > 
>> >  >  On Tue, May 14, 2019 at 9:11 PM Robert Burke
>> > mailto:rob...@frantil.com>> wrote:
>> >  > >
>> >  > > Woohoo! Well deserved.
>> >  > >
>> >  > > On Tue, May 14, 2019, 8:34 PM Reuven Lax <
>> re...@google.com
>> > > wrote:
>> >  > >>
>> >  > >> Congratulations!
>> >  > >>
>> >  > >> From: Mikhail Gryzykhin > > >
>> >  > >> Date: Tue, May 14, 2019 at 8:32 PM
>> >  > >> To: mailto:dev@beam.apache.org
>> >>
>> >  > >>
>> >  > >>> Congratulations Pablo!
>> >  > >>>
>> >  > >>> On Tue, May 14, 2019, 20:25 Kenneth Knowles
>> > mailto:k...@apache.org>> wrote:
>> >  > 
>> >  >  Hi all,
>> >  > 
>> >  >  Please join me and the rest of the Beam PMC in
>> welcoming
>> > Pablo Estrada to join the PMC.
>> >  > 
>> >  >  Pablo first picked up BEAM-722 in October of 2016
>> and
>> > has been a steady part of the Beam community since then. In
>> addition
>> > to technical work on Beam Python & Java & runners, I would
>> highlight
>> > how Pablo grows Beam's community by helping users, working on
>> GSoC,
>> > giving talks at Beam Summits and other OSS conferences including
>> > Flink Forward, and holding training workshops. I cannot do
>> justice
>> > to Pablo's contributions in a single paragraph.
>> >  > 
>> >  >  Thanks Pablo, for being a part of Beam.
>> >  > 
>> >  >  Kenn
>> >
>>
>
>>
>> --
>> 
>> Ruoyun  Huang
>>
>>


Re: Do we maintain offline artifact version in javadocs sdks/java/javadoc/build.gradle

2019-05-15 Thread Lukasz Cwik
I opened up https://github.com/apache/beam/pull/8588 with the changes
listed above.

On Mon, May 13, 2019 at 5:17 PM Ankur Goenka  wrote:

> Given that this simplifies the release process and keeps the javadocs up to
> date, IMO this looks to be a good tradeoff.
>
> *From: *Lukasz Cwik 
> *Date: *Mon, May 13, 2019 at 5:09 PM
> *To: *dev
>
> While I was looking for the latest versions of docs, I found
>> http://javadoc.io. It fetches the javadoc from Maven central and unpacks
>> that jar, displaying its contents to users. This means that we could make
>> all our non-Apache-Beam javadoc links go to javadoc.io instead of trying
>> to find the official project website that maintains them (sometimes
>> there isn't one, or it only has the javadoc for the latest version).
>>
>> Has anyone had experience using javadoc.io in the past?
>> Would there be any concerns about swapping to use javadoc.io instead of
>> the official versions hosted on project pages?
>>
>> I have an example commit here:
>> https://github.com/lukecwik/incubator-beam/commit/94a97fbc83883496feae071cc44689f5fb2f5743
>> You can generate the aggregate javadoc via "./gradlew -p
>> sdks/java/javadoc aggregateJavadoc" which builds
>> "./sdks/java/javadoc/build/docs/javadoc/index.html"
>>
>> If people are happy with javadoc.io, should we migrate from using
>> linksOffline to links so we don't have to maintain the package lists in
>> https://github.com/apache/beam/blob/master/sdks/java/javadoc/?
>> This would mean that we would be able to just enumerate all the
>> dependencies we have in Apache Beam and generate all the javadoc without
>> maintaining a list of packages or dependencies. It would mean that you
>> would need to have an internet connection to build the aggregated javadoc
>> because the javadoc tool would need to fetch the package-list files from
>> javadoc.io. The delta for that change is
>> https://github.com/lukecwik/incubator-beam/commit/8cc7c53139d0eecad0ec994b9a313cf31645
>>
>> From a Javadoc correctness and maintenance point of view, this seems much
>> simpler overall to me.
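
For reference, the two linking modes being weighed here look roughly like this in a Gradle javadoc task. This is a hedged sketch, not Beam's actual build file: the task name, URLs, and the local package-list path are invented for illustration; only the `links`/`linksOffline` option names come from Gradle's standard javadoc options.

```groovy
// Sketch only: aggregated-javadoc external linking, with illustrative URLs/paths.
task aggregateJavadoc(type: Javadoc) {
    options {
        // Online linking: javadoc fetches each package-list from javadoc.io
        // at build time, so no local copies need to be maintained -- but the
        // build then needs a network connection.
        links 'https://javadoc.io/doc/joda-time/joda-time/2.10.1/'

        // Offline linking: the package-list is read from a checked-in local
        // copy, so the build is hermetic, at the cost of hand-maintaining
        // the lists (the situation described above).
        linksOffline 'https://www.joda.org/joda-time/apidocs/',
                     'sdks/java/javadoc/joda-time-docs'
    }
}
```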
>>
>>
>> *From: *Lukasz Cwik 
>> *Date: *Mon, May 13, 2019 at 1:39 PM
>> *To: *dev
>>
>> I see. We should be able to fix that to do what we do when we embed the
>>> versions of dependencies in our Maven archetypes like so[1]:
>>> dependencies.create(project.library.java.google_api_client).getVersion()
>>>
>>> I'll send out a PR updating the javadoc pulling to be based off the
>>> version and open up a PR.
>>>
>>> 1:
>>> https://github.com/apache/beam/blob/abece47cc1c1c88a519e54e67a2d358b439cf69c/sdks/java/maven-archetypes/examples/build.gradle#L29
>>>
>>> *From: *Kenneth Knowles 
>>> *Date: *Mon, May 13, 2019 at 11:57 AM
>>> *To: *dev
>>>
>>> I expect Ankur is referring to the hardcoded linkOffline bits here:
 https://github.com/apache/beam/blob/master/sdks/java/javadoc/build.gradle#L78
  since
 the versions are in the URLs, and also the downloaded files used are from
 those versions. This helps with flakiness, since otherwise it has to
 download stuff to figure out which identifiers are linkable.

 Kenn

 *From: *Lukasz Cwik 
 *Date: *Mon, May 13, 2019 at 9:04 AM
 *To: *dev

 What is the difference between the two files you are referring to?
>
> Note that sdks/java/javadoc/build.gradle is meant to produce one giant
> javadoc across many modules that users would be interested in
> (core/extensions/io/...) meant to be published on the website.
>
> *From: *Ankur Goenka 
> *Date: *Fri, May 10, 2019 at 5:21 PM
> *To: *dev
>
> Hi,
>>
>> I see that sdks/java/javadoc/build.gradle is not in sync with
>> org/apache/beam/gradle/BeamModulePlugin.groovy.
>> I wanted to check whether we are maintaining it; based on that, we can
>> either remove or update sdks/java/javadoc/build.gradle.
>>
>> Thanks,
>> Ankur
>>
>


Re: PardoLifeCycle: Teardown after failed call to setup

2019-05-15 Thread Lukasz Cwik
Terminating the control channel could be that signal; environment shutdown
could also serve as that signal.

On Wed, May 15, 2019 at 7:19 AM Robert Bradshaw  wrote:

> The only signal we have is that the runner terminates the control
> channel. It might make sense to make this more explicit. (This'd be
> especially nice in batch, where we could (hypothetically at least)
> know we'll never run a given stage again.)
>
> On Wed, May 15, 2019 at 3:58 PM Robert Burke  wrote:
> >
> > What is the runner supposed to be doing to trigger the teardown of the
> bundle descriptors in an SDK harness?
> >
> > Is there a Fn API call I'm not interpreting correctly that should
> reliably trigger DoFn teardown, or generally signal that bundle processing
> is done?
> >
> >
> >
> > On Wed, May 15, 2019, 6:51 AM Robert Bradshaw 
> wrote:
> >>
> >> This does bring up an interesting question though. Are runners
> >> violating (the intent of) the spec if they simply abandon/kill workers
> >> rather than gracefully bringing them down (e.g. so that these
> >> callbacks can be invoked)?
> >>
> >> On Tue, May 7, 2019 at 3:55 PM Michael Luckey 
> wrote:
> >> >
> >> > Thanks Kenn and Reuven. Based on your feedback, I amended to the PR
> [1] implementing the missing calls to teardown.
> >> >
> >> > Best,
> >> >
> >> > michel
> >> >
> >> > [1] https://github.com/apache/beam/pull/8495
> >> >
> >> > On Tue, May 7, 2019 at 6:09 AM Kenneth Knowles 
> wrote:
> >> >>
> >> >>
> >> >>
> >> >> On Mon, May 6, 2019 at 2:19 PM Reuven Lax  wrote:
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Mon, May 6, 2019 at 2:06 PM Kenneth Knowles 
> wrote:
> >> 
> >>  The specification of TearDown is that it is best effort, certainly.
> >> >>>
> >> >>>
> >> >>> Though I believe the intent of that specification was that a runner
> will call it as long as the process itself has not crashed.
> >> >>
> >> >>
> >> >> Yea, exactly. Or more abstractly that a runner will call it unless
> it is impossible. If the hardware fails, a meteor strikes, etc, then
> teardown will not be called. But in normal operation, particularly when the
> user code throws a recoverable exception, it should be called.
> >> >>
> >> >> Kenn
> >> >>
> >> >>>
> >> >>>
> >> 
> >>  If your runner supports it, then the test is good to make sure
> there is not a regression. If your runner has partial support, that is
> within spec. But the idea of the spec is more that you might have such a
> failure that it is impossible to call the method, not that a runner simply
> never tries to call it.
> >> 
> >>  I think it seems to match what we do elsewhere to leave the test,
> add an annotation, make a note in the capability matrix about the
> limitation on ParDo.
> >> 
> >>  Kenn
> >> 
> >>  On Mon, May 6, 2019 at 5:45 AM Michael Luckey 
> wrote:
> >> >
> >> > Hi,
> >> >
> >> > after stumbling upon [1] and trying to implement a fix [2],
> ParDoLifeCycleTest is failing for the
> >> > direct runner, spark validatesRunnerBatch, and flink
> validatesRunnerBatch, as the DoFn's teardown is not invoked if the DoFn's
> setup throws an exception.
> >> >
> >> > This seems to be in line with the specification [3], as it
> explicitly states that 'teardown might not be called if unnecessary, as
> the process will be killed anyway'.
> >> >
> >> > Now I am a bit lost on how to resolve this situation. Currently,
> we seem to have following options
> >> > - remove the test, although it seems valuable in different (e.g.
> streaming?) cases
> >> > - to satisfy the test implement the call to teardown in runners
> although it seems unnecessary
> >> > - add another annotation @CallsTeardownAfterFailingSetup,
> @UsesFullParDoLifeCycle or such (would love to get suggestions for better
> name here)
> >> > - ?
> >> >
> >> > Thoughts?
> >> >
> >> > Best,
> >> >
> >> > michel
> >> >
> >> >
> >> >
> >> > [1] https://issues.apache.org/jira/browse/BEAM-7197
> >> > [2] https://github.com/apache/beam/pull/8495
> >> > [3]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/DoFn.java#L676-L680
>
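
The lifecycle contract debated in this thread can be sketched in plain Python. This is illustrative only, not the actual Beam API or harness code; the class and function names are invented. It shows the semantics Kenn describes: teardown is best-effort, but a runner should invoke it whenever the worker process itself survives, including after a failing setup.

```python
class TrackingDoFn:
    """Illustrative stand-in for a user DoFn; not the actual Beam API."""

    def __init__(self, fail_setup=False):
        self.fail_setup = fail_setup
        self.calls = []

    def setup(self):
        self.calls.append('setup')
        if self.fail_setup:
            raise RuntimeError('setup failed')

    def process(self, element):
        self.calls.append('process')

    def teardown(self):
        self.calls.append('teardown')


def run_bundle(dofn, elements):
    """Sketch of the semantics discussed above: teardown is best-effort,
    but runs whenever the worker process itself survives -- including
    after a recoverable exception thrown by setup or process."""
    try:
        dofn.setup()
        for element in elements:
            dofn.process(element)
    finally:
        # Only a hard crash (process death, hardware failure, a meteor)
        # legitimately skips this call.
        dofn.teardown()
```

Under these semantics, the ParDoLifeCycleTest expectation holds: a DoFn whose setup raises still sees its teardown invoked before the exception propagates.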


Re: Writing bytes to BigQuery with beam

2019-05-15 Thread Robert Burke
For the Go SDK:
BigQueryIO exists, but other than maybe one PR that added batching of writes
(to avoid the size limit when communicating with BigQuery), the reads are
probably going to be rewritten.
I don't believe there's any special handling of base64 bytes by the IO
code. Users pass in their types, and the assumption is they use
BigQuery-compatible Go schemas, e.g. the tornadoes example.

On Wed, 15 May 2019 at 12:41, Valentyn Tymofieiev 
wrote:

> By the way, does anyone know what is the status of BigQuery connector in
> Beam Go and Beam SQL? Perhaps some folks working on these SDKs can chime in
> here.
> I am curious whether these SDKs also make / will make it a responsibility
> of the user to base64-encode bytes. As I mentioned above, it is desirable
> to have a consistent UX across SDK, especially given that we are working on
> adding support for cross-language pipelines (
> https://beam.apache.org/roadmap/connectors-multi-sdk/).
>
> On Wed, May 15, 2019 at 12:26 PM Valentyn Tymofieiev 
> wrote:
>
>> I took a closer look at BigQuery IO implementation in Beam SDK and
>> Dataflow runner while reviewing a few PRs to address BEAM-6769, and I think
>> we have to revise the course of action here.
>>
>> It turns out, that when we first added support for BYTES in Java BiqQuery
>> IO, we designed the API with an expectation that:
>> - On write path the user must pass base64-encoded bytes to the BQ IO. [0]
>> - On read path BQ IO base64-encodes the output result, before serving it
>> to the user. [1]
>>
>> When support for BigQuery was added to Python SDK and Dataflow runner,
>> the runner authors preserved the behavior of treating bytes to be
>> consistent with Java BQ IO - bytes must be base64-encoded by the user, and
>> bytes from BQ IO returned by Dataflow Python runner are base64-encoded.
>>
>> Unfortunately, this behavior is not documented in public documentation or
>> JavaDoc/PyDocs [2-4], and there were no examples illustrating it, up until
>> we added integration tests a few years down the road [5,6]. Thanks to these
>> integration tests we discovered BEAM-6769.
>>
>> I don't have context why we made a decision to avoid handling raw bytes
>> in Beam, however I think keeping consistent treatment of bytes across all
>> SDKs and runners is important for a smooth user experience, especially so
>> when a particular behavior is not documented well.
>>
>> This being said I suggest the following:
>> 1. Let's keep the current expectation that Beam operates only on
>> base64-encoded bytes in BQ IO. It may be reasonable to revise this
>> expectation, but it is beyond the scope of  BEAM-6769.
>> 2. Let's document the current behavior of BQ IO w.r.t. handling bytes.
>> Chances are that if we had such documentation, we wouldn't have had to
>> answer questions raised in this thread. Filed BEAM-7326 to track.
>> 3. Let's revise Python BQ integration tests to clearly communicate that
>> BQ IO expects base64-encoded bytes. Filed BEAM-7327 to track.
>>
>> Coming back to the original message:
>>
>> When writing b’abc’ in python 2 this results in actually writing b'i\xb7'
>>> which is the same as base64.b64decode('abc=')
>>
>> This is expected as Beam BQ IO expect users to base64-encode their bytes.
>>
>>> When writing b’abc’ in python 3 this results in “TypeError: b'abc' is
>>> not JSON serializable”
>>
>> This is a Py3-compatibility bug. We should decode bytes to a str on
>> Python 3. Given that we expect input to be base64-encoded, we can use the
>> 'ascii' codec.
>>
>>> When writing b’\xab’ in py2/py3 this gives a “ValueError: 'utf8' codec
>>> can't decode byte 0xab in position 0: invalid start byte. NAN, INF and -INF
>>> values are not JSON compliant
>>
>> This expected since b’\xab’ cannot be base64 decoded.
>>
>>> When reading bytes from BQ they are currently returned as base-64
>>> encoded strings rather then the raw bytes.
>>
>> This is also expected.
>>
>> [0]
>> https://github.com/apache/beam/commit/c7e0010b0d4a3c45148d05f5101f5310bb84c40c#diff-1016cd1e3092d30556292ab7b983c4c8R103
>>
>> [1]
>> https://github.com/apache/beam/commit/c7e0010b0d4a3c45148d05f5101f5310bb84c40c#diff-44025ee9b9c94123967e1df92bfb1c04R207
>> [2] https://beam.apache.org/documentation/io/built-in/google-bigquery/
>> [3]
>> https://beam.apache.org/releases/pydoc/2.12.0/apache_beam.io.gcp.bigquery.html
>> [4]
>> https://beam.apache.org/releases/javadoc/2.12.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.html
>> [5]
>> https://github.com/apache/beam/blob/7b1abc923183a9f6336d3d44681b8fcd8785104c/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryToTableIT.java#L92
>>
>> [6]
>> https://github.com/apache/beam/commit/d6b456dd922655b216b2c5af6548b0f5fe4eb507#diff-7f1bb65cbe782f5a27c5a75b6fe89fbcR112
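
The write/read-path contract spelled out above can be sketched in plain Python. The helper names and row shape are illustrative, not the BigQuery IO API itself; only the base64 expectation comes from the thread.

```python
import base64

def encode_bytes_field(raw):
    """Base64-encode a BYTES value, returning an ASCII str so the row
    stays JSON-serializable on both Python 2 and Python 3."""
    return base64.b64encode(raw).decode('ascii')

def decode_bytes_field(encoded):
    """Recover the raw bytes from the base64 string that BQ IO returns
    on the read path."""
    return base64.b64decode(encoded)

# The thread's example: the literal b'abc' is treated as already-encoded
# base64 on the write path, so it decodes to the two bytes b'i\xb7'.
assert base64.b64decode('abc=') == b'i\xb7'

# Raw bytes such as b'\xab' must be encoded by the user first; passed
# directly they are neither valid base64 nor JSON-serializable.
row = {'payload': encode_bytes_field(b'\xab')}
```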
>>
>>
>> On Tue, Mar 26, 2019 at 11:27 AM 

Re: Problem with gzip

2019-05-15 Thread Allie Chen
Thanks Robert. Yes, reading is the bottleneck, and we cannot do much better
for gzip files; that's why we would like to at least parallelize the other
transforms with the reading.

I tried the side input to break the fusion, as you suggested earlier, and
it does a much better job than using Reshuffle! One test's running times, if
anyone is interested:

without any fusion break: 6 hours
with Reshuffle: never ends. cancelled after running 6 hours, about half
elements processed at Reshuffle step.
with side input (not using --experiment=use_fastavro yet, I will try it
later): 2 hours

Thanks all for your help!
Allie


*From: *Robert Bradshaw 
*Date: *Wed, May 15, 2019 at 3:34 PM
*To: *dev
*Cc: *user

On Wed, May 15, 2019 at 8:43 PM Allie Chen  wrote:
>
>> Thanks all for your reply. I will try each of them and see how it goes.
>>
>> The experiment I am working now is similar to
>> https://stackoverflow.com/questions/48886943/early-results-from-groupbykey-transform,
>> which tries to get early results from GroupByKey with windowing. I have
>> some code like:
>>
>> Reading
>>   | beam.WindowInto(beam.window.GlobalWindows(),
>>       trigger=trigger.Repeatedly(trigger.AfterCount(1)),
>>       accumulation_mode=beam.transforms.trigger.AccumulationMode.DISCARDING)
>>   | MapWithAKey
>>   | GroupByKey
>>   | RemoveKey
>>   | OtherTransforms
>>
>>
>> I don't see the window and trigger working; GroupByKey still waits for
>> all elements. I also tried adding a timestamp to each element and using a
>> fixed-size window. It seems to have no impact.
>>
>>
>> Anyone knows how to get the early results from GroupByKey for a bounded
>> source?
>>
>
> Note that this is essentially how Reshuffle() is implemented. However,
> batch never gives early results from a GroupByKey; each stage is executed
> sequentially.
>
> Is the goal here to be able to parallelize the Read with other operations?
> If the Read (and limited-parallelism write) is still the bottleneck, that
> might not help much.
>
>


Re: Writing bytes to BigQuery with beam

2019-05-15 Thread Valentyn Tymofieiev
By the way, does anyone know what is the status of BigQuery connector in
Beam Go and Beam SQL? Perhaps some folks working on these SDKs can chime in
here.
I am curious whether these SDKs also make / will make it a responsibility
of the user to base64-encode bytes. As I mentioned above, it is desirable
to have a consistent UX across SDK, especially given that we are working on
adding support for cross-language pipelines (
https://beam.apache.org/roadmap/connectors-multi-sdk/).

On Wed, May 15, 2019 at 12:26 PM Valentyn Tymofieiev 
wrote:

> I took a closer look at BigQuery IO implementation in Beam SDK and
> Dataflow runner while reviewing a few PRs to address BEAM-6769, and I think
> we have to revise the course of action here.
>
> It turns out, that when we first added support for BYTES in Java BiqQuery
> IO, we designed the API with an expectation that:
> - On write path the user must pass base64-encoded bytes to the BQ IO. [0]
> - On read path BQ IO base64-encodes the output result, before serving it
> to the user. [1]
>
> When support for BigQuery was added to Python SDK and Dataflow runner, the
> runner authors preserved the behavior of treating bytes to be consistent
> with Java BQ IO - bytes must be base64-encoded by the user, and bytes from
> BQ IO returned by Dataflow Python runner are base64-encoded.
>
> Unfortunately, this behavior is not documented in public documentation or
> JavaDoc/PyDocs [2-4], and there were no examples illustrating it, up until
> we added integration tests a few years down the road [5,6]. Thanks to these
> integration tests we discovered BEAM-6769.
>
> I don't have context on why we made the decision to avoid handling raw bytes in
> Beam, however I think keeping consistent treatment of bytes across all SDKs
> and runners is important for a smooth user experience, especially so when a
> particular behavior is not documented well.
>
> That being said, I suggest the following:
> 1. Let's keep the current expectation that Beam operates only on
> base64-encoded bytes in BQ IO. It may be reasonable to revise this
> expectation, but that is beyond the scope of BEAM-6769.
> 2. Let's document the current behavior of BQ IO w.r.t. handling bytes.
> Chances are that if we had such documentation, we wouldn't have had to
> answer the questions raised in this thread. Filed BEAM-7326 to track.
> 3. Let's revise the Python BQ integration tests to clearly communicate that
> BQ IO expects base64-encoded bytes. Filed BEAM-7327 to track.
>
> Coming back to the original message:
>
> When writing b’abc’ in python 2 this results in actually writing b'i\xb7'
>> which is the same as base64.b64decode('abc='))
>
> This is expected, as Beam BQ IO expects users to base64-encode their bytes.
>
>> When writing b’abc’ in python 3 this results in “TypeError: b'abc' is not
>> JSON serializable”
>
> This is a Py3-compatibility bug. We should decode bytes to a str on Python
> 3. Given that we expect the input to be base64-encoded, we can use the
> 'ascii' codec.
>
>> When writing b’\xab’ in py2/py3 this gives a “ValueError: 'utf8' codec
>> can't decode byte 0xab in position 0: invalid start byte. NAN, INF and -INF
>> values are not JSON compliant
>
> This is expected, since b’\xab’ cannot be base64-decoded.
>
>> When reading bytes from BQ they are currently returned as base64-encoded
>> strings rather than the raw bytes.
>
> This is also expected.
>
> [0]
> https://github.com/apache/beam/commit/c7e0010b0d4a3c45148d05f5101f5310bb84c40c#diff-1016cd1e3092d30556292ab7b983c4c8R103
>
> [1]
> https://github.com/apache/beam/commit/c7e0010b0d4a3c45148d05f5101f5310bb84c40c#diff-44025ee9b9c94123967e1df92bfb1c04R207
> [2] https://beam.apache.org/documentation/io/built-in/google-bigquery/
> [3]
> https://beam.apache.org/releases/pydoc/2.12.0/apache_beam.io.gcp.bigquery.html
> [4]
> https://beam.apache.org/releases/javadoc/2.12.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.html
> [5]
> https://github.com/apache/beam/blob/7b1abc923183a9f6336d3d44681b8fcd8785104c/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryToTableIT.java#L92
>
> [6]
> https://github.com/apache/beam/commit/d6b456dd922655b216b2c5af6548b0f5fe4eb507#diff-7f1bb65cbe782f5a27c5a75b6fe89fbcR112
>
>
> On Tue, Mar 26, 2019 at 11:27 AM Pablo Estrada  wrote:
>
>> Sure, we can make users explicitly ask for schema autodetection, instead
>> of it being the default when no schema is provided. I think that's
>> reasonable.
>>
>>
>> On Mon, Mar 25, 2019, 7:19 PM Valentyn Tymofieiev 
>> wrote:
>>
>>> Thanks everyone for input on this thread. I think there is a confusion
>>> between not specifying the schema, and asking BigQuery to do schema
>>> autodetection. This is not the same thing, however in recent changes to BQ
>>> IO that happened after 2.11 release, we are forcing schema autodetection,
>>> when schema is not specified, see: [1].
>>>
>>> I think we need to revise this ahead of 2.12. It may be better if users
>>> explicitly opt-in to schema 

Re: Problem with gzip

2019-05-15 Thread Robert Bradshaw
On Wed, May 15, 2019 at 8:43 PM Allie Chen  wrote:

> Thanks all for your reply. I will try each of them and see how it goes.
>
> The experiment I am working now is similar to
> https://stackoverflow.com/questions/48886943/early-results-from-groupbykey-transform,
> which tries to get early results from GroupByKey with windowing. I have
> some code like:
>
> Reading | beam.WindowInto(beam.window.GlobalWindows(),
>             trigger=trigger.Repeatedly(trigger.AfterCount(1)),
>             accumulation_mode=beam.transforms.trigger.AccumulationMode.DISCARDING)
>
> | MapWithAKey
>
> | GroupByKey
>
> | RemoveKey
>
> | OtherTransforms
>
>
> I don't see the window and trigger working; GroupByKey still waits for all
> elements. I also tried adding a timestamp to each element and using a
> fixed-size window, but it seems to have no impact.
>
>
> Does anyone know how to get early results from GroupByKey for a bounded
> source?
>

Note that this is essentially how Reshuffle() is implemented. However,
batch never gives early results from a GroupByKey; each stage is executed
sequentially.

Is the goal here to be able to parallelize the Read with other operations?
If the Read (and limited-parallelism write) is still the bottleneck, that
might not help much.


Re: Writing bytes to BigQuery with beam

2019-05-15 Thread Valentyn Tymofieiev
I took a closer look at BigQuery IO implementation in Beam SDK and Dataflow
runner while reviewing a few PRs to address BEAM-6769, and I think we have
to revise the course of action here.

It turns out that when we first added support for BYTES in the Java BigQuery
IO, we designed the API with an expectation that:
- On write path the user must pass base64-encoded bytes to the BQ IO. [0]
- On read path BQ IO base64-encodes the output result, before serving it to
the user. [1]

When support for BigQuery was added to Python SDK and Dataflow runner, the
runner authors preserved the behavior of treating bytes to be consistent
with Java BQ IO - bytes must be base64-encoded by the user, and bytes from
BQ IO returned by Dataflow Python runner are base64-encoded.

Unfortunately, this behavior is not documented in public documentation or
JavaDoc/PyDocs [2-4], and there were no examples illustrating it, up until
we added integration tests a few years down the road [5,6]. Thanks to these
integration tests we discovered BEAM-6769.

I don't have context on why we made the decision to avoid handling raw bytes in
Beam, however I think keeping consistent treatment of bytes across all SDKs
and runners is important for a smooth user experience, especially so when a
particular behavior is not documented well.

That being said, I suggest the following:
1. Let's keep the current expectation that Beam operates only on
base64-encoded bytes in BQ IO. It may be reasonable to revise this
expectation, but that is beyond the scope of BEAM-6769.
2. Let's document the current behavior of BQ IO w.r.t. handling bytes.
Chances are that if we had such documentation, we wouldn't have had to
answer the questions raised in this thread. Filed BEAM-7326 to track.
3. Let's revise the Python BQ integration tests to clearly communicate that
BQ IO expects base64-encoded bytes. Filed BEAM-7327 to track.

Coming back to the original message:

When writing b’abc’ in python 2 this results in actually writing b'i\xb7'
> which is the same as base64.b64decode('abc='))

This is expected, as Beam BQ IO expects users to base64-encode their bytes.

> When writing b’abc’ in python 3 this results in “TypeError: b'abc' is not
> JSON serializable”

This is a Py3-compatibility bug. We should decode bytes to a str on Python
3. Given that we expect the input to be base64-encoded, we can use the
'ascii' codec.

> When writing b’\xab’ in py2/py3 this gives a “ValueError: 'utf8' codec
> can't decode byte 0xab in position 0: invalid start byte. NAN, INF and -INF
> values are not JSON compliant

This is expected, since b’\xab’ cannot be base64-decoded.

> When reading bytes from BQ they are currently returned as base64-encoded
> strings rather than the raw bytes.

This is also expected.
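To make the expected contract concrete, here is a minimal Python sketch of the round trip described above (the field name is made up for illustration):

```python
import base64
import json

# Write path: the user base64-encodes raw bytes before handing the row
# to BQ IO. Decoding the result with the 'ascii' codec yields a str that
# is JSON-serializable on both Python 2 and Python 3.
raw = b'\xab\xcd'
row = {'my_bytes_field': base64.b64encode(raw).decode('ascii')}
print(json.dumps(row))  # {"my_bytes_field": "q80="}

# Read path: BQ IO returns base64-encoded strings, so the user decodes
# them back to raw bytes.
assert base64.b64decode(row['my_bytes_field']) == raw
```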

[0]
https://github.com/apache/beam/commit/c7e0010b0d4a3c45148d05f5101f5310bb84c40c#diff-1016cd1e3092d30556292ab7b983c4c8R103

[1]
https://github.com/apache/beam/commit/c7e0010b0d4a3c45148d05f5101f5310bb84c40c#diff-44025ee9b9c94123967e1df92bfb1c04R207
[2] https://beam.apache.org/documentation/io/built-in/google-bigquery/
[3]
https://beam.apache.org/releases/pydoc/2.12.0/apache_beam.io.gcp.bigquery.html
[4]
https://beam.apache.org/releases/javadoc/2.12.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.html
[5]
https://github.com/apache/beam/blob/7b1abc923183a9f6336d3d44681b8fcd8785104c/sdks/java/io/google-cloud-platform/src/test/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryToTableIT.java#L92

[6]
https://github.com/apache/beam/commit/d6b456dd922655b216b2c5af6548b0f5fe4eb507#diff-7f1bb65cbe782f5a27c5a75b6fe89fbcR112


On Tue, Mar 26, 2019 at 11:27 AM Pablo Estrada  wrote:

> Sure, we can make users explicitly ask for schema autodetection, instead
> of it being the default when no schema is provided. I think that's
> reasonable.
>
>
> On Mon, Mar 25, 2019, 7:19 PM Valentyn Tymofieiev 
> wrote:
>
>> Thanks everyone for input on this thread. I think there is a confusion
>> between not specifying the schema and asking BigQuery to do schema
>> autodetection. This is not the same thing; however, in recent changes to BQ
>> IO that happened after the 2.11 release, we are forcing schema autodetection
>> when the schema is not specified, see [1].
>>
>> I think we need to revise this ahead of 2.12. It may be better if users
>> explicitly opt-in to schema autodetection if they wish. Autodetection is an
>> approximation, and in particular, as we figured out in this thread, it does
>> not work correctly for BYTES data.
>>
>> I suspect that if we disable schema autodetection, and/or make previous
>> implementation of BQ sink a default option, we will be able to write BYTES
>> data to a previously created BQ table without specifying the schema, and
>> making a call to BQ to fetch the schema won't be necessary. We'd need to
>> verify that.
>>
>
>> Another interesting note, as per Juta's analysis
>> ,
>> google-cloud-bigquery client does not 

Re: [ANNOUNCE] New PMC Member: Pablo Estrada

2019-05-15 Thread Niklas Hansson
Congratulations Pablo :)

Den ons 15 maj 2019 kl 21:21 skrev Ruoyun Huang :

> Congratulations, Pablo!
>
> *From: *Charles Chen 
> *Date: *Wed, May 15, 2019 at 11:04 AM
> *To: *dev
>
> Congrats Pablo and thank you for your contributions!
>>
>> On Wed, May 15, 2019, 10:53 AM Valentyn Tymofieiev 
>> wrote:
>>
>>> Congrats, Pablo!
>>>
>>> On Wed, May 15, 2019 at 10:41 AM Yifan Zou  wrote:
>>>
 Congratulations, Pablo!

 *From: *Maximilian Michels 
 *Date: *Wed, May 15, 2019 at 2:06 AM
 *To: * 

 Congrats Pablo! Thank you for your help to grow the Beam community!
>
> On 15.05.19 10:33, Tim Robertson wrote:
> > Congratulations Pablo
> >
> > On Wed, May 15, 2019 at 10:22 AM Ismaël Mejía  > > wrote:
> >
> > Congrats Pablo, well deserved, nice to see your work recognized!
> >
> > On Wed, May 15, 2019 at 9:59 AM Pei HE  > > wrote:
> >  >
> >  > Congrats, Pablo!
> >  >
> >  > On Tue, May 14, 2019 at 11:41 PM Tanay Tummalapalli
> >  > mailto:ttanay.apa...@gmail.com>>
> wrote:
> >  > >
> >  > > Congratulations Pablo!
> >  > >
> >  > > On Wed, May 15, 2019, 12:08 Michael Luckey <
> adude3...@gmail.com
> > > wrote:
> >  > >>
> >  > >> Congrats, Pablo!
> >  > >>
> >  > >> On Wed, May 15, 2019 at 8:21 AM Connell O'Callaghan
> > mailto:conne...@google.com>> wrote:
> >  > >>>
> >  > >>> Awesome well done Pablo!!!
> >  > >>>
> >  > >>> Kenn thank you for sharing this great news with us!!!
> >  > >>>
> >  > >>> On Tue, May 14, 2019 at 11:01 PM Ahmet Altay
> > mailto:al...@google.com>> wrote:
> >  > 
> >  >  Congratulations!
> >  > 
> >  >  On Tue, May 14, 2019 at 9:11 PM Robert Burke
> > mailto:rob...@frantil.com>> wrote:
> >  > >
> >  > > Woohoo! Well deserved.
> >  > >
> >  > > On Tue, May 14, 2019, 8:34 PM Reuven Lax <
> re...@google.com
> > > wrote:
> >  > >>
> >  > >> Congratulations!
> >  > >>
> >  > >> From: Mikhail Gryzykhin  > >
> >  > >> Date: Tue, May 14, 2019 at 8:32 PM
> >  > >> To: mailto:dev@beam.apache.org>>
> >  > >>
> >  > >>> Congratulations Pablo!
> >  > >>>
> >  > >>> On Tue, May 14, 2019, 20:25 Kenneth Knowles
> > mailto:k...@apache.org>> wrote:
> >  > 
> >  >  Hi all,
> >  > 
> >  >  Please join me and the rest of the Beam PMC in
> welcoming
> > Pablo Estrada to join the PMC.
> >  > 
> >  >  Pablo first picked up BEAM-722 in October of 2016 and
> > has been a steady part of the Beam community since then. In
> addition
> > to technical work on Beam Python & Java & runners, I would
> highlight
> > how Pablo grows Beam's community by helping users, working on
> GSoC,
> > giving talks at Beam Summits and other OSS conferences including
> > Flink Forward, and holding training workshops. I cannot do
> justice
> > to Pablo's contributions in a single paragraph.
> >  > 
> >  >  Thanks Pablo, for being a part of Beam.
> >  > 
> >  >  Kenn
> >
>

>
> --
> 
> Ruoyun  Huang
>
>


Re: [ANNOUNCE] New PMC Member: Pablo Estrada

2019-05-15 Thread Ankur Goenka
Congratulations Pablo!

On Wed, May 15, 2019, 12:21 PM Ruoyun Huang  wrote:

> Congratulations, Pablo!
>


Re: Semantics of PCollection.isBounded

2019-05-15 Thread Jan Lukavský
Hmmm, looking into the code of FlinkRunner (and also by observing
results from the stateful ParDo), it seems that I got it wrong from the
beginning. The data is not sorted before the stateful ParDo, which
surprises me a little. How should the operator work in this case? It
would mean that in the batch case I have to hold an arbitrarily long
allowedLateness inside the BagState, which seems suboptimal. Or am I
missing something obvious here? I'll describe the use case in more
detail. Let's suppose I have a series of ones and zeros, and I want to
emit, at each time point, a value of 1 if the value changes from 0 to 1,
a value of -1 if it changes from 1 to 0, and 0 otherwise. So:


 0, 1, 1, 0, 0, 1 -> 0, 1, 0, -1, 0, 1
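Setting Beam state aside for a moment, a minimal in-memory sketch of that mapping (assuming the input arrives in event-time order, which is exactly the property in question) is:

```python
def transitions(values, prev=0):
    # Emit +1 on a 0 -> 1 edge, -1 on a 1 -> 0 edge, and 0 otherwise.
    # 'prev' is the assumed value before the series starts.
    out = []
    for v in values:
        out.append(v - prev)
        prev = v
    return out

print(transitions([0, 1, 1, 0, 0, 1]))  # [0, 1, 0, -1, 0, 1]
```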

Does anyone have a better idea how to solve this? And if not, how can I
make it run in batch without a possibly infinite buffer? Should the input
to a stateful ParDo be sorted in the batch case? My intuition would be
that it should, because in my understanding of "batch as a special case
of streaming", in the batch case there is (by default) a single window,
time advances from -inf to +inf at the end, and the data contains no
out-of-order data in places where this might matter (which therefore
enables some optimizations). The order would be relevant only in the
stateful ParDo, I'd say.


Jan

On 5/15/19 8:34 PM, Jan Lukavský wrote:
Just to clarify, I understand that changing the semantics of
PCollection.isBounded is probably impossible now, because it would
probably introduce a chicken-and-egg problem. Maybe I will state it more
clearly - would it be better to be able to run bounded pipelines using
batch semantics on DirectRunner (including sorting before stateful
ParDos), or would it be better to come up with some way to notify the
pipeline that it will be running in a streaming way although it
consists only of bounded inputs? And I'm not saying how to do it, just
trying to find out if anyone else ever had such a need.


Jan

On 5/15/19 5:20 PM, Jan Lukavský wrote:

Hi,

I have come across an unexpected (at least for me) apparent
inconsistency between how a PCollection is processed in DirectRunner
and what PCollection.isBounded signals. Let me explain:


 - I have a stateful ParDo which needs to make sure that elements
arrive in order. It accomplishes this by defining a BagState for
buffering input elements and sorting them inside this buffer; it also
keeps track of the element with the highest timestamp to estimate a
local watermark (minus some allowed lateness), to know when to remove
elements from the buffer, sort them by time, and pass them to some
(time-ordered) processing


 - this seems to work well for streaming (unbounded) data

 - for batch (bounded) data the semantics of stateful ParDo should be 
(please correct me if I'm wrong) that elements always arrive in 
order, because the runner can sort them by timestamp


 - this implies that for batch-processed (bounded) input the
allowedLateness can be set to zero, so that the processing is a little
more efficient, because it doesn't have to use the BagState at all


 - now, the trouble seems to be that DirectRunner always uses
streaming processing, even if the input is bounded (which is by
definition possible), but there is currently no way to know when it is
possible to change allowed lateness to zero (because input will
arrive ordered)


 - so - it seems to me, that either DirectRunner should apply sorting 
to stateful ParDo, when it processes bounded data (the same way that 
other runners do), or it can apply streaming processing, but then it 
should change PCollection.isBounded to UNBOUNDED, even if the input 
is originally bounded


 - that way, the semantics of PCollection.isBounded would be not whether
the data are known in advance to be finite, but *how* the data are
going to be processed, which is much more valuable (IMO)


Any thoughts?

 Jan
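For reference, a plain-Python sketch (no Beam APIs) of the buffering scheme described above - the class name and the watermark estimate are purely illustrative stand-ins for the BagState-plus-sorting approach:

```python
import heapq

class OrderedBuffer:
    """Buffers (timestamp, element) pairs, tracks the highest timestamp
    seen, and releases elements in timestamp order once they fall behind
    the estimated local watermark (max timestamp minus allowed lateness)."""

    def __init__(self, allowed_lateness):
        self.allowed_lateness = allowed_lateness
        self.heap = []  # min-heap ordered by timestamp
        self.max_ts = float('-inf')

    def add(self, ts, element):
        heapq.heappush(self.heap, (ts, element))
        self.max_ts = max(self.max_ts, ts)
        watermark = self.max_ts - self.allowed_lateness
        ready = []
        while self.heap and self.heap[0][0] <= watermark:
            ready.append(heapq.heappop(self.heap))
        return ready  # time-ordered elements that are safe to process

buf = OrderedBuffer(allowed_lateness=5)
buf.add(3, 'b')
buf.add(1, 'a')
print(buf.add(10, 'c'))  # [(1, 'a'), (3, 'b')]
```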



Re: SqlTransform Metadata

2019-05-15 Thread Robert Bradshaw
On Wed, May 15, 2019 at 8:51 PM Kenneth Knowles  wrote:
>
> On Wed, May 15, 2019 at 3:05 AM Robert Bradshaw  wrote:
>>
>> Isn't there an API for concisely computing new fields from old ones?
>> Perhaps these expressions could contain references to metadata value
>> such as timestamp. Otherwise,
>
> Even so, being able to refer to the timestamp implies something about its 
> presence in a namespace, shared with other user-decided names.

I was thinking that functions may live in a different namespace than fields.

> And it may be nice for users to use that API within the composite 
> SqlTransform. I think there are a lot of options.
>
>> Rather than withMetadata reifying the value as a nested field, with
>> the timestamp, window, etc. at the top level, one could let it take a
>> field name argument that attaches all the metadata as an extra
>> (struct-like) field. This would be like attachX, but without having to
>> have a separate method for every X.
>
> If you leave the input field names at the top level, then any "attach" style 
> API requires choosing a name that doesn't conflict with input field names. 
> You can't write a generic transform that works with all inputs. I think it is 
> much simpler to move the input fields all into a nested row/struct. Putting
> all the metadata in a second nested row/struct is just as good as top-level, 
> perhaps. But moving the input into the struct/row is important.

Very good point about writing generic transforms. It does mean a lot
of editing if one decides one wants to access the metadata field(s)
after-the-fact. (I also don't think we need to put the metadata in a
nested struct if the value is.)

>> It seems restrictive to only consider this as a special mode for
>> SqlTransform rather than a more generic operation. (For SQL, my first
>> instinct would be to just make this a special function like
>> element_timestamp(), but there is some ambiguity there when there are
>> multiple tables in the expression.)
>
> I would propose it as both: we already have some Reify transforms, and you 
> could make a general operation that does this small data preparation easily. 
> I think the proposal is just to add a convenience build method on 
> SqlTransform to include the underlying functionality as part of the 
> composite, which we really already have.
>
> I don't think we should extend SQL with built-in functions for 
> element_timestamp() and things like that, because SQL already has TIMESTAMP 
> columns and it is very natural to use SQL on unbounded relations where the 
> timestamp is just part of the data.

That's why I was suggesting a single element_metadata() rather than
exploding each one out.

Do you have a pointer to what the TIMESTAMP columns are? (I'm assuming
this is a special field, but distinct from the metadata timestamp?)

>> On Wed, May 15, 2019 at 5:03 AM Reza Rokni  wrote:
>> >
>> > Hi,
>> >
>> > One use case would be when dealing with the windowing functions for 
>> > example:
>> >
>> > SELECT f_int, COUNT(*) , TUMBLE_START(f_timestamp, INTERVAL '1' HOUR) 
>> > tumble_start
>> >   FROM PCOLLECTION
>> >   GROUP BY
>> > f_int,
>> > TUMBLE(f_timestamp, INTERVAL '1' HOUR)
>> >
>> > For an element which is using metadata to inform the EventTime of the
>> > element, rather than data within the element itself, I would need to
>> > create a new schema which adds the timestamp as a field. I think other
>> > examples which may be interesting include getting the value of a row with
>> > the max/min timestamp. None of this would be difficult, but it does feel a
>> > little on the verbose side and also makes the pipeline a little harder to
>> > read.
>> >
>> > Cheers
>> > Reza
>> >
>> >
>> >
>> >
>> >
>> > From: Kenneth Knowles 
>> > Date: Wed, 15 May 2019 at 01:15
>> > To: dev
>> >
>> >> We have support for nested rows so this should be easy. The .withMetadata 
>> >> would reify the struct, moving from Row to WindowedValue if I 
>> >> understand it...
>> >>
>> >> SqlTransform.query("SELECT field1 from PCOLLECTION"):
>> >>
>> >> Schema = {
>> >>   field1: type1,
>> >>   field2: type2
>> >> }
>> >>
>> >> SqlTransform.query(...)
>> >>
>> >> SqlTransform.withMetadata().query("SELECT event_timestamp, value.field1 
>> >> FROM PCOLLECTION")
>> >>
>> >> Derived schema = {
>> >>   event_timestamp: TIMESTAMP,
>> >>   pane_info: { ... }
>> >>   value: {
>> >> field1: type1,
>> >> field2: type2,
>> >> ...
>> >>   }
>> >> }
>> >>
>> >> SqlTransform would expand into a different composite, and it would be a 
>> >> straightforward ParDo to adjust the data, possibly automatic via the new 
>> >> schema conversions.
>> >>
>> >> Embedding the window would be a bit wonky, something like { 
>> >> end_of_window: TIMESTAMP, encoded_window: bytes } which would be 
>> >> expensive due to encoding. But timestamp and pane info not so bad.
>> >>
>> >> Kenn
>> >>
>> >> From: Anton Kedin 
>> >> Date: Tue, May 14, 2019 

Re: Dealing with incompatible changes in build system on LTS releases

2019-05-15 Thread Kenneth Knowles
On Wed, May 15, 2019 at 11:21 AM Lukasz Cwik  wrote:

>
>
> *From: *Michael Luckey 
> *Date: *Tue, May 14, 2019 at 11:42 PM
> *To: * 
>
> Hi,
>>
>> do we currently have a strategy on how to handle LTS releases in the context
>> of incompatible changes on the build system?
>>
>> As far as I can see, the problem is (at least) twofold.
>>
>> 1. Incompatible changes on test-infra job definitions
>>
>> There might be changes in our groovy files which make it impossible to
>> build/test an old branch on Jenkins. How do we intend to handle this? Of
>> course, in that cases we could run seed job and reset Jenkins to
>> corresponding old state but this will impact or even stall development on
>> master.
>>
>
> There was a point in time where people were looking at migrating to
> Jenkins pipelines since the pipeline definitions are in the source
> repository. Any solution that moves configuration to be based upon the
> source repository and out of Jenkins would address/mitigate this issue.
>

+1000

I've looked around for documentation that makes this as clear as the
.travis.yml docs, but didn't find anything like that. If you know of any
good docs on this, please share, or just scrape the knowledge from other
projects that are using it.


> 2. Incompatible changes on agents
>>
>> Even worse, we might introduce changes on the agents itself, which will
>> even render it impossible to successfully seed to that legacy state. Do we
>> have any option to revert to an old Jenkins agent setup in such cases? I am
>> currently unaware of a link from the Apache repo to the Jenkins configuration
>> state to enable restoration of (old) agents? Is there such a thing?
>>
>> Would it be possible/helpful to subdivide our Jenkins agent pool in some
>> way that seed job could be run only on a dedicated subgroup (which then
>> could be set to an old state)? If I recall correctly Yifan put a lot of
>> effort into migrating our agents to the newer JNLP approach,
>> and used a 'private' agent to do the required testing. I assume this was
>> a manual setup and is not automated to be useful in such cases?
>>
>> What do others think about this issue? Is it something to follow on or
>> more of a non issue?
>>
>>
> What about dockerizing the builds/tests, this would allow us to use older
> versions of the docker container for older branches. Contributors
> interested in Python development were looking at this for testing Py2 and
> Py3 at the same time[1].
>

I think docker-in-docker was the blocker.

Kenn


>
>
>> Best,
>>
>> michel
>>
>>
>>
> 1:
> https://lists.apache.org/thread.html/4be0f687135b7c6778224dd76389a39f9ebf78a3cf9c4cb4e76ebb73@%3Cdev.beam.apache.org%3E
>
>


Re: SqlTransform Metadata

2019-05-15 Thread Kenneth Knowles
On Wed, May 15, 2019 at 3:05 AM Robert Bradshaw  wrote:

> Isn't there an API for concisely computing new fields from old ones?
> Perhaps these expressions could contain references to metadata value
> such as timestamp. Otherwise,
>

Even so, being able to refer to the timestamp implies something about its
presence in a namespace, shared with other user-decided names. And it may
be nice for users to use that API within the composite SqlTransform. I
think there are a lot of options.

Rather than withMetadata reifying the value as a nested field, with
> the timestamp, window, etc. at the top level, one could let it take a
> field name argument that attaches all the metadata as an extra
> (struct-like) field. This would be like attachX, but without having to
> have a separate method for every X.
>

If you leave the input field names at the top level, then any "attach"
style API requires choosing a name that doesn't conflict with input field
names. You can't write a generic transform that works with all inputs. I
think it is much simpler to move the input fields all into a nested
row/struct. Putting all the metadata in a second nested row/struct is just
as good as top-level, perhaps. But moving the input into the struct/row is
important.
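Concretely, a plain-Python sketch of the shape being proposed (field names are illustrative, not Beam's actual API): nesting the user row under a single field means metadata names can never collide with user field names.

```python
def reify(row, event_timestamp, pane_info):
    # Move the user's fields one level down under 'value' and place the
    # element metadata alongside; a generic transform can then always
    # refer to 'event_timestamp' regardless of the input schema.
    return {
        'event_timestamp': event_timestamp,
        'pane_info': pane_info,
        'value': row,
    }

reified = reify({'field1': 'a', 'event_timestamp': 'user data'},
                event_timestamp=1557936000, pane_info={'is_first': True})
print(reified['value']['event_timestamp'])  # 'user data' -- no collision
```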


> It seems restrictive to only consider this as a special mode for
> SqlTransform rather than a more generic operation. (For SQL, my first
> instinct would be to just make this a special function like
> element_timestamp(), but there is some ambiguity there when there are
> multiple tables in the expression.)
>

I would propose it as both: we already have some Reify transforms, and you
could make a general operation that does this small data preparation
easily. I think the proposal is just to add a convenience build method on
SqlTransform to include the underlying functionality as part of the
composite, which we really already have.

I don't think we should extend SQL with built-in functions for
element_timestamp() and things like that, because SQL already has TIMESTAMP
columns and it is very natural to use SQL on unbounded relations where the
timestamp is just part of the data.

Kenn


>
> On Wed, May 15, 2019 at 5:03 AM Reza Rokni  wrote:
> >
> > Hi,
> >
> > One use case would be when dealing with the windowing functions for
> example:
> >
> > SELECT f_int, COUNT(*) , TUMBLE_START(f_timestamp, INTERVAL '1' HOUR)
> tumble_start
> >   FROM PCOLLECTION
> >   GROUP BY
> > f_int,
> > TUMBLE(f_timestamp, INTERVAL '1' HOUR)
> >
> > For an element which is using metadata to inform the EventTime of the
> element, rather than data within the element itself, I would need to create
> a new schema which adds the timestamp as a field. I think other examples
> which may be interesting include getting the value of a row with the max/min
> timestamp. None of this would be difficult, but it does feel a little on the
> verbose side and also makes the pipeline a little harder to read.
> >
> > Cheers
> > Reza
> >
> >
> >
> >
> >
> > From: Kenneth Knowles 
> > Date: Wed, 15 May 2019 at 01:15
> > To: dev
> >
> >> We have support for nested rows so this should be easy. The
> .withMetadata would reify the struct, moving from Row to WindowedValue
> if I understand it...
> >>
> >> SqlTransform.query("SELECT field1 from PCOLLECTION"):
> >>
> >> Schema = {
> >>   field1: type1,
> >>   field2: type2
> >> }
> >>
> >> SqlTransform.query(...)
> >>
> >> SqlTransform.withMetadata().query("SELECT event_timestamp, value.field1
> FROM PCOLLECTION")
> >>
> >> Derived schema = {
> >>   event_timestamp: TIMESTAMP,
> >>   pane_info: { ... }
> >>   value: {
> >> field1: type1,
> >> field2: type2,
> >> ...
> >>   }
> >> }
> >>
> >> SqlTransform would expand into a different composite, and it would be a
> straightforward ParDo to adjust the data, possibly automatic via the new
> schema conversions.
> >>
> >> Embedding the window would be a bit wonky, something like {
> end_of_window: TIMESTAMP, encoded_window: bytes } which would be expensive
> due to encoding. But timestamp and pane info not so bad.
> >>
> >> Kenn
> >>
> >> From: Anton Kedin 
> >> Date: Tue, May 14, 2019 at 9:17 AM
> >> To: 
> >>
> >>> Reza, can you share more thoughts on how you think this can work
> end-to-end?
> >>>
> >>> Currently the approach is that populating the rows with the data
> happens before the SqlTransform, and within the query you can only use the
> things that are already in the rows or in the catalog/schema (or built-in
> things). In general case populating the rows with any data can be solved
> via a ParDo before SqlTransform. Do you think this approach lacks something
> or is maybe too verbose?
> >>>
> >>> My thoughts on this, lacking more info or concrete examples: in order
> to access a timestamp value from within a query there has to be a syntax
> for it. Field access expressions or function calls are the only things that
> come to mind among existing syntax 
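Anton's suggestion of populating rows in a ParDo before SqlTransform, and Kenn's derived schema above, can be illustrated with a small Beam-free Python sketch. All names here (`reify_with_metadata`, the field layout) are illustrative assumptions, not Beam's actual API:

```python
# Illustrative sketch only: simulates the schema "reification" that a
# hypothetical SqlTransform.withMetadata() could perform, using plain
# dicts instead of Beam Rows.

def reify_with_metadata(row, event_timestamp, pane_info):
    """Wrap a row so queries can reference metadata fields directly."""
    return {
        "event_timestamp": event_timestamp,  # promoted to a top-level column
        "pane_info": pane_info,              # nested struct
        "value": dict(row),                  # the original row, nested
    }

row = {"field1": "a", "field2": 42}
wrapped = reify_with_metadata(row, event_timestamp=1557900000,
                              pane_info={"index": 0})

# A query like "SELECT event_timestamp, value.field1 FROM PCOLLECTION"
# would then resolve against the derived schema:
assert wrapped["event_timestamp"] == 1557900000
assert wrapped["value"]["field1"] == "a"
```

Under this shape, the user never hand-writes a wider schema per pipeline; the wrapping ParDo (or the hypothetical withMetadata expansion) does it uniformly.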

Re: Semantics of PCollection.isBounded

2019-05-15 Thread Jan Lukavský
Just to clarify: I understand that changing the semantics of 
PCollection.isBounded is probably impossible now, because it would 
introduce a chicken-and-egg problem. Maybe I will state it more 
clearly - would it be better to be able to run bounded pipelines using 
batch semantics on DirectRunner (including sorting before stateful 
ParDos), or would it be better to come up with some way to notify the 
pipeline that it will be running in a streaming way although it 
consists only of bounded inputs? And I'm not saying how to do it, just 
trying to find out if anyone else has ever had such a need.


Jan

On 5/15/19 5:20 PM, Jan Lukavský wrote:

Hi,

I have come across unexpected (at least for me) behavior - an apparent 
inconsistency between how a PCollection is processed in 
DirectRunner and what PCollection.isBounded signals. Let me explain:


 - I have a stateful ParDo which needs to make sure that elements 
arrive in order - it accomplishes this by defining a BagState for 
buffering input elements and sorting them inside this buffer; it also 
keeps track of the element with the highest timestamp to estimate a 
local watermark (minus some allowed lateness), to know when to remove 
elements from the buffer, sort them by time, and pass them on to some 
(time-ordered) processing


 - this seems to work well for streaming (unbounded) data

 - for batch (bounded) data the semantics of stateful ParDo should be 
(please correct me if I'm wrong) that elements always arrive in order, 
because the runner can sort them by timestamp


 - this implies that for batch-processed (bounded) input the 
allowedLateness can be set to zero, so that the processing is a little 
more efficient, because it doesn't have to use the BagState at all


 - now, the trouble seems to be that DirectRunner always uses 
streaming processing, even if the input is bounded (which is by 
definition possible), but there is currently no way to know when it is 
safe to change allowed lateness to zero (because input will arrive 
ordered)


 - so it seems to me that either DirectRunner should apply sorting 
to stateful ParDo when it processes bounded data (the same way that 
other runners do), or it can apply streaming processing, but then it 
should change PCollection.isBounded to UNBOUNDED, even if the input is 
originally bounded


 - that way, the semantics of PCollection.isBounded would reflect not 
whether the data are known in advance to be finite, but *how* the data 
are going to be processed, which is much more valuable (IMO)


Any thoughts?

 Jan
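Outside of Beam, the buffering scheme described above can be sketched in a few lines of plain Python (the class name and structure are illustrative assumptions, not Beam's state APIs):

```python
# Minimal, Beam-free sketch of the BagState buffering Jan describes:
# hold elements, track the max timestamp seen, and release in timestamp
# order everything at or below (max_timestamp - allowed_lateness).
import heapq

class OrderingBuffer:
    def __init__(self, allowed_lateness):
        self.allowed_lateness = allowed_lateness
        self.buffer = []          # min-heap of (timestamp, element)
        self.max_timestamp = float("-inf")

    def add(self, timestamp, element):
        """Buffer an element and emit whatever the local watermark allows."""
        heapq.heappush(self.buffer, (timestamp, element))
        self.max_timestamp = max(self.max_timestamp, timestamp)
        watermark = self.max_timestamp - self.allowed_lateness
        ready = []
        while self.buffer and self.buffer[0][0] <= watermark:
            ready.append(heapq.heappop(self.buffer)[1])
        return ready  # elements released in timestamp order

buf = OrderingBuffer(allowed_lateness=5)
out = []
for ts, e in [(10, "a"), (8, "b"), (20, "c"), (30, "d")]:
    out.extend(buf.add(ts, e))
# out == ["b", "a", "c"]: released in timestamp order as the watermark advances
```

With allowed_lateness=0 - the setting Jan argues is safe for bounded, pre-sorted input - the buffer drains as soon as any later timestamp arrives, which is exactly why knowing the processing mode (not just boundedness) matters.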



Re: Contributing Beam Kata (Java & Python)

2019-05-15 Thread Pablo Estrada
I think a blog post is a great idea. Would you be able to write one?
See here for an example of a PR adding one:
https://github.com/apache/beam/pull/8391

Best
-P.

*From: *hsuryawira...@google.com 
*Date: *Wed, May 15, 2019, 12:51 AM
*To: * 

Thanks for merging it Reuven!
>
> Quick question, would it be useful if we write a blog post on the Kata so
> that we can build more awareness for people to try out?
> I've also uploaded the course to Stepik which has seamless integration
> within the IDE for people to easily start the course.
>
> On 2019/05/14 20:35:07, Reuven Lax  wrote:
> > Merged
> >
> > *From: *Reza Rokni 
> > *Date: *Tue, May 14, 2019 at 1:29 PM
> > *To: * 
> > *Cc: *Lars Francke
> >
> > +1 :-)
> > >
> > > *From: *Lukasz Cwik 
> > > *Date: *Wed, 15 May 2019 at 04:29
> > > *To: *dev
> > > *Cc: *Lars Francke
> > >
> > > +1
> > >>
> > >> *From: *Pablo Estrada 
> > >> *Date: *Tue, May 14, 2019 at 1:27 PM
> > >> *To: *dev
> > >> *Cc: *Lars Francke
> > >>
> > >> +1 on merging.
> > >>>
> > >>> *From: *Reuven Lax 
> > >>> *Date: *Tue, May 14, 2019 at 1:23 PM
> > >>> *To: *dev
> > >>> *Cc: *Lars Francke
> > >>>
> > >>> I've been playing around with this the past day or two, and it's
> >  great! I'm inclined to merge this PR (if nobody objects) so that
> others in
> >  the community can contribute more training katas.
> > 
> >  Reuven
> > 
> >  *From: *Ismaël Mejía 
> >  *Date: *Tue, Apr 23, 2019 at 6:43 AM
> >  *To: *Lars Francke
> >  *Cc: * 
> > 
> >  Thanks for answering Lars,
> > >
> > > The 'interesting' part is that the tutorial has a full IDE
> integrated
> > > experience based on the Jetbrains edu platform [1]. So maybe
> > > interesting to see if it could make sense to have projects like
> this
> > > in the new trainings incubator project or if they became too
> platform
> > > constrained.
> > >
> > > This contribution is valuable for Beam but the community may decide
> > > that it makes sense for it to live at some moment at the trainings
> > > project. I suppose also Henry could be interested in taking a look
> at
> > > this [2].
> > >
> > > [1] https://www.jetbrains.com/education/
> > > [2] https://incubator.apache.org/clutch/training.html
> > >
> > > On Tue, Apr 23, 2019 at 3:00 PM Lars Francke <
> lars.fran...@gmail.com>
> > > wrote:
> > > >
> > > > Thanks Ismaël.
> > > >
> > > > I must admit I'm a tad confused. What has JetBrains got to do
> with
> > > this?
> > > > This looks pretty cool and specific to Beam though, or is this
> more
> > > generic?
> > > > But yeah something along those lines could be interesting for
> > > hands-on type things in training.
> > > >
> > > > On Fri, Apr 19, 2019 at 12:10 PM Ismaël Mejía  >
> > > wrote:
> > > >>
> > > >> +lars.fran...@gmail.com who is in the Apache training project
> and
> > > may
> > > >> be interested in this one or at least the JetBrains like
> approach.
> > > >>
> > > >> On Fri, Apr 19, 2019 at 12:01 PM Ismaël Mejía <
> ieme...@gmail.com>
> > > wrote:
> > > >> >
> > > >> > This looks great, nice for bringing this to the project Henry!
> > > >> >
> > > >> > On Fri, Apr 19, 2019 at 10:53 AM hsuryawira...@google.com
> > > >> >  wrote:
> > > >> > >
> > > >> > > Thanks Altay.
> > > >> > > I'll create it under "learning/" first as this is not exactly
> > > an example.
> > > >> > > Please do let me know if it's not the right place.
> > > >> > >
> > > >> > > On 2019/04/18 22:49:47, Ahmet Altay 
> wrote:
> > > >> > > > This looks great.
> > > >> > > >
> > > >> > > > +David Cavazos  was working on
> > > interactive colab based
> > > >> > > > examples (https://github.com/apache/beam/pull/7679)
> perhaps
> > > we can have a
> > > >> > > > shared place for these two similar things.
> > > >> > > >
> > > >> > >
> > >
> > 
> > >
> > > --
> > >
> > > This email may be confidential and privileged. If you received this
> > > communication by mistake, please don't forward it to anyone else,
> please
> > > erase all copies and attachments, and please let me know that it has
> gone
> > > to the wrong person.
> > >
> > > The above terms reflect a potential business arrangement, are provided
> > > solely as a basis for further discussion, and are not intended to be
> and do
> > > not constitute a legally binding obligation. No legally binding
> obligations
> > > will be created, implied, or inferred until an agreement in final form
> is
> > > executed in writing by all parties involved.
> > >
> >
>


Re: Dealing with incompatible changes in build system on LTS releases

2019-05-15 Thread Lukasz Cwik
*From: *Michael Luckey 
*Date: *Tue, May 14, 2019 at 11:42 PM
*To: * 

Hi,
>
> do we currently have a strategy for handling LTS releases in the context
> of incompatible changes on the build system?
>
> As far as I can see, the problem is (at least) twofold.
>
> 1. Incompatible changes on test-infra job definitions
>
> There might be changes in our groovy files which make it impossible to
> build/test an old branch on Jenkins. How do we intend to handle this? Of
> course, in such cases we could run the seed job and reset Jenkins to the
> corresponding old state, but this will impact or even stall development on
> master.
>

There was a point in time where people were looking at migrating to Jenkins
pipelines since the pipeline definitions are in the source repository. Any
solution that moves configuration to be based upon the source repository
and out of Jenkins would address/mitigate this issue.


> 2. Incompatible changes on agents
>
> Even worse, we might introduce changes on the agents themselves, which will
> render it impossible to successfully seed to that legacy state. Do we
> have any option to revert to an old Jenkins agent setup in such cases? I am
> currently unaware of a link from the Apache repo to the Jenkins configuration
> state that would enable restoration of (old) agents - is there such a thing?
>
> Would it be possible/helpful to subdivide our Jenkins agent pool in some
> way so that the seed job could be run only on a dedicated subgroup (which
> then could be set to an old state)? If I recall correctly, Yifan put a lot
> of effort into migrating our agents to the newer JNLP approach and used a
> 'private' agent to do the required testing. I assume this was a manual
> setup and is not automated to be useful in such cases?
>
> What do others think about this issue? Is it something to follow on or
> more of a non issue?
>
>
What about dockerizing the builds/tests? This would allow us to use older
versions of the docker container for older branches. Contributors
interested in Python development were looking at this for testing Py2 and
Py3 at the same time [1].


> Best,
>
> michel
>
>
>
1:
https://lists.apache.org/thread.html/4be0f687135b7c6778224dd76389a39f9ebf78a3cf9c4cb4e76ebb73@%3Cdev.beam.apache.org%3E
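One way to picture the dockerized approach (all image and branch names below are hypothetical): each branch pins the build-environment image it was cut with, so reseeding Jenkins for master never changes what an LTS branch builds against.

```python
# Hypothetical sketch: pin a build-environment image per branch so LTS
# branches stay immune to later changes on the shared Jenkins agents.
BUILD_IMAGES = {
    "master": "beam-build-env:latest",
    "release-2.7-lts": "beam-build-env:2.7",  # frozen at LTS cut time
}

def build_command(branch):
    """Assemble the docker invocation that runs the checks for a branch."""
    image = BUILD_IMAGES.get(branch, BUILD_IMAGES["master"])
    return ["docker", "run", "--rm", image, "./gradlew", "check"]

# An LTS branch resolves to its frozen toolchain image:
assert build_command("release-2.7-lts")[3] == "beam-build-env:2.7"
```

The mapping lives in the source tree, which is the same property that made Jenkins pipelines attractive above: the CI configuration travels with the branch instead of living only in Jenkins.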


Re: [ANNOUNCE] New PMC Member: Pablo Estrada

2019-05-15 Thread Charles Chen
Congrats Pablo and thank you for your contributions!

On Wed, May 15, 2019, 10:53 AM Valentyn Tymofieiev 
wrote:

> Congrats, Pablo!
>
> On Wed, May 15, 2019 at 10:41 AM Yifan Zou  wrote:
>
>> Congratulations, Pablo!
>>
>> *From: *Maximilian Michels 
>> *Date: *Wed, May 15, 2019 at 2:06 AM
>> *To: * 
>>
>> Congrats Pablo! Thank you for your help to grow the Beam community!
>>>
>>> On 15.05.19 10:33, Tim Robertson wrote:
>>> > Congratulations Pablo
>>> >
>>> > On Wed, May 15, 2019 at 10:22 AM Ismaël Mejía >> > > wrote:
>>> >
>>> > Congrats Pablo, well deserved, nice to see your work recognized!
>>> >
>>> > On Wed, May 15, 2019 at 9:59 AM Pei HE >> > > wrote:
>>> >  >
>>> >  > Congrats, Pablo!
>>> >  >
>>> >  > On Tue, May 14, 2019 at 11:41 PM Tanay Tummalapalli
>>> >  > mailto:ttanay.apa...@gmail.com>>
>>> wrote:
>>> >  > >
>>> >  > > Congratulations Pablo!
>>> >  > >
>>> >  > > On Wed, May 15, 2019, 12:08 Michael Luckey <
>>> adude3...@gmail.com
>>> > > wrote:
>>> >  > >>
>>> >  > >> Congrats, Pablo!
>>> >  > >>
>>> >  > >> On Wed, May 15, 2019 at 8:21 AM Connell O'Callaghan
>>> > mailto:conne...@google.com>> wrote:
>>> >  > >>>
>>> >  > >>> Awesome well done Pablo!!!
>>> >  > >>>
>>> >  > >>> Kenn thank you for sharing this great news with us!!!
>>> >  > >>>
>>> >  > >>> On Tue, May 14, 2019 at 11:01 PM Ahmet Altay
>>> > mailto:al...@google.com>> wrote:
>>> >  > 
>>> >  >  Congratulations!
>>> >  > 
>>> >  >  On Tue, May 14, 2019 at 9:11 PM Robert Burke
>>> > mailto:rob...@frantil.com>> wrote:
>>> >  > >
>>> >  > > Woohoo! Well deserved.
>>> >  > >
>>> >  > > On Tue, May 14, 2019, 8:34 PM Reuven Lax <
>>> re...@google.com
>>> > > wrote:
>>> >  > >>
>>> >  > >> Congratulations!
>>> >  > >>
>>> >  > >> From: Mikhail Gryzykhin >> > >
>>> >  > >> Date: Tue, May 14, 2019 at 8:32 PM
>>> >  > >> To: mailto:dev@beam.apache.org>>
>>> >  > >>
>>> >  > >>> Congratulations Pablo!
>>> >  > >>>
>>> >  > >>> On Tue, May 14, 2019, 20:25 Kenneth Knowles
>>> > mailto:k...@apache.org>> wrote:
>>> >  > 
>>> >  >  Hi all,
>>> >  > 
>>> >  >  Please join me and the rest of the Beam PMC in
>>> welcoming
>>> > Pablo Estrada to join the PMC.
>>> >  > 
>>> >  >  Pablo first picked up BEAM-722 in October of 2016 and
>>> > has been a steady part of the Beam community since then. In
>>> addition
>>> > to technical work on Beam Python & Java & runners, I would
>>> highlight
>>> > how Pablo grows Beam's community by helping users, working on GSoC,
>>> > giving talks at Beam Summits and other OSS conferences including
>>> > Flink Forward, and holding training workshops. I cannot do justice
>>> > to Pablo's contributions in a single paragraph.
>>> >  > 
>>> >  >  Thanks Pablo, for being a part of Beam.
>>> >  > 
>>> >  >  Kenn
>>> >
>>>
>>


Re: [ANNOUNCE] New PMC Member: Pablo Estrada

2019-05-15 Thread Valentyn Tymofieiev
Congrats, Pablo!

On Wed, May 15, 2019 at 10:41 AM Yifan Zou  wrote:

> Congratulations, Pablo!
>
> *From: *Maximilian Michels 
> *Date: *Wed, May 15, 2019 at 2:06 AM
> *To: * 
>
> Congrats Pablo! Thank you for your help to grow the Beam community!
>>
>> On 15.05.19 10:33, Tim Robertson wrote:
>> > Congratulations Pablo
>> >
>> > On Wed, May 15, 2019 at 10:22 AM Ismaël Mejía > > > wrote:
>> >
>> > Congrats Pablo, well deserved, nice to see your work recognized!
>> >
>> > On Wed, May 15, 2019 at 9:59 AM Pei HE > > > wrote:
>> >  >
>> >  > Congrats, Pablo!
>> >  >
>> >  > On Tue, May 14, 2019 at 11:41 PM Tanay Tummalapalli
>> >  > mailto:ttanay.apa...@gmail.com>>
>> wrote:
>> >  > >
>> >  > > Congratulations Pablo!
>> >  > >
>> >  > > On Wed, May 15, 2019, 12:08 Michael Luckey <
>> adude3...@gmail.com
>> > > wrote:
>> >  > >>
>> >  > >> Congrats, Pablo!
>> >  > >>
>> >  > >> On Wed, May 15, 2019 at 8:21 AM Connell O'Callaghan
>> > mailto:conne...@google.com>> wrote:
>> >  > >>>
>> >  > >>> Awesome well done Pablo!!!
>> >  > >>>
>> >  > >>> Kenn thank you for sharing this great news with us!!!
>> >  > >>>
>> >  > >>> On Tue, May 14, 2019 at 11:01 PM Ahmet Altay
>> > mailto:al...@google.com>> wrote:
>> >  > 
>> >  >  Congratulations!
>> >  > 
>> >  >  On Tue, May 14, 2019 at 9:11 PM Robert Burke
>> > mailto:rob...@frantil.com>> wrote:
>> >  > >
>> >  > > Woohoo! Well deserved.
>> >  > >
>> >  > > On Tue, May 14, 2019, 8:34 PM Reuven Lax > > > wrote:
>> >  > >>
>> >  > >> Congratulations!
>> >  > >>
>> >  > >> From: Mikhail Gryzykhin > > >
>> >  > >> Date: Tue, May 14, 2019 at 8:32 PM
>> >  > >> To: mailto:dev@beam.apache.org>>
>> >  > >>
>> >  > >>> Congratulations Pablo!
>> >  > >>>
>> >  > >>> On Tue, May 14, 2019, 20:25 Kenneth Knowles
>> > mailto:k...@apache.org>> wrote:
>> >  > 
>> >  >  Hi all,
>> >  > 
>> >  >  Please join me and the rest of the Beam PMC in welcoming
>> > Pablo Estrada to join the PMC.
>> >  > 
>> >  >  Pablo first picked up BEAM-722 in October of 2016 and
>> > has been a steady part of the Beam community since then. In addition
>> > to technical work on Beam Python & Java & runners, I would highlight
>> > how Pablo grows Beam's community by helping users, working on GSoC,
>> > giving talks at Beam Summits and other OSS conferences including
>> > Flink Forward, and holding training workshops. I cannot do justice
>> > to Pablo's contributions in a single paragraph.
>> >  > 
>> >  >  Thanks Pablo, for being a part of Beam.
>> >  > 
>> >  >  Kenn
>> >
>>
>


Re: [ANNOUNCE] New PMC Member: Pablo Estrada

2019-05-15 Thread Chamikara Jayalath
Congrats Pablo!!

On Wed, May 15, 2019 at 10:46 AM Alan Myrvold  wrote:

> Congrats, Pablo!
>
> *From: *Robin Qiu 
> *Date: *Wed, May 15, 2019 at 10:44 AM
> *To: * 
>
> Congratulations, Pablo!!
>>
>> On Wed, May 15, 2019 at 10:43 AM Pablo Estrada 
>> wrote:
>>
>>> Thanks everyone for the encouragement, and thanks to the PMC for the
>>> recognition. I am honored and grateful. :)
>>> Best
>>> -P.
>>>
>>>
>>> *From: *Kenneth Knowles 
>>> *Date: *Tue, May 14, 2019, 10:25 PM
>>> *To: *dev
>>>
>>> Hi all,

 Please join me and the rest of the Beam PMC in welcoming Pablo Estrada
 to join the PMC.

 Pablo first picked up BEAM-722 in October of 2016 and has been a steady
 part of the Beam community since then. In addition to technical work on
 Beam Python & Java & runners, I would highlight how Pablo grows Beam's
 community by helping users, working on GSoC, giving talks at Beam Summits
 and other OSS conferences including Flink Forward, and holding training
 workshops. I cannot do justice to Pablo's contributions in a single
 paragraph.

 Thanks Pablo, for being a part of Beam.

 Kenn

>>>


Re: [ANNOUNCE] New PMC Member: Pablo Estrada

2019-05-15 Thread Alan Myrvold
Congrats, Pablo!

*From: *Robin Qiu 
*Date: *Wed, May 15, 2019 at 10:44 AM
*To: * 

Congratulations, Pablo!!
>
> On Wed, May 15, 2019 at 10:43 AM Pablo Estrada  wrote:
>
>> Thanks everyone for the encouragement, and thanks to the PMC for the
>> recognition. I am honored and grateful. :)
>> Best
>> -P.
>>
>>
>> *From: *Kenneth Knowles 
>> *Date: *Tue, May 14, 2019, 10:25 PM
>> *To: *dev
>>
>> Hi all,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming Pablo Estrada
>>> to join the PMC.
>>>
>>> Pablo first picked up BEAM-722 in October of 2016 and has been a steady
>>> part of the Beam community since then. In addition to technical work on
>>> Beam Python & Java & runners, I would highlight how Pablo grows Beam's
>>> community by helping users, working on GSoC, giving talks at Beam Summits
>>> and other OSS conferences including Flink Forward, and holding training
>>> workshops. I cannot do justice to Pablo's contributions in a single
>>> paragraph.
>>>
>>> Thanks Pablo, for being a part of Beam.
>>>
>>> Kenn
>>>
>>


Re: [ANNOUNCE] New PMC Member: Pablo Estrada

2019-05-15 Thread Robin Qiu
Congratulations, Pablo!!

On Wed, May 15, 2019 at 10:43 AM Pablo Estrada  wrote:

> Thanks everyone for the encouragement, and thanks to the PMC for the
> recognition. I am honored and grateful. :)
> Best
> -P.
>
>
> *From: *Kenneth Knowles 
> *Date: *Tue, May 14, 2019, 10:25 PM
> *To: *dev
>
> Hi all,
>>
>> Please join me and the rest of the Beam PMC in welcoming Pablo Estrada to
>> join the PMC.
>>
>> Pablo first picked up BEAM-722 in October of 2016 and has been a steady
>> part of the Beam community since then. In addition to technical work on
>> Beam Python & Java & runners, I would highlight how Pablo grows Beam's
>> community by helping users, working on GSoC, giving talks at Beam Summits
>> and other OSS conferences including Flink Forward, and holding training
>> workshops. I cannot do justice to Pablo's contributions in a single
>> paragraph.
>>
>> Thanks Pablo, for being a part of Beam.
>>
>> Kenn
>>
>


Re: [ANNOUNCE] New PMC Member: Pablo Estrada

2019-05-15 Thread Pablo Estrada
Thanks everyone for the encouragement, and thanks to the PMC for the
recognition. I am honored and grateful. :)
Best
-P.


*From: *Kenneth Knowles 
*Date: *Tue, May 14, 2019, 10:25 PM
*To: *dev

Hi all,
>
> Please join me and the rest of the Beam PMC in welcoming Pablo Estrada to
> join the PMC.
>
> Pablo first picked up BEAM-722 in October of 2016 and has been a steady
> part of the Beam community since then. In addition to technical work on
> Beam Python & Java & runners, I would highlight how Pablo grows Beam's
> community by helping users, working on GSoC, giving talks at Beam Summits
> and other OSS conferences including Flink Forward, and holding training
> workshops. I cannot do justice to Pablo's contributions in a single
> paragraph.
>
> Thanks Pablo, for being a part of Beam.
>
> Kenn
>


Re: [ANNOUNCE] New PMC Member: Pablo Estrada

2019-05-15 Thread Yifan Zou
Congratulations, Pablo!

*From: *Maximilian Michels 
*Date: *Wed, May 15, 2019 at 2:06 AM
*To: * 

Congrats Pablo! Thank you for your help to grow the Beam community!
>
> On 15.05.19 10:33, Tim Robertson wrote:
> > Congratulations Pablo
> >
> > On Wed, May 15, 2019 at 10:22 AM Ismaël Mejía  > > wrote:
> >
> > Congrats Pablo, well deserved, nice to see your work recognized!
> >
> > On Wed, May 15, 2019 at 9:59 AM Pei HE  > > wrote:
> >  >
> >  > Congrats, Pablo!
> >  >
> >  > On Tue, May 14, 2019 at 11:41 PM Tanay Tummalapalli
> >  > mailto:ttanay.apa...@gmail.com>> wrote:
> >  > >
> >  > > Congratulations Pablo!
> >  > >
> >  > > On Wed, May 15, 2019, 12:08 Michael Luckey  > > wrote:
> >  > >>
> >  > >> Congrats, Pablo!
> >  > >>
> >  > >> On Wed, May 15, 2019 at 8:21 AM Connell O'Callaghan
> > mailto:conne...@google.com>> wrote:
> >  > >>>
> >  > >>> Awesome well done Pablo!!!
> >  > >>>
> >  > >>> Kenn thank you for sharing this great news with us!!!
> >  > >>>
> >  > >>> On Tue, May 14, 2019 at 11:01 PM Ahmet Altay
> > mailto:al...@google.com>> wrote:
> >  > 
> >  >  Congratulations!
> >  > 
> >  >  On Tue, May 14, 2019 at 9:11 PM Robert Burke
> > mailto:rob...@frantil.com>> wrote:
> >  > >
> >  > > Woohoo! Well deserved.
> >  > >
> >  > > On Tue, May 14, 2019, 8:34 PM Reuven Lax  > > wrote:
> >  > >>
> >  > >> Congratulations!
> >  > >>
> >  > >> From: Mikhail Gryzykhin  > >
> >  > >> Date: Tue, May 14, 2019 at 8:32 PM
> >  > >> To: mailto:dev@beam.apache.org>>
> >  > >>
> >  > >>> Congratulations Pablo!
> >  > >>>
> >  > >>> On Tue, May 14, 2019, 20:25 Kenneth Knowles
> > mailto:k...@apache.org>> wrote:
> >  > 
> >  >  Hi all,
> >  > 
> >  >  Please join me and the rest of the Beam PMC in welcoming
> > Pablo Estrada to join the PMC.
> >  > 
> >  >  Pablo first picked up BEAM-722 in October of 2016 and
> > has been a steady part of the Beam community since then. In addition
> > to technical work on Beam Python & Java & runners, I would highlight
> > how Pablo grows Beam's community by helping users, working on GSoC,
> > giving talks at Beam Summits and other OSS conferences including
> > Flink Forward, and holding training workshops. I cannot do justice
> > to Pablo's contributions in a single paragraph.
> >  > 
> >  >  Thanks Pablo, for being a part of Beam.
> >  > 
> >  >  Kenn
> >
>


Re: Contributing Beam Kata (Java & Python)

2019-05-15 Thread Austin Bennett
Stepik: should we be thinking about release versions for what gets uploaded
there?  If the point of merging was to encourage additional contributions,
then should we also have a mechanism for publishing the updates there?  In
that case, uploading to Stepik should be part of publishing releases?

On Tue, May 14, 2019 at 10:51 PM hsuryawira...@google.com <
hsuryawira...@google.com> wrote:

> Thanks for merging it Reuven!
>
> Quick question, would it be useful if we write a blog post on the Kata so
> that we can build more awareness for people to try out?
> I've also uploaded the course to Stepik which has seamless integration
> within the IDE for people to easily start the course.
>
> On 2019/05/14 20:35:07, Reuven Lax  wrote:
> > Merged
> >
> > *From: *Reza Rokni 
> > *Date: *Tue, May 14, 2019 at 1:29 PM
> > *To: * 
> > *Cc: *Lars Francke
> >
> > +1 :-)
> > >
> > > *From: *Lukasz Cwik 
> > > *Date: *Wed, 15 May 2019 at 04:29
> > > *To: *dev
> > > *Cc: *Lars Francke
> > >
> > > +1
> > >>
> > >> *From: *Pablo Estrada 
> > >> *Date: *Tue, May 14, 2019 at 1:27 PM
> > >> *To: *dev
> > >> *Cc: *Lars Francke
> > >>
> > >> +1 on merging.
> > >>>
> > >>> *From: *Reuven Lax 
> > >>> *Date: *Tue, May 14, 2019 at 1:23 PM
> > >>> *To: *dev
> > >>> *Cc: *Lars Francke
> > >>>
> > >>> I've been playing around with this the past day or two, and it's
> >  great! I'm inclined to merge this PR (if nobody objects) so that
> others in
> >  the community can contribute more training katas.
> > 
> >  Reuven
> > 
> >  *From: *Ismaël Mejía 
> >  *Date: *Tue, Apr 23, 2019 at 6:43 AM
> >  *To: *Lars Francke
> >  *Cc: * 
> > 
> >  Thanks for answering Lars,
> > >
> > > The 'interesting' part is that the tutorial has a full IDE
> integrated
> > > experience based on the Jetbrains edu platform [1]. So maybe
> > > interesting to see if it could make sense to have projects like
> this
> > > in the new trainings incubator project or if they became too
> platform
> > > constrained.
> > >
> > > This contribution is valuable for Beam but the community may decide
> > > that it makes sense for it to live at some moment at the trainings
> > > project. I suppose also Henry could be interested in taking a look
> at
> > > this [2].
> > >
> > > [1] https://www.jetbrains.com/education/
> > > [2] https://incubator.apache.org/clutch/training.html
> > >
> > > On Tue, Apr 23, 2019 at 3:00 PM Lars Francke <
> lars.fran...@gmail.com>
> > > wrote:
> > > >
> > > > Thanks Ismaël.
> > > >
> > > > I must admit I'm a tad confused. What has JetBrains got to do
> with
> > > this?
> > > > This looks pretty cool and specific to Beam though, or is this
> more
> > > generic?
> > > > But yeah something along those lines could be interesting for
> > > hands-on type things in training.
> > > >
> > > > On Fri, Apr 19, 2019 at 12:10 PM Ismaël Mejía  >
> > > wrote:
> > > >>
> > > >> +lars.fran...@gmail.com who is in the Apache training project
> and
> > > may
> > > >> be interested in this one or at least the JetBrains like
> approach.
> > > >>
> > > >> On Fri, Apr 19, 2019 at 12:01 PM Ismaël Mejía <
> ieme...@gmail.com>
> > > wrote:
> > > >> >
> > > >> > This looks great, nice for bringing this to the project Henry!
> > > >> >
> > > >> > On Fri, Apr 19, 2019 at 10:53 AM hsuryawira...@google.com
> > > >> >  wrote:
> > > >> > >
> > > >> > > Thanks Altay.
> > > >> > > I'll create it under "learning/" first as this is not exactly
> > > an example.
> > > >> > > Please do let me know if it's not the right place.
> > > >> > >
> > > >> > > On 2019/04/18 22:49:47, Ahmet Altay 
> wrote:
> > > >> > > > This looks great.
> > > >> > > >
> > > >> > > > +David Cavazos  was working on
> > > interactive colab based
> > > >> > > > examples (https://github.com/apache/beam/pull/7679)
> perhaps
> > > we can have a
> > > >> > > > shared place for these two similar things.
> > > >> > > >
> > > >> > >
> > >
> > 
> > >
> > >
> >
>


Semantics of PCollection.isBounded

2019-05-15 Thread Jan Lukavský

Hi,

I have come across unexpected (at least for me) behavior - an apparent 
inconsistency between how a PCollection is processed in DirectRunner 
and what PCollection.isBounded signals. Let me explain:


 - I have a stateful ParDo which needs to make sure that elements 
arrive in order - it accomplishes this by defining a BagState for 
buffering input elements and sorting them inside this buffer; it also 
keeps track of the element with the highest timestamp to estimate a 
local watermark (minus some allowed lateness), to know when to remove 
elements from the buffer, sort them by time, and pass them on to some 
(time-ordered) processing


 - this seems to work well for streaming (unbounded) data

 - for batch (bounded) data the semantics of stateful ParDo should be 
(please correct me if I'm wrong) that elements always arrive in order, 
because the runner can sort them by timestamp


 - this implies that for batch-processed (bounded) input the 
allowedLateness can be set to zero, so that the processing is a little 
more efficient, because it doesn't have to use the BagState at all


 - now, the trouble seems to be that DirectRunner always uses 
streaming processing, even if the input is bounded (which is by 
definition possible), but there is currently no way to know when it is 
safe to change allowed lateness to zero (because input will arrive 
ordered)


 - so it seems to me that either DirectRunner should apply sorting 
to stateful ParDo when it processes bounded data (the same way that 
other runners do), or it can apply streaming processing, but then it 
should change PCollection.isBounded to UNBOUNDED, even if the input is 
originally bounded


 - that way, the semantics of PCollection.isBounded would reflect not 
whether the data are known in advance to be finite, but *how* the data 
are going to be processed, which is much more valuable (IMO)


Any thoughts?

 Jan



Re: Intro

2019-05-15 Thread Cyrus Maden
Welcome!

On Tue, May 14, 2019 at 4:36 PM Robert Burke  wrote:

> Welcome aboard :D
>
> On Tue, 14 May 2019 at 13:28, Ahmet Altay  wrote:
>
>> Welcome! Added you as a contributor to JIRA.
>>
>> *From: *Damien Desfontaines 
>> *Date: *Tue, May 14, 2019 at 1:24 PM
>> *To: * 
>>
>> Hi folks,
>>>
>>> I'm Damien from the Anonymization team at Google. I might contribute a
>>> couple of PRs on the Go SDK. Can someone give me permission to assign Jira
>>> tickets to myself? My username is desfontaines.
>>>
>>> Thanks in advance!
>>>
>>> Damien
>>>
>>> --
>>> I'm working part-time, so I might not see your emails immediately. See
>>> go/man-ddf for more info =)
>>>
>>


Re: PardoLifeCycle: Teardown after failed call to setup

2019-05-15 Thread Robert Burke
What is the runner supposed to do to trigger the teardown of a given
bundle descriptor in an SDK harness?

Is there a Fn API call I'm not interpreting correctly that should reliably
trigger DoFn teardown, or generally signal that bundle processing is done?



On Wed, May 15, 2019, 6:51 AM Robert Bradshaw  wrote:

> This does bring up an interesting question though. Are runners
> violating (the intent of) the spec if they simply abandon/kill workers
> rather than gracefully bringing them down (e.g. so that these
> callbacks can be invoked)?
>
> On Tue, May 7, 2019 at 3:55 PM Michael Luckey  wrote:
> >
> > Thanks Kenn and Reuven. Based on your feedback, I amended to the PR [1]
> implementing the missing calls to teardown.
> >
> > Best,
> >
> > michel
> >
> > [1] https://github.com/apache/beam/pull/8495
> >
> > On Tue, May 7, 2019 at 6:09 AM Kenneth Knowles  wrote:
> >>
> >>
> >>
> >> On Mon, May 6, 2019 at 2:19 PM Reuven Lax  wrote:
> >>>
> >>>
> >>>
> >>> On Mon, May 6, 2019 at 2:06 PM Kenneth Knowles 
> wrote:
> 
>  The specification of TearDown is that it is best effort, certainly.
> >>>
> >>>
> >>> Though I believe the intent of that specification was that a runner
> will call it as long as the process itself has not crashed.
> >>
> >>
> >> Yea, exactly. Or more abstractly that a runner will call it unless it
> is impossible. If the hardware fails, a meteor strikes, etc, then teardown
> will not be called. But in normal operation, particularly when the user
> code throws a recoverable exception, it should be called.
> >>
> >> Kenn
> >>
> >>>
> >>>
> 
>  If your runner supports it, then the test is good to make sure there
> is not a regression. If your runner has partial support, that is within
> spec. But the idea of the spec is more that you might have such a failure
> that it is impossible to call the method, not simply never trying to call
> it.
> 
>  I think it seems to match what we do elsewhere: leave the test, add
> an annotation, and make a note in the capability matrix about the limitation
> on ParDo.
> 
>  Kenn
> 
>  On Mon, May 6, 2019 at 5:45 AM Michael Luckey 
> wrote:
> >
> > Hi,
> >
> > after stumbling upon [1] and trying to implement a fix [2], the
> ParDoLifeCycleTests are failing: the direct runner, spark
> validatesRunnerBatch, and flink validatesRunnerBatch fail because a DoFn's
> teardown is not invoked if its setup throws an exception.
> >
> > This seems to be in line with the specification [3], as this
> explicitly states that 'teardown might not be called if unnecessary as
> the process will be killed anyway'.
> >
> > Now I am a bit lost on how to resolve this situation. Currently, we
> seem to have the following options:
> > - remove the test, although it seems valuable in different (e.g.
> streaming?) cases
> > - to satisfy the test implement the call to teardown in runners
> although it seems unnecessary
> > - add another annotation @CallsTeardownAfterFailingSetup,
> @UsesFullParDoLifeCycle or such (would love to get suggestions for better
> name here)
> > - ?
> >
> > Thoughts?
> >
> > Best,
> >
> > michel
> >
> >
> >
> > [1] https://issues.apache.org/jira/browse/BEAM-7197
> > [2] https://github.com/apache/beam/pull/8495
> > [3]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/DoFn.java#L676-L680
>


Re: PardoLifeCycle: Teardown after failed call to setup

2019-05-15 Thread Robert Bradshaw
This does bring up an interesting question though. Are runners
violating (the intent of) the spec if they simply abandon/kill workers
rather than gracefully bringing them down (e.g. so that these
callbacks can be invoked)?

On Tue, May 7, 2019 at 3:55 PM Michael Luckey  wrote:
>
> Thanks Kenn and Reuven. Based on your feedback, I amended the PR [1],
> implementing the missing calls to teardown.
>
> Best,
>
> michel
>
> [1] https://github.com/apache/beam/pull/8495
>
> On Tue, May 7, 2019 at 6:09 AM Kenneth Knowles  wrote:
>>
>>
>>
>> On Mon, May 6, 2019 at 2:19 PM Reuven Lax  wrote:
>>>
>>>
>>>
>>> On Mon, May 6, 2019 at 2:06 PM Kenneth Knowles  wrote:

 The specification of TearDown is that it is best effort, certainly.
>>>
>>>
>>> Though I believe the intent of that specification was that a runner will 
>>> call it as long as the process itself has not crashed.
>>
>>
>> Yea, exactly. Or more abstractly that a runner will call it unless it is 
>> impossible. If the hardware fails, a meteor strikes, etc, then teardown will 
>> not be called. But in normal operation, particularly when the user code 
>> throws a recoverable exception, it should be called.
>>
>> Kenn
>>
>>>
>>>

 If your runner supports it, then the test is good to make sure there is 
 not a regression. If your runner has partial support, that is within spec. 
 But the idea of the spec is more that you might have such a failure that 
 it is impossible to call the method, not simply never trying to call it.

 I think it seems to match what we do elsewhere to leave the test, add an 
 annotation, make a note in the capability matrix about the limitation on 
 ParDo.

 Kenn

 On Mon, May 6, 2019 at 5:45 AM Michael Luckey  wrote:
>
> Hi,
>
> after stumbling upon [1] and trying to implement a fix [2],
> ParDoLifeCycleTests are failing: for the
> direct runner, spark validatesRunnerBatch and flink validatesRunnerBatch,
> the DoFn's teardown is not invoked if the DoFn's setup throws an exception.
>
> This seems to be in line with the specification [3], as this explicitly 
> states that 'teardown might not be called if unnecessary as the process
> will be killed anyway'.
>
> Now I am a bit lost on how to resolve this situation. Currently, we seem
> to have the following options:
> - remove the test, although it seems valuable in different (e.g. 
> streaming?) cases
> - to satisfy the test implement the call to teardown in runners although 
> it seems unnecessary
> - add another annotation @CallsTeardownAfterFailingSetup, 
> @UsesFullParDoLifeCycle or such (would love to get suggestions for better 
> name here)
> - ?
>
> Thoughts?
>
> Best,
>
> michel
>
>
>
> [1] https://issues.apache.org/jira/browse/BEAM-7197
> [2] https://github.com/apache/beam/pull/8495
> [3] 
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/DoFn.java#L676-L680
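
The contract under discussion can be sketched outside of Beam. Below is a toy, pure-Python simulation of a runner bundle loop that honors the intent of the spec: teardown is invoked even when setup throws a recoverable exception. This is an illustrative sketch only, not the actual runner or SDK code, and the class/function names are invented for the example.

```python
class SetupFailsDoFn:
    """Toy stand-in for a DoFn whose setup throws a recoverable exception."""

    def __init__(self):
        self.torn_down = False

    def setup(self):
        raise RuntimeError("recoverable setup failure")

    def process(self, element):
        yield element

    def teardown(self):
        self.torn_down = True


def run_bundle(dofn, elements):
    """Sketch of a runner honoring the intent of the spec: teardown is
    called unless the process itself dies, even if setup fails."""
    outputs = []
    try:
        dofn.setup()
        for e in elements:
            outputs.extend(dofn.process(e))
    except RuntimeError:
        pass  # a real runner would surface the error and retry the bundle
    finally:
        dofn.teardown()
    return outputs


fn = SetupFailsDoFn()
run_bundle(fn, [1, 2, 3])
print(fn.torn_down)  # True: teardown invoked despite the failing setup
```

A runner that simply abandons the worker would skip the `finally` block entirely, which is exactly the "best effort" boundary being debated above.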


Re: Problem with gzip

2019-05-15 Thread Robert Bradshaw
Interesting thread. Thanks for digging that up.

I would try the shuffle_mode=service experiment (forgot that wasn't
yet the default). If that doesn't do the trick: although avro as a
materialization format does not provide perfect parallelism, it should
be significantly better than what you have now (large gzip files) and
may be good enough.
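
For context on why gzip limits parallelism in the first place: a gzip member is a single sequential DEFLATE stream with no block index, so reaching uncompressed offset N requires decompressing everything before it, which is why one large .gz file is read by a single worker. A small stdlib sketch of that emulated seek:

```python
import gzip
import io

# A gzip stream has no block index, so "seeking" to uncompressed offset N
# means decompressing and discarding bytes 0..N first. This is why a large
# gzipped file cannot be split across workers at read time.
data = b"some line of input\n" * 1000
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(data)

buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode="rb") as f:
    f.seek(5000)            # emulated: gzip reads and discards 5000 bytes
    tail = f.read()

print(tail == data[5000:])  # True
```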

On Wed, May 15, 2019 at 2:34 PM Michael Luckey  wrote:
>
> @Robert
>
> > Does your suggestion imply that the points made by Eugene on BEAM-2803 do
> not apply (anymore) and the combined reshuffle could just be omitted?
>
> On Wed, May 15, 2019 at 1:00 PM Robert Bradshaw  wrote:
>>
>> Unfortunately the "write" portion of the reshuffle cannot be
>> parallelized more than the source that it's reading from. In my
>> experience, generally the read is the bottleneck in this case, but
>> it's possible (e.g. if the input compresses extremely well) that it is
>> the write that is slow (which you seem to indicate based on your
>> observation of the UI, right?).
>>
>> It could be that materializing to temporary files is cheaper than
>> materializing randomly to shuffle (especially on pre-portable Python).
>> In that case you could force a fusion break with a side input instead.
>> E.g.
>>
>> class FusionBreak(beam.PTransform):
>>     def expand(self, pcoll):
>>         # Create an empty PCollection that depends on pcoll.
>>         empty = pcoll | beam.FlatMap(lambda x: ())
>>         # Use this empty PCollection as a side input, which will force a fusion break.
>>         return pcoll | beam.Map(lambda x, unused: x,
>>                                 beam.pvalue.AsIterable(empty))
>>
>> which could be used in place of Reshard like
>>
>> p | beam.ReadFromGzipedFiles(...) | FusionBreak() | DoWork() ...
>>
>> You'll probably want to be sure to pass the use_fastavro experiment as well.
>>
>> On Wed, May 15, 2019 at 6:53 AM Niels Basjes  wrote:
>> >
>> > Hi
>> >
>> > This project is a completely different solution towards this problem, but 
>> > in the hadoop mapreduce context.
>> >
>> > https://github.com/nielsbasjes/splittablegzip
>> >
>> >
>> > I have used this a lot in the past.
>> > Perhaps porting this project to beam is an option?
>> >
>> > Niels Basjes
>> >
>> >
>> >
>> > On Tue, May 14, 2019, 20:45 Lukasz Cwik  wrote:
>> >>
>> >> Sorry I couldn't be more helpful.
>> >>
>> >> From: Allie Chen 
>> >> Date: Tue, May 14, 2019 at 10:09 AM
>> >> To: 
>> >> Cc: user
>> >>
>> >>> Thanks Lukasz. Unfortunately, decompressing the files is not an option 
>> >>> for us.
>> >>>
>> >>>
>> >>> I am trying to speed up Reshuffle step, since it waits for all data. 
>> >>> Here are two ways I have tried:
>> >>>
>> >>> 1.  add timestamps to the PCollection's elements after reading (since it 
>> >>> is a bounded source), then apply windowing before Reshuffle, but it still
>> >>> waits for all data.
>> >>>
>> >>>
>> >>> 2.  run the pipeline with --streaming flag, but it leads to an error: 
>> >>> Workflow failed. Causes: Expected custom source to have non-zero number 
>> >>> of splits. Also, I found in 
>> >>> https://beam.apache.org/documentation/sdks/python-streaming/#dataflowrunner-specific-features:
>> >>>
>> >>> DataflowRunner does not currently support the following Cloud Dataflow 
>> >>> specific features with Python streaming execution.
>> >>>
>> >>> Streaming autoscaling
>> >>>
>> >>> I doubt whether this approach can solve my issue.
>> >>>
>> >>>
>> >>> Thanks so much!
>> >>>
>> >>> Allie
>> >>>
>> >>>
>> >>> From: Lukasz Cwik 
>> >>> Date: Tue, May 14, 2019 at 11:16 AM
>> >>> To: dev
>> >>> Cc: user
>> >>>
>>  Do you need to perform any joins across the files (e.g. 
>>  Combine.perKey/GroupByKey/...)?
>>  If not, you could structure your pipeline
>>  ReadFromFileA --> Reshuffle(optional) --> CopyOfPipelineA
>>  ReadFromFileB --> Reshuffle(optional) --> CopyOfPipelineB
>>  ReadFromFileC --> Reshuffle(optional) --> CopyOfPipelineC
>>  and then run it as a batch pipeline.
>> 
>>  You can set --streaming=true on the pipeline and then it will run in a 
>>  streaming mode but streaming prioritizes low latency and correctness on 
>>  Google Cloud Dataflow so it will cost more to run your pipeline than in 
>>  batch mode. It may make more sense to store the data uncompressed as it 
>>  may be less expensive than paying the additional compute cost for 
>>  streaming.
>> 
>>  From: Allie Chen 
>>  Date: Tue, May 14, 2019 at 7:38 AM
>>  To: 
>>  Cc: user
>> 
>> > Is it possible to use windowing or somehow pretend it is streaming so 
>> > Reshuffle or GroupByKey won't wait until all data has been read?
>> >
>> > Thanks!
>> > Allie
>> >
>> > From: Lukasz Cwik 
>> > Date: Fri, May 10, 2019 at 5:36 PM
>> > To: dev
>> > Cc: user
>> >
>> >> There is no such flag to turn off fusion.
>> >>
>> >> Writing 100s of GiBs of uncompressed data to reshuffle will take time 
>> >> when it is limited to a 

Re: Problem with gzip

2019-05-15 Thread Michael Luckey
@Robert

Does your suggestion imply that the points made by Eugene on BEAM-2803 do
not apply (anymore) and the combined reshuffle could just be omitted?

On Wed, May 15, 2019 at 1:00 PM Robert Bradshaw  wrote:

> Unfortunately the "write" portion of the reshuffle cannot be
> parallelized more than the source that it's reading from. In my
> experience, generally the read is the bottleneck in this case, but
> it's possible (e.g. if the input compresses extremely well) that it is
> the write that is slow (which you seem to indicate based on your
> observation of the UI, right?).
>
> It could be that materializing to temporary files is cheaper than
> materializing randomly to shuffle (especially on pre-portable Python).
> In that case you could force a fusion break with a side input instead.
> E.g.
>
> class FusionBreak(beam.PTransform):
>     def expand(self, pcoll):
>         # Create an empty PCollection that depends on pcoll.
>         empty = pcoll | beam.FlatMap(lambda x: ())
>         # Use this empty PCollection as a side input, which will force a fusion break.
>         return pcoll | beam.Map(lambda x, unused: x,
>                                 beam.pvalue.AsIterable(empty))
>
> which could be used in place of Reshard like
>
> p | beam.ReadFromGzipedFiles(...) | FusionBreak() | DoWork() ...
>
> You'll probably want to be sure to pass the use_fastavro experiment as
> well.
>
> On Wed, May 15, 2019 at 6:53 AM Niels Basjes  wrote:
> >
> > Hi
> >
> > This project is a completely different solution towards this problem,
> but in the hadoop mapreduce context.
> >
> > https://github.com/nielsbasjes/splittablegzip
> >
> >
> > I have used this a lot in the past.
> > Perhaps porting this project to beam is an option?
> >
> > Niels Basjes
> >
> >
> >
> > On Tue, May 14, 2019, 20:45 Lukasz Cwik  wrote:
> >>
> >> Sorry I couldn't be more helpful.
> >>
> >> From: Allie Chen 
> >> Date: Tue, May 14, 2019 at 10:09 AM
> >> To: 
> >> Cc: user
> >>
> >>> Thanks Lukasz. Unfortunately, decompressing the files is not an option
> for us.
> >>>
> >>>
> >>> I am trying to speed up Reshuffle step, since it waits for all data.
> Here are two ways I have tried:
> >>>
> >>> 1.  add timestamps to the PCollection's elements after reading (since
> it is a bounded source), then apply windowing before Reshuffle, but it still
> waits for all data.
> >>>
> >>>
> >>> 2.  run the pipeline with --streaming flag, but it leads to an error:
> Workflow failed. Causes: Expected custom source to have non-zero number of
> splits. Also, I found in
> https://beam.apache.org/documentation/sdks/python-streaming/#dataflowrunner-specific-features
> :
> >>>
> >>> DataflowRunner does not currently support the following Cloud Dataflow
> specific features with Python streaming execution.
> >>>
> >>> Streaming autoscaling
> >>>
> >>> I doubt whether this approach can solve my issue.
> >>>
> >>>
> >>> Thanks so much!
> >>>
> >>> Allie
> >>>
> >>>
> >>> From: Lukasz Cwik 
> >>> Date: Tue, May 14, 2019 at 11:16 AM
> >>> To: dev
> >>> Cc: user
> >>>
>  Do you need to perform any joins across the files (e.g.
> Combine.perKey/GroupByKey/...)?
>  If not, you could structure your pipeline
>  ReadFromFileA --> Reshuffle(optional) --> CopyOfPipelineA
>  ReadFromFileB --> Reshuffle(optional) --> CopyOfPipelineB
>  ReadFromFileC --> Reshuffle(optional) --> CopyOfPipelineC
>  and then run it as a batch pipeline.
> 
>  You can set --streaming=true on the pipeline and then it will run in
> a streaming mode but streaming prioritizes low latency and correctness on
> Google Cloud Dataflow so it will cost more to run your pipeline than in
> batch mode. It may make more sense to store the data uncompressed as it may
> be less expensive than paying the additional compute cost for streaming.
> 
>  From: Allie Chen 
>  Date: Tue, May 14, 2019 at 7:38 AM
>  To: 
>  Cc: user
> 
> > Is it possible to use windowing or somehow pretend it is streaming
> so Reshuffle or GroupByKey won't wait until all data has been read?
> >
> > Thanks!
> > Allie
> >
> > From: Lukasz Cwik 
> > Date: Fri, May 10, 2019 at 5:36 PM
> > To: dev
> > Cc: user
> >
> >> There is no such flag to turn off fusion.
> >>
> >> Writing 100s of GiBs of uncompressed data to reshuffle will take
> time when it is limited to a small number of workers.
> >>
> >> If you can split up your input into a lot of smaller files that are
> compressed then you shouldn't need to use the reshuffle but still could if
> you found it helped.
> >>
> >> On Fri, May 10, 2019 at 2:24 PM Allie Chen 
> wrote:
> >>>
> >>> Re Lukasz: Thanks! I am not able to control the compression format
> but I will see whether splitting the gzip files will work. Is there a
> simple flag in Dataflow that could turn off the fusion?
> >>>
> >>> Re Reuven: No, I checked the run time on Dataflow UI, the
> GroupByKey and FlatMap in Reshuffle are very slow 

Re: [VOTE] Remove deprecated Java Reference Runner code from repository.

2019-05-15 Thread Michael Luckey
+1

On Wed, May 15, 2019 at 2:17 PM Alex Van Boxel  wrote:

> +1
>
> (best commits are the ones that remove code :-)
>  _/
> _/ Alex Van Boxel
>
>
> On Wed, May 15, 2019 at 2:04 PM Manu Zhang 
> wrote:
>
>> +1
>>
>> On Wed, May 15, 2019 at 7:57 PM Maximilian Michels 
>> wrote:
>>
>>> +1
>>>
>>> On 15.05.19 13:19, Robert Bradshaw wrote:
>>> > +1 for removing the code given the current state of things.
>>> >
>>> > On Wed, May 15, 2019 at 12:32 AM Ruoyun Huang 
>>> wrote:
>>> >>
>>> >> +1
>>> >>
>>> >> From: Daniel Oliveira 
>>> >> Date: Tue, May 14, 2019 at 2:19 PM
>>> >> To: dev
>>> >>
>>> >>> Hello everyone,
>>> >>>
>>> >>> I'm calling for a vote on removing the deprecated Java Reference
>>> Runner code. The PR for the change has already been tested and reviewed:
>>> https://github.com/apache/beam/pull/8380
>>> >>>
>>> >>> [ ] +1, Approve merging the removal PR in its current state
>>> >>> [ ] -1, Veto the removal PR (please provide specific comments)
>>> >>>
>>> >>> The vote will be open for at least 72 hours. Since this is a vote on
>>> code modification, it is adopted if there are at least 3 PMC affirmative
>>> votes and no vetoes.
>>> >>>
>>> >>> For those who would like context on why the Java Reference Runner is
>>> being deprecated, the discussions took place in the following email threads:
>>> >>>
>>> >>> (8 Feb. 2019) Thoughts on a reference runner to invest in? -
>>> Decision to deprecate the Java Reference Runner and use the Python
>>> FnApiRunner for those use cases instead.
>>> >>> (14 Mar. 2019) Python PVR Reference post-commit tests failing -
>>> Removal of Reference Runner Post-Commits from Jenkins, and discussion on
>>> removal of code.
>>> >>> (25 Apr. 2019) Removing Java Reference Runner code - Discussion
>>> thread before this formal vote.
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> 
>>> >> Ruoyun  Huang
>>> >>
>>>
>>


Re: pickling typing types in Python 3.5+

2019-05-15 Thread Robert Bradshaw
(2) seems reasonable.

On Tue, May 14, 2019 at 3:15 AM Udi Meiri  wrote:
>
> It seems like pickling of typing types is broken in 3.5 and 3.6, fixed in 3.7:
> https://github.com/python/typing/issues/511
>
> Here are my attempts:
> https://gist.github.com/udim/ec213305ca865390c391001e8778e91d
>
>
> My ideas:
> 1. I know that we override type object handling in pickler.py 
> (_nested_type_wrapper), and perhaps this mechanism can be used to pickle 
> typing classes correctly. The question is how.
>
> 2. Exclude/stub out these classes when pickling a pipeline - they are only 
> used for verification during pipeline construction anyway. This could be a 
> temporary solution for versions 3.5 and 3.6.
>
> Any ideas / opinions?
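
Idea 2 can be sketched with plain pickle: since the typing hints are only consulted at pipeline-construction time, replacing them with a picklable placeholder before serialization sidesteps the 3.5/3.6 bug. The helper names below are illustrative, not Beam's actual pickler hooks:

```python
import pickle
import typing


def _is_typing(t):
    # Heuristic: typing constructs (List[int], Dict[str, int], ...) report
    # __module__ == "typing"; plain classes like int do not.
    return getattr(t, "__module__", None) == "typing"


def stub_type_hints(hints):
    # Replace unpicklable typing constructs with a placeholder. The hints are
    # only needed for verification during pipeline construction, so losing
    # them in the serialized pipeline is acceptable (idea 2 above).
    return {name: object if _is_typing(t) else t for name, t in hints.items()}


hints = {"x": typing.List[int], "y": int}
stubbed = stub_type_hints(hints)
restored = pickle.loads(pickle.dumps(stubbed))
print(restored)  # {'x': <class 'object'>, 'y': <class 'int'>}
```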


Re: [VOTE] Remove deprecated Java Reference Runner code from repository.

2019-05-15 Thread Alex Van Boxel
+1

(best commits are the ones that remove code :-)
 _/
_/ Alex Van Boxel


On Wed, May 15, 2019 at 2:04 PM Manu Zhang  wrote:

> +1
>
> On Wed, May 15, 2019 at 7:57 PM Maximilian Michels  wrote:
>
>> +1
>>
>> On 15.05.19 13:19, Robert Bradshaw wrote:
>> > +1 for removing the code given the current state of things.
>> >
>> > On Wed, May 15, 2019 at 12:32 AM Ruoyun Huang 
>> wrote:
>> >>
>> >> +1
>> >>
>> >> From: Daniel Oliveira 
>> >> Date: Tue, May 14, 2019 at 2:19 PM
>> >> To: dev
>> >>
>> >>> Hello everyone,
>> >>>
>> >>> I'm calling for a vote on removing the deprecated Java Reference
>> Runner code. The PR for the change has already been tested and reviewed:
>> https://github.com/apache/beam/pull/8380
>> >>>
>> >>> [ ] +1, Approve merging the removal PR in its current state
>> >>> [ ] -1, Veto the removal PR (please provide specific comments)
>> >>>
>> >>> The vote will be open for at least 72 hours. Since this is a vote on
>> code modification, it is adopted if there are at least 3 PMC affirmative
>> votes and no vetoes.
>> >>>
>> >>> For those who would like context on why the Java Reference Runner is
>> being deprecated, the discussions took place in the following email threads:
>> >>>
>> >>> (8 Feb. 2019) Thoughts on a reference runner to invest in? - Decision
>> to deprecate the Java Reference Runner and use the Python FnApiRunner for
>> those use cases instead.
>> >>> (14 Mar. 2019) Python PVR Reference post-commit tests failing -
>> Removal of Reference Runner Post-Commits from Jenkins, and discussion on
>> removal of code.
>> >>> (25 Apr. 2019) Removing Java Reference Runner code - Discussion
>> thread before this formal vote.
>> >>
>> >>
>> >>
>> >> --
>> >> 
>> >> Ruoyun  Huang
>> >>
>>
>


Re: Developing a new beam runner for Twister2

2019-05-15 Thread Maximilian Michels
+1 Portability is the way forward. If you have to choose between the 
two, go for the portable one. For educational purposes, I'd still 
suggest checking out the "legacy" Runners. Actually, a new Runner could 
implement both Runner styles with most of the code shared between the two.


-Max

On 15.05.19 11:47, Robert Bradshaw wrote:

I would strongly suggest new runners adapt the portability runner from
the start, which will be more forward compatible and more flexible
(e.g. supporting other languages). The primary difference is that
rather than wrapping individual DoFns, one wraps a "fused" bundle of
DoFns (called an ExecutableStage). As it looks like Twister2 is
written in Java, you can take advantage of much of the existing Java
libraries that already do this that are shared among the other Java
runners.
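
The "fused bundle" idea is simple to illustrate outside of Beam: instead of materializing intermediate collections between DoFns, the runner composes them into one stage and pushes each element through the whole chain. A conceptual sketch only; ExecutableStage itself is considerably more involved:

```python
def fuse(*dofns):
    """Compose a chain of element -> iterable-of-elements functions into a
    single fused stage, roughly as a portable runner does when it wraps a
    fused bundle of DoFns instead of each DoFn individually."""
    def stage(element):
        batch = [element]
        for fn in dofns:
            # Flatten each DoFn's outputs into the input of the next one,
            # with no materialization between steps.
            batch = [out for e in batch for out in fn(e)]
        return batch
    return stage


# Two toy "DoFns": add one, then emit the value and its double.
stage = fuse(lambda x: [x + 1], lambda x: [x, x * 2])
print(stage(3))  # [4, 8]
```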

On Tue, May 14, 2019 at 7:55 PM Pulasthi Supun Wickramasinghe
 wrote:


Hi,

Thanks Kenn and Max for the information. Will read up a little more and discuss 
with the Twister2 team before deciding on which route to take. I also created 
an issue in the Beam JIRA [1], but I cannot assign it to myself; would someone be 
able to assign the issue to me? Thanks in advance.

[1] https://issues.apache.org/jira/browse/BEAM-7304

Best Regards
Pulasthi

On Tue, May 14, 2019 at 6:19 AM Maximilian Michels  wrote:


Hi Pulasthi,

Great to hear you're planning to implement a Twister2 Runner.

If you have limited time, you probably want to decide whether to build a
"legacy" Java Runner or a portable one. They are not fundamentally
different but there are some tricky implementation details for the
portable Runner related to the asynchronous communication with the SDK
Harness.

If you have enough time, first implementing a "legacy" Runner might be a
good way to learn the Beam model and subsequently creating a portable
Runner should not be hard then.

To get an idea of the differences, check out the Flink source code:
- FlinkStreamingTransformTranslators (Java "legacy")
- FlinkStreamingPortablePipelineTranslator (portable)

Feel free to ask questions here or on Slack.

Cheers,
Max

On 14.05.19 05:11, Kenneth Knowles wrote:

Welcome! This is very cool to hear about.

A major caveat about https://beam.apache.org/contribute/runner-guide/ is
that it was written when Beam's portability framework was more of a
sketch. The conceptual descriptions are mostly fine, but the pointers to
Java helper code will lead you to build a "legacy" runner when it is
better to build a portable runner from the start*.

We now have four portable runners in various levels of completeness:
Spark, Flink, Samza, and Dataflow. I have added some relevant people to
the CC for emphasis. You might also join
https://the-asf.slack.com/#beam-portability though I prefer the dev list
since it gives visibility to a much greater portion of the community.

Kenn

*volunteers welcome to update the guide to emphasize portability first

*From: *Pulasthi Supun Wickramasinghe 
*Date: *Mon, May 13, 2019 at 11:03 AM
*To: *dev@beam.apache.org

 Hi All,

 I am Pulasthi a Ph.D. student at Indiana University. We are planning
 to develop a beam runner for our project Twister2 [1] [2]. Twister2
 is a big data framework which supports both batch and stream
 processing. If you are interested you can find more information on
 [2] or read some of our publications [3]

 I wanted to share our intent and get some guidance from the beam
 developer community before starting on the project. I was planning
 on going through the code for Apache Spark and Apache Flink runners
 to get a better understanding of what I need to do. It would be
 great if I can get any pointers on how I should approach this
 project. I am currently reading through the runner-guide.

 Finally, I assume that I need to create a JIRA issue to track the
 progress of this project, right? I can create the issue, but from
 what I read in the contribute section I would need some permission
 to assign it to myself; I hope someone would be able to help me
 with that. Looking forward to working with the Beam community.

 [1] https://github.com/DSC-SPIDAL/twister2
 [2] https://twister2.gitbook.io/twister2/
 [3] https://twister2.gitbook.io/twister2/publications

 Best Regards,
 Pulasthi
 --
 Pulasthi S. Wickramasinghe
 PhD Candidate  | Research Assistant
 School of Informatics and Computing | Digital Science Center
 Indiana University, Bloomington
 cell: 224-386-9035





--
Pulasthi S. Wickramasinghe
PhD Candidate  | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
cell: 224-386-9035


Re: [VOTE] Remove deprecated Java Reference Runner code from repository.

2019-05-15 Thread Manu Zhang
+1

On Wed, May 15, 2019 at 7:57 PM Maximilian Michels  wrote:

> +1
>
> On 15.05.19 13:19, Robert Bradshaw wrote:
> > +1 for removing the code given the current state of things.
> >
> > On Wed, May 15, 2019 at 12:32 AM Ruoyun Huang  wrote:
> >>
> >> +1
> >>
> >> From: Daniel Oliveira 
> >> Date: Tue, May 14, 2019 at 2:19 PM
> >> To: dev
> >>
> >>> Hello everyone,
> >>>
> >>> I'm calling for a vote on removing the deprecated Java Reference
> Runner code. The PR for the change has already been tested and reviewed:
> https://github.com/apache/beam/pull/8380
> >>>
> >>> [ ] +1, Approve merging the removal PR in its current state
> >>> [ ] -1, Veto the removal PR (please provide specific comments)
> >>>
> >>> The vote will be open for at least 72 hours. Since this is a vote on
> code modification, it is adopted if there are at least 3 PMC affirmative
> votes and no vetoes.
> >>>
> >>> For those who would like context on why the Java Reference Runner is
> being deprecated, the discussions took place in the following email threads:
> >>>
> >>> (8 Feb. 2019) Thoughts on a reference runner to invest in? - Decision
> to deprecate the Java Reference Runner and use the Python FnApiRunner for
> those use cases instead.
> >>> (14 Mar. 2019) Python PVR Reference post-commit tests failing -
> Removal of Reference Runner Post-Commits from Jenkins, and discussion on
> removal of code.
> >>> (25 Apr. 2019) Removing Java Reference Runner code - Discussion thread
> before this formal vote.
> >>
> >>
> >>
> >> --
> >> 
> >> Ruoyun  Huang
> >>
>


Re: [VOTE] Remove deprecated Java Reference Runner code from repository.

2019-05-15 Thread Maximilian Michels

+1

On 15.05.19 13:19, Robert Bradshaw wrote:

+1 for removing the code given the current state of things.

On Wed, May 15, 2019 at 12:32 AM Ruoyun Huang  wrote:


+1

From: Daniel Oliveira 
Date: Tue, May 14, 2019 at 2:19 PM
To: dev


Hello everyone,

I'm calling for a vote on removing the deprecated Java Reference Runner code. 
The PR for the change has already been tested and reviewed: 
https://github.com/apache/beam/pull/8380

[ ] +1, Approve merging the removal PR in its current state
[ ] -1, Veto the removal PR (please provide specific comments)

The vote will be open for at least 72 hours. Since this is a vote on 
code modification, it is adopted if there are at least 3 PMC affirmative votes 
and no vetoes.

For those who would like context on why the Java Reference Runner is being 
deprecated, the discussions took place in the following email threads:

(8 Feb. 2019) Thoughts on a reference runner to invest in? - Decision to 
deprecate the Java Reference Runner and use the Python FnApiRunner for those 
use cases instead.
(14 Mar. 2019) Python PVR Reference post-commit tests failing - Removal of 
Reference Runner Post-Commits from Jenkins, and discussion on removal of code.
(25 Apr. 2019) Removing Java Reference Runner code - Discussion thread before 
this formal vote.




--

Ruoyun  Huang



Re: [VOTE] Remove deprecated Java Reference Runner code from repository.

2019-05-15 Thread Robert Bradshaw
+1 for removing the code given the current state of things.

On Wed, May 15, 2019 at 12:32 AM Ruoyun Huang  wrote:
>
> +1
>
> From: Daniel Oliveira 
> Date: Tue, May 14, 2019 at 2:19 PM
> To: dev
>
>> Hello everyone,
>>
>> I'm calling for a vote on removing the deprecated Java Reference Runner 
>> code. The PR for the change has already been tested and reviewed: 
>> https://github.com/apache/beam/pull/8380
>>
>> [ ] +1, Approve merging the removal PR in its current state
>> [ ] -1, Veto the removal PR (please provide specific comments)
>>
>> The vote will be open for at least 72 hours. Since this is a vote on 
>> code modification, it is adopted if there are at least 3 PMC affirmative 
>> votes and no vetoes.
>>
>> For those who would like context on why the Java Reference Runner is being 
>> deprecated, the discussions took place in the following email threads:
>>
>> (8 Feb. 2019) Thoughts on a reference runner to invest in? - Decision to 
>> deprecate the Java Reference Runner and use the Python FnApiRunner for those 
>> use cases instead.
>> (14 Mar. 2019) Python PVR Reference post-commit tests failing - Removal of 
>> Reference Runner Post-Commits from Jenkins, and discussion on removal of 
>> code.
>> (25 Apr. 2019) Removing Java Reference Runner code - Discussion thread 
>> before this formal vote.
>
>
>
> --
> 
> Ruoyun  Huang
>


Re: Problem with gzip

2019-05-15 Thread Robert Bradshaw
Unfortunately the "write" portion of the reshuffle cannot be
parallelized more than the source that it's reading from. In my
experience, generally the read is the bottleneck in this case, but
it's possible (e.g. if the input compresses extremely well) that it is
the write that is slow (which you seem to indicate based on your
observation of the UI, right?).

It could be that materializing to temporary files is cheaper than
materializing randomly to shuffle (especially on pre-portable Python).
In that case you could force a fusion break with a side input instead.
E.g.

class FusionBreak(beam.PTransform):
    def expand(self, pcoll):
        # Create an empty PCollection that depends on pcoll.
        empty = pcoll | beam.FlatMap(lambda x: ())
        # Use this empty PCollection as a side input, which will force a fusion break.
        return pcoll | beam.Map(lambda x, unused: x,
                                beam.pvalue.AsIterable(empty))

which could be used in place of Reshard like

p | beam.ReadFromGzipedFiles(...) | FusionBreak() | DoWork() ...

You'll probably want to be sure to pass the use_fastavro experiment as well.

On Wed, May 15, 2019 at 6:53 AM Niels Basjes  wrote:
>
> Hi
>
> This project is a completely different solution towards this problem, but in 
> the hadoop mapreduce context.
>
> https://github.com/nielsbasjes/splittablegzip
>
>
> I have used this a lot in the past.
> Perhaps porting this project to beam is an option?
>
> Niels Basjes
>
>
>
> On Tue, May 14, 2019, 20:45 Lukasz Cwik  wrote:
>>
>> Sorry I couldn't be more helpful.
>>
>> From: Allie Chen 
>> Date: Tue, May 14, 2019 at 10:09 AM
>> To: 
>> Cc: user
>>
>>> Thanks Lukasz. Unfortunately, decompressing the files is not an option for 
>>> us.
>>>
>>>
>>> I am trying to speed up Reshuffle step, since it waits for all data. Here 
>>> are two ways I have tried:
>>>
>>> 1.  add timestamps to the PCollection's elements after reading (since it is 
>>> a bounded source), then apply windowing before Reshuffle, but it still waits 
>>> for all data.
>>>
>>>
>>> 2.  run the pipeline with --streaming flag, but it leads to an error: 
>>> Workflow failed. Causes: Expected custom source to have non-zero number of 
>>> splits. Also, I found in 
>>> https://beam.apache.org/documentation/sdks/python-streaming/#dataflowrunner-specific-features:
>>>
>>> DataflowRunner does not currently support the following Cloud Dataflow 
>>> specific features with Python streaming execution.
>>>
>>> Streaming autoscaling
>>>
>>> I doubt whether this approach can solve my issue.
>>>
>>>
>>> Thanks so much!
>>>
>>> Allie
>>>
>>>
>>> From: Lukasz Cwik 
>>> Date: Tue, May 14, 2019 at 11:16 AM
>>> To: dev
>>> Cc: user
>>>
 Do you need to perform any joins across the files (e.g. 
 Combine.perKey/GroupByKey/...)?
 If not, you could structure your pipeline
 ReadFromFileA --> Reshuffle(optional) --> CopyOfPipelineA
 ReadFromFileB --> Reshuffle(optional) --> CopyOfPipelineB
 ReadFromFileC --> Reshuffle(optional) --> CopyOfPipelineC
 and then run it as a batch pipeline.

 You can set --streaming=true on the pipeline and then it will run in a 
 streaming mode but streaming prioritizes low latency and correctness on 
>>  Google Cloud Dataflow so it will cost more to run your pipeline than in 
>>  batch mode. It may make more sense to store the data uncompressed as it 
>>  may be less expensive than paying the additional compute cost for 
 streaming.

 From: Allie Chen 
 Date: Tue, May 14, 2019 at 7:38 AM
 To: 
 Cc: user

> Is it possible to use windowing or somehow pretend it is streaming so 
> Reshuffle or GroupByKey won't wait until all data has been read?
>
> Thanks!
> Allie
>
> From: Lukasz Cwik 
> Date: Fri, May 10, 2019 at 5:36 PM
> To: dev
> Cc: user
>
>> There is no such flag to turn off fusion.
>>
>> Writing 100s of GiBs of uncompressed data to reshuffle will take time 
>> when it is limited to a small number of workers.
>>
>> If you can split up your input into a lot of smaller files that are 
>> compressed then you shouldn't need to use the reshuffle but still could 
>> if you found it helped.
>>
>> On Fri, May 10, 2019 at 2:24 PM Allie Chen  wrote:
>>>
>>> Re Lukasz: Thanks! I am not able to control the compression format but 
>>> I will see whether splitting the gzip files will work. Is there a 
>>> simple flag in Dataflow that could turn off the fusion?
>>>
>>> Re Reuven: No, I checked the run time on Dataflow UI, the GroupByKey 
>>> and FlatMap in Reshuffle are very slow when the data is large. 
>>> Reshuffle itself is not parallel either.
>>>
>>> Thanks all,
>>>
>>> Allie
>>>
>>> From: Reuven Lax 
>>> Date: Fri, May 10, 2019 at 5:02 PM
>>> To: dev
>>> Cc: user
>>>
 It's unlikely that Reshuffle itself takes hours. It's more likely that 
 simply reading 

Re: SqlTransform Metadata

2019-05-15 Thread Robert Bradshaw
Isn't there an API for concisely computing new fields from old ones?
Perhaps these expressions could contain references to metadata values
such as the timestamp. Otherwise,

Rather than withMetadata reifying the value as a nested field, with
the timestamp, window, etc. at the top level, one could let it take a
field name argument that attaches all the metadata as an extra
(struct-like) field. This would be like attachX, but without having to
have a separate method for every X.

It seems restrictive to only consider this as a special mode for
SqlTransform rather than a more generic operation. (For SQL, my first
instinct would be to just make this a special function like
element_timestamp(), but there is some ambiguity there when there are
multiple tables in the expression.)

On Wed, May 15, 2019 at 5:03 AM Reza Rokni  wrote:
>
> Hi,
>
> One use case would be when dealing with the windowing functions for example:
>
> SELECT f_int, COUNT(*) , TUMBLE_START(f_timestamp, INTERVAL '1' HOUR) 
> tumble_start
>   FROM PCOLLECTION
>   GROUP BY
> f_int,
> TUMBLE(f_timestamp, INTERVAL '1' HOUR)
>
> For an element whose event time comes from metadata rather than from data 
> within the element itself, I would need to create a new schema which adds 
> the timestamp as a field. Other examples which may be interesting include 
> getting the value of the row with the max/min timestamp. None of this would 
> be difficult, but it does feel a little on the verbose side and also makes 
> the pipeline a little harder to read.
>
> Cheers
> Reza
>
>
>
>
>
> From: Kenneth Knowles 
> Date: Wed, 15 May 2019 at 01:15
> To: dev
>
>> We have support for nested rows so this should be easy. The .withMetadata 
>> would reify the struct, moving from Row to WindowedValue if I 
>> understand it...
>>
>> SqlTransform.query("SELECT field1 from PCOLLECTION"):
>>
>> Schema = {
>>   field1: type1,
>>   field2: type2
>> }
>>
>> SqlTransform.query(...)
>>
>> SqlTransform.withMetadata().query("SELECT event_timestamp, value.field1 FROM 
>> PCOLLECTION")
>>
>> Derived schema = {
>>   event_timestamp: TIMESTAMP,
>>   pane_info: { ... }
>>   value: {
>> field1: type1,
>> field2: type2,
>> ...
>>   }
>> }
>>
>> SqlTransform would expand into a different composite, and it would be a 
>> straightforward ParDo to adjust the data, possibly automatic via the new 
>> schema conversions.
>>
>> Embedding the window would be a bit wonky, something like { end_of_window: 
>> TIMESTAMP, encoded_window: bytes } which would be expensive due to encoding. 
>> But timestamp and pane info not so bad.
>>
>> Kenn
>>
>> From: Anton Kedin 
>> Date: Tue, May 14, 2019 at 9:17 AM
>> To: 
>>
>>> Reza, can you share more thoughts on how you think this can work end-to-end?
>>>
>>> Currently the approach is that populating the rows with the data happens 
>>> before the SqlTransform, and within the query you can only use the things 
>>> that are already in the rows or in the catalog/schema (or built-in things). 
>>> In general case populating the rows with any data can be solved via a ParDo 
>>> before SqlTransform. Do you think this approach lacks something or maybe 
>>> too verbose?
>>>
>>> My thoughts on this, lacking more info or concrete examples: in order to 
>>> access a timestamp value from within a query there has to be a syntax for 
>>> it. Field access expressions or function calls are the only things that 
>>> come to mind among existing syntax features that would allow that. Making 
>>> timestamp a field of the data row makes more sense to me here because in 
>>> Beam it is already a part of the element. It's not a result of a function 
>>> call and it's already easily accessible, doesn't make sense to build extra 
>>> functions here. One of the problems with both approaches however is the 
>>> potential conflicts with the existing schema of the data elements (or the 
>>> schema/catalog of the data source in general). E.g. if we add a magical 
>>> "event_timestamp" column or "event_timestamp()" function there may 
>>> potentially already exist a field or a function in the schema with this 
>>> name. This can be solved in a couple of ways, but we will probably want to 
>>> provide a configuration mechanism to assign a different field/function 
>>> names in case of conflicts.
>>>
>>> Given that, it may make sense to allow users to attach the whole pane info 
>>> or some subset of it to the row (e.g. only the timestamp), and make that 
>>> configurable. However I am not sure whether exposing something like pane 
>>> info is enough and will cover a lot of useful cases. Plus adding methods 
>>> like `attachTimestamp("fieldname")` or `attachWindowInfo("fieldname")` 
>>> might open a portal to ever-increasing collection of these `attachX()`, 
>>> `attachY()` that can make the API less usable. If on the other hand we 
>>> would make it more generic then it will probably have to look a lot 

Re: Developing a new beam runner for Twister2

2019-05-15 Thread Robert Bradshaw
I would strongly suggest new runners adapt the portability runner from
the start, which will be more forward compatible and more flexible
(e.g. supporting other languages). The primary difference is that
rather than wrapping individual DoFns, one wraps a "fused" bundle of
DoFns (called an ExecutableStage). As it looks like Twister2 is
written in Java, you can take advantage of much of the existing Java
libraries that already do this that are shared among the other Java
runners.

On Tue, May 14, 2019 at 7:55 PM Pulasthi Supun Wickramasinghe
 wrote:
>
> Hi,
>
> Thanks Kenn and Max for the information. Will read up a little more and 
> discuss with the Twister2 team before deciding on which route to take. I also 
> created an issue in the Beam JIRA [1], but I cannot assign it to myself; would 
> someone be able to assign the issue to me? Thanks in advance.
>
> [1] https://issues.apache.org/jira/browse/BEAM-7304
>
> Best Regards
> Pulasthi
>
> On Tue, May 14, 2019 at 6:19 AM Maximilian Michels  wrote:
>>
>> Hi Pulasthi,
>>
>> Great to hear you're planning to implement a Twister2 Runner.
>>
>> If you have limited time, you probably want to decide whether to build a
>> "legacy" Java Runner or a portable one. They are not fundamentally
>> different but there are some tricky implementation details for the
>> portable Runner related to the asynchronous communication with the SDK
>> Harness.
>>
>> If you have enough time, first implementing a "legacy" Runner might be a
>> good way to learn the Beam model and subsequently creating a portable
>> Runner should not be hard then.
>>
>> To get an idea of the differences, check out the Flink source code:
>> - FlinkStreamingTransformTranslators (Java "legacy")
>> - FlinkStreamingPortablePipelineTranslator (portable)
>>
>> Feel free to ask questions here or on Slack.
>>
>> Cheers,
>> Max
>>
>> On 14.05.19 05:11, Kenneth Knowles wrote:
>> > Welcome! This is very cool to hear about.
>> >
>> > A major caveat about https://beam.apache.org/contribute/runner-guide/ is
>> > that it was written when Beam's portability framework was more of a
>> > sketch. The conceptual descriptions are mostly fine, but the pointers to
>> > Java helper code will lead you to build a "legacy" runner when it is
>> > better to build a portable runner from the start*.
>> >
>> > We now have four portable runners in various levels of completeness:
>> > Spark, Flink, Samza, and Dataflow. I have added some relevant people to
>> > the CC for emphasis. You might also join
>> > https://the-asf.slack.com/#beam-portability though I prefer the dev list
>> > since it gives visibility to a much greater portion of the community.
>> >
>> > Kenn
>> >
>> > *volunteers welcome to update the guide to emphasize portability first
>> >
>> > *From: *Pulasthi Supun Wickramasinghe 
>> > *Date: *Mon, May 13, 2019 at 11:03 AM
>> > *To: *dev@beam.apache.org
>> >
>> > Hi All,
>> >
>> > I am Pulasthi a Ph.D. student at Indiana University. We are planning
>> > to develop a beam runner for our project Twister2 [1] [2]. Twister2
>> > is a big data framework which supports both batch and stream
>> > processing. If you are interested you can find more information on
>> > [2] or read some of our publications [3]
>> >
>> > I wanted to share our intent and get some guidance from the beam
>> > developer community before starting on the project. I was planning
>> > on going through the code for Apache Spark and Apache Flink runners
>> > to get a better understanding of what I need to do. It would be
>> > great if I can get any pointers on how I should approach this
>> > project. I am currently reading through the runner-guide.
>> >
>> > Finally, I assume that I need to create a JIRA issue to track the
>> > progress of this project, right? I can create the issue, but from
>> > what I read in the contribute section I would need some permission
>> > to assign it to myself; I hope someone would be able to help me
>> > with that. Looking forward to working with the Beam community.
>> >
>> > [1] https://github.com/DSC-SPIDAL/twister2
>> > [2] https://twister2.gitbook.io/twister2/
>> > [3] https://twister2.gitbook.io/twister2/publications
>> >
>> > Best Regards,
>> > Pulasthi
>> > --
>> > Pulasthi S. Wickramasinghe
>> > PhD Candidate  | Research Assistant
>> > School of Informatics and Computing | Digital Science Center
>> > Indiana University, Bloomington
>> > cell: 224-386-9035
>> >
>
>
>
> --
> Pulasthi S. Wickramasinghe
> PhD Candidate  | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> cell: 224-386-9035


Re: [ANNOUNCE] New PMC Member: Pablo Estrada

2019-05-15 Thread Maximilian Michels

Congrats Pablo! Thank you for your help to grow the Beam community!

On 15.05.19 10:33, Tim Robertson wrote:

Congratulations Pablo

On Wed, May 15, 2019 at 10:22 AM Ismaël Mejía  wrote:


Congrats Pablo, well deserved, nice to see your work recognized!

On Wed, May 15, 2019 at 9:59 AM Pei HE  wrote:
 >
 > Congrats, Pablo!
 >
 > On Tue, May 14, 2019 at 11:41 PM Tanay Tummalapalli
 >  wrote:
 > >
 > > Congratulations Pablo!
 > >
 > > On Wed, May 15, 2019, 12:08 Michael Luckey  wrote:
 > >>
 > >> Congrats, Pablo!
 > >>
 > >> On Wed, May 15, 2019 at 8:21 AM Connell O'Callaghan
 > >>  wrote:
 > >>>
 > >>> Awesome well done Pablo!!!
 > >>>
 > >>> Kenn thank you for sharing this great news with us!!!
 > >>>
 > >>> On Tue, May 14, 2019 at 11:01 PM Ahmet Altay  wrote:
 > 
 >  Congratulations!
 > 
 >  On Tue, May 14, 2019 at 9:11 PM Robert Burke  wrote:
 > >
 > > Woohoo! Well deserved.
 > >
 > > On Tue, May 14, 2019, 8:34 PM Reuven Lax  wrote:
 > >>
 > >> Congratulations!
 > >>
 > >> From: Mikhail Gryzykhin 
 > >> Date: Tue, May 14, 2019 at 8:32 PM
 > >> To: dev@beam.apache.org
 > >>
 > >>> Congratulations Pablo!
 > >>>
 > >>> On Tue, May 14, 2019, 20:25 Kenneth Knowles  wrote:
 > 
 >  Hi all,
 > 
 >  Please join me and the rest of the Beam PMC in welcoming
Pablo Estrada to join the PMC.
 > 
 >  Pablo first picked up BEAM-722 in October of 2016 and
has been a steady part of the Beam community since then. In addition
to technical work on Beam Python & Java & runners, I would highlight
how Pablo grows Beam's community by helping users, working on GSoC,
giving talks at Beam Summits and other OSS conferences including
Flink Forward, and holding training workshops. I cannot do justice
to Pablo's contributions in a single paragraph.
 > 
 >  Thanks Pablo, for being a part of Beam.
 > 
 >  Kenn



Re: [ANNOUNCE] New PMC Member: Pablo Estrada

2019-05-15 Thread Robert Bradshaw
Congratulations, Pablo!

From: Ismaël Mejía 
Date: Wed, May 15, 2019 at 10:22 AM
To: 

> Congrats Pablo, well deserved, nice to see your work recognized!
>
> On Wed, May 15, 2019 at 9:59 AM Pei HE  wrote:
> >
> > Congrats, Pablo!
> >
> > On Tue, May 14, 2019 at 11:41 PM Tanay Tummalapalli
> >  wrote:
> > >
> > > Congratulations Pablo!
> > >
> > > On Wed, May 15, 2019, 12:08 Michael Luckey  wrote:
> > >>
> > >> Congrats, Pablo!
> > >>
> > >> On Wed, May 15, 2019 at 8:21 AM Connell O'Callaghan 
> > >>  wrote:
> > >>>
> > >>> Awesome well done Pablo!!!
> > >>>
> > >>> Kenn thank you for sharing this great news with us!!!
> > >>>
> > >>> On Tue, May 14, 2019 at 11:01 PM Ahmet Altay  wrote:
> > 
> >  Congratulations!
> > 
> >  On Tue, May 14, 2019 at 9:11 PM Robert Burke  
> >  wrote:
> > >
> > > Woohoo! Well deserved.
> > >
> > > On Tue, May 14, 2019, 8:34 PM Reuven Lax  wrote:
> > >>
> > >> Congratulations!
> > >>
> > >> From: Mikhail Gryzykhin 
> > >> Date: Tue, May 14, 2019 at 8:32 PM
> > >> To: 
> > >>
> > >>> Congratulations Pablo!
> > >>>
> > >>> On Tue, May 14, 2019, 20:25 Kenneth Knowles  wrote:
> > 
> >  Hi all,
> > 
> >  Please join me and the rest of the Beam PMC in welcoming Pablo 
> >  Estrada to join the PMC.
> > 
> >  Pablo first picked up BEAM-722 in October of 2016 and has been a 
> >  steady part of the Beam community since then. In addition to 
> >  technical work on Beam Python & Java & runners, I would highlight 
> >  how Pablo grows Beam's community by helping users, working on 
> >  GSoC, giving talks at Beam Summits and other OSS conferences 
> >  including Flink Forward, and holding training workshops. I cannot 
> >  do justice to Pablo's contributions in a single paragraph.
> > 
> >  Thanks Pablo, for being a part of Beam.
> > 
> >  Kenn


Re: [ANNOUNCE] New PMC Member: Pablo Estrada

2019-05-15 Thread Ismaël Mejía
Congrats Pablo, well deserved, nice to see your work recognized!

On Wed, May 15, 2019 at 9:59 AM Pei HE  wrote:
>
> Congrats, Pablo!
>
> On Tue, May 14, 2019 at 11:41 PM Tanay Tummalapalli
>  wrote:
> >
> > Congratulations Pablo!
> >
> > On Wed, May 15, 2019, 12:08 Michael Luckey  wrote:
> >>
> >> Congrats, Pablo!
> >>
> >> On Wed, May 15, 2019 at 8:21 AM Connell O'Callaghan  
> >> wrote:
> >>>
> >>> Awesome well done Pablo!!!
> >>>
> >>> Kenn thank you for sharing this great news with us!!!
> >>>
> >>> On Tue, May 14, 2019 at 11:01 PM Ahmet Altay  wrote:
> 
>  Congratulations!
> 
>  On Tue, May 14, 2019 at 9:11 PM Robert Burke  wrote:
> >
> > Woohoo! Well deserved.
> >
> > On Tue, May 14, 2019, 8:34 PM Reuven Lax  wrote:
> >>
> >> Congratulations!
> >>
> >> From: Mikhail Gryzykhin 
> >> Date: Tue, May 14, 2019 at 8:32 PM
> >> To: 
> >>
> >>> Congratulations Pablo!
> >>>
> >>> On Tue, May 14, 2019, 20:25 Kenneth Knowles  wrote:
> 
>  Hi all,
> 
>  Please join me and the rest of the Beam PMC in welcoming Pablo 
>  Estrada to join the PMC.
> 
>  Pablo first picked up BEAM-722 in October of 2016 and has been a 
>  steady part of the Beam community since then. In addition to 
>  technical work on Beam Python & Java & runners, I would highlight 
>  how Pablo grows Beam's community by helping users, working on GSoC, 
>  giving talks at Beam Summits and other OSS conferences including 
>  Flink Forward, and holding training workshops. I cannot do justice 
>  to Pablo's contributions in a single paragraph.
> 
>  Thanks Pablo, for being a part of Beam.
> 
>  Kenn


Re: [ANNOUNCE] New PMC Member: Pablo Estrada

2019-05-15 Thread Pei HE
Congrats, Pablo!

On Tue, May 14, 2019 at 11:41 PM Tanay Tummalapalli
 wrote:
>
> Congratulations Pablo!
>
> On Wed, May 15, 2019, 12:08 Michael Luckey  wrote:
>>
>> Congrats, Pablo!
>>
>> On Wed, May 15, 2019 at 8:21 AM Connell O'Callaghan  
>> wrote:
>>>
>>> Awesome well done Pablo!!!
>>>
>>> Kenn thank you for sharing this great news with us!!!
>>>
>>> On Tue, May 14, 2019 at 11:01 PM Ahmet Altay  wrote:

 Congratulations!

 On Tue, May 14, 2019 at 9:11 PM Robert Burke  wrote:
>
> Woohoo! Well deserved.
>
> On Tue, May 14, 2019, 8:34 PM Reuven Lax  wrote:
>>
>> Congratulations!
>>
>> From: Mikhail Gryzykhin 
>> Date: Tue, May 14, 2019 at 8:32 PM
>> To: 
>>
>>> Congratulations Pablo!
>>>
>>> On Tue, May 14, 2019, 20:25 Kenneth Knowles  wrote:

 Hi all,

 Please join me and the rest of the Beam PMC in welcoming Pablo Estrada 
 to join the PMC.

 Pablo first picked up BEAM-722 in October of 2016 and has been a 
 steady part of the Beam community since then. In addition to technical 
 work on Beam Python & Java & runners, I would highlight how Pablo 
 grows Beam's community by helping users, working on GSoC, giving talks 
 at Beam Summits and other OSS conferences including Flink Forward, and 
 holding training workshops. I cannot do justice to Pablo's 
 contributions in a single paragraph.

 Thanks Pablo, for being a part of Beam.

 Kenn


Dealing with incompatible changes in build system on LTS releases

2019-05-15 Thread Michael Luckey
Hi,

do we currently have a strategy for handling LTS releases in the context of
incompatible changes on the build system?

As far as I can see, the problem is (at least) twofold.

1. Incompatible changes on test-infra job definitions

There might be changes in our groovy files which make it impossible to
build/test an old branch on Jenkins. How do we intend to handle this? Of
course, in such cases we could run the seed job and reset Jenkins to the
corresponding old state, but this will impact or even stall development on
master.

2. Incompatible changes on agents

Even worse, we might introduce changes on the agents themselves, which will
even render it impossible to successfully seed to that legacy state. Do we
have any option to revert to an old Jenkins agent setup in such cases? I am
currently unaware of a link from the Apache repo to Jenkins configuration state
to enable restoration of (old) agents. Is there such a thing?

Would it be possible/helpful to subdivide our Jenkins agent pool in some
way that seed job could be run only on a dedicated subgroup (which then
could be set to an old state)? If I recall correctly Yifan put a lot of
effort into migrating our agents to the newer JNLP approach
and used a 'private' agent to do the required testing. I assume this has been a
manual setup and is not automated to be useful in such cases?

What do others think about this issue? Is it something to follow up on or more
of a non-issue?

Best,

michel


Re: [ANNOUNCE] New PMC Member: Pablo Estrada

2019-05-15 Thread Tanay Tummalapalli
Congratulations Pablo!

On Wed, May 15, 2019, 12:08 Michael Luckey  wrote:

> Congrats, Pablo!
>
> On Wed, May 15, 2019 at 8:21 AM Connell O'Callaghan 
> wrote:
>
>> Awesome well done Pablo!!!
>>
>> Kenn thank you for sharing this great news with us!!!
>>
>> On Tue, May 14, 2019 at 11:01 PM Ahmet Altay  wrote:
>>
>>> Congratulations!
>>>
>>> On Tue, May 14, 2019 at 9:11 PM Robert Burke  wrote:
>>>
 Woohoo! Well deserved.

 On Tue, May 14, 2019, 8:34 PM Reuven Lax  wrote:

> Congratulations!
>
> *From: *Mikhail Gryzykhin 
> *Date: *Tue, May 14, 2019 at 8:32 PM
> *To: * 
>
> Congratulations Pablo!
>>
>> On Tue, May 14, 2019, 20:25 Kenneth Knowles  wrote:
>>
>>> Hi all,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming Pablo
>>> Estrada to join the PMC.
>>>
>>> Pablo first picked up BEAM-722 in October of 2016 and has been a
>>> steady part of the Beam community since then. In addition to technical 
>>> work
>>> on Beam Python & Java & runners, I would highlight how Pablo grows 
>>> Beam's
>>> community by helping users, working on GSoC, giving talks at Beam 
>>> Summits
>>> and other OSS conferences including Flink Forward, and holding training
>>> workshops. I cannot do justice to Pablo's contributions in a single
>>> paragraph.
>>>
>>> Thanks Pablo, for being a part of Beam.
>>>
>>> Kenn
>>>
>>


Re: [ANNOUNCE] New PMC Member: Pablo Estrada

2019-05-15 Thread Michael Luckey
Congrats, Pablo!

On Wed, May 15, 2019 at 8:21 AM Connell O'Callaghan 
wrote:

> Awesome well done Pablo!!!
>
> Kenn thank you for sharing this great news with us!!!
>
> On Tue, May 14, 2019 at 11:01 PM Ahmet Altay  wrote:
>
>> Congratulations!
>>
>> On Tue, May 14, 2019 at 9:11 PM Robert Burke  wrote:
>>
>>> Woohoo! Well deserved.
>>>
>>> On Tue, May 14, 2019, 8:34 PM Reuven Lax  wrote:
>>>
 Congratulations!

 *From: *Mikhail Gryzykhin 
 *Date: *Tue, May 14, 2019 at 8:32 PM
 *To: * 

 Congratulations Pablo!
>
> On Tue, May 14, 2019, 20:25 Kenneth Knowles  wrote:
>
>> Hi all,
>>
>> Please join me and the rest of the Beam PMC in welcoming Pablo
>> Estrada to join the PMC.
>>
>> Pablo first picked up BEAM-722 in October of 2016 and has been a
>> steady part of the Beam community since then. In addition to technical 
>> work
>> on Beam Python & Java & runners, I would highlight how Pablo grows Beam's
>> community by helping users, working on GSoC, giving talks at Beam Summits
>> and other OSS conferences including Flink Forward, and holding training
>> workshops. I cannot do justice to Pablo's contributions in a single
>> paragraph.
>>
>> Thanks Pablo, for being a part of Beam.
>>
>> Kenn
>>
>


Re: [ANNOUNCE] New PMC Member: Pablo Estrada

2019-05-15 Thread Connell O'Callaghan
Awesome well done Pablo!!!

Kenn thank you for sharing this great news with us!!!

On Tue, May 14, 2019 at 11:01 PM Ahmet Altay  wrote:

> Congratulations!
>
> On Tue, May 14, 2019 at 9:11 PM Robert Burke  wrote:
>
>> Woohoo! Well deserved.
>>
>> On Tue, May 14, 2019, 8:34 PM Reuven Lax  wrote:
>>
>>> Congratulations!
>>>
>>> *From: *Mikhail Gryzykhin 
>>> *Date: *Tue, May 14, 2019 at 8:32 PM
>>> *To: * 
>>>
>>> Congratulations Pablo!

 On Tue, May 14, 2019, 20:25 Kenneth Knowles  wrote:

> Hi all,
>
> Please join me and the rest of the Beam PMC in welcoming Pablo Estrada
> to join the PMC.
>
> Pablo first picked up BEAM-722 in October of 2016 and has been a
> steady part of the Beam community since then. In addition to technical 
> work
> on Beam Python & Java & runners, I would highlight how Pablo grows Beam's
> community by helping users, working on GSoC, giving talks at Beam Summits
> and other OSS conferences including Flink Forward, and holding training
> workshops. I cannot do justice to Pablo's contributions in a single
> paragraph.
>
> Thanks Pablo, for being a part of Beam.
>
> Kenn
>



Re: [ANNOUNCE] New PMC Member: Pablo Estrada

2019-05-15 Thread Ahmet Altay
Congratulations!

On Tue, May 14, 2019 at 9:11 PM Robert Burke  wrote:

> Woohoo! Well deserved.
>
> On Tue, May 14, 2019, 8:34 PM Reuven Lax  wrote:
>
>> Congratulations!
>>
>> *From: *Mikhail Gryzykhin 
>> *Date: *Tue, May 14, 2019 at 8:32 PM
>> *To: * 
>>
>> Congratulations Pablo!
>>>
>>> On Tue, May 14, 2019, 20:25 Kenneth Knowles  wrote:
>>>
 Hi all,

 Please join me and the rest of the Beam PMC in welcoming Pablo Estrada
 to join the PMC.

 Pablo first picked up BEAM-722 in October of 2016 and has been a steady
 part of the Beam community since then. In addition to technical work on
 Beam Python & Java & runners, I would highlight how Pablo grows Beam's
 community by helping users, working on GSoC, giving talks at Beam Summits
 and other OSS conferences including Flink Forward, and holding training
 workshops. I cannot do justice to Pablo's contributions in a single
 paragraph.

 Thanks Pablo, for being a part of Beam.

 Kenn

>>>