Re: Documenting Metrics API?

2018-05-15 Thread Aviem Zur
There is an open task for this on JIRA:

https://issues.apache.org/jira/browse/BEAM-1974

On Tue, May 15, 2018 at 10:56 AM Etienne Chauchot 
wrote:

> Hi Pablo,
> I don't know if it's what you're seeking, but we have at least this doc,
> which is user-facing but a bit old:
>
> https://docs.google.com/document/d/1voyUIQ2DrWkoY-BsJwM8YvF4gGKB76CDG8BYL8XBc7A/edit#heading=h.vv2fbulkp7t
>
> Etienne
>
> Le vendredi 11 mai 2018 à 15:00 -0700, Lukasz Cwik a écrit :
>
> The programming guide doesn't have the information you're looking for. As
> Kenn says, it should, but it currently doesn't.
>
> On Fri, May 11, 2018 at 1:02 PM Kenneth Knowles  wrote:
>
> I think the programming guide needs to have end-user documentation.
>
> Kenn
>
> On Fri, May 11, 2018 at 12:58 PM Lukasz Cwik  wrote:
>
> Are you speaking about metrics related to portability? If so, Alex shared
> this doc a while back: https://s.apache.org/beam-fn-api-metrics
>
> Otherwise, I'm not aware of any metrics related documentation for Apache
> Beam on the website.
>
>
>
> On Fri, May 11, 2018 at 12:02 PM Pablo Estrada  wrote:
>
> Hello all,
> I could not find a place where the Beam Metrics API is well detailed on the
> Beam website. Is there a JIRA tracking this? Perhaps our use-case-driven
> docs + java/pydoc cover it well enough, but I'm not sure that that's the
> case.
>
> Thanks
> -P
> --
> Got feedback? go/pabloem-feedback
> 
>
>
>
>


Re: [ANNOUNCEMENT] New Foundation members!

2018-03-30 Thread Aviem Zur
Congrats!

On Sat, Mar 31, 2018 at 2:30 AM Ahmet Altay  wrote:

> Congratulations to all of you!
>
>
> On Fri, Mar 30, 2018, 4:29 PM Pablo Estrada  wrote:
>
>> Congratulations y'all! Very cool.
>> Best
>> -P.
>>
>> On Fri, Mar 30, 2018 at 4:09 PM Davor Bonaci  wrote:
>>
>>> Now that this is public... please join me in welcoming three newly
>>> elected members of the Apache Software Foundation with ties to this
>>> community, who were elected during the most recent Members' Meeting.
>>>
>>> * Ismaël Mejía (Beam PMC)
>>>
>>> * Josh Wills (Crunch Chair; Beam, DataFu PMC)
>>>
>>> * Holden Karau (Spark, SystemML PMC; Mahout, Subversion committer; Beam
>>> contributor)
>>>
> >>> These individuals have demonstrated merit in the Foundation's growth,
> >>> evolution, and progress. They were recognized, nominated, and elected by
> >>> the existing membership for their significant impact on the Foundation as
> >>> a whole, through both project-related and cross-project activities.
>>>
>>> As members, they now become legal owners and shareholders of the
>>> Foundation. They can vote for the Board, incubate new projects, nominate
>>> new members, participate in any PMC-private discussions, and contribute to
>>> any project.
>>>
>>> (For the Beam community, this election nearly doubles the number of
>>> Foundation members. The new members are joining Jean-Baptiste Onofré,
>>> Stephan Ewen, Romain Manni-Bucau and myself in this role.)
>>>
>>> I'm happy to be able to call all three of you my fellow members.
>>> Congratulations!
>>>
>>>
>>> Davor
>>>
>> --
>> Got feedback? go/pabloem-feedback
>>
>


Re: [INFO] Spark runner updated to Spark 2.2.1

2017-12-18 Thread Aviem Zur
Nice!

On Mon, Dec 18, 2017 at 12:51 PM Jean-Baptiste Onofré 
wrote:

> Hi all,
>
> We are pleased to announce that Spark 2.x support in the Spark runner was
> merged this morning. It supports Spark 2.2.1.
>
> In the same PR, we updated to Scala 2.11, including updating the Flink
> artifacts to 2.11 (meaning we're already ready to upgrade to Flink 1.4!).
>
> It also means, as planned, that Spark 2.x support will be included in next
> Beam
> 2.3.0 release.
>
> Now, we are going to work on improvements in the Spark runner.
>
> If you have any issue with the Spark runner, please let us know.
>
> Thanks !
> Regards
> JB
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Hi

2017-10-03 Thread Aviem Zur
Added to contributors list, welcome aboard!

On Tue, Oct 3, 2017 at 12:45 PM Dennis Jung <inylov...@gmail.com> wrote:

> Hello,
> Thanks Aviem! I'll start with that.
>
> JIRA ID : djkooks
>
> BR,
> Dennis
>
> 2017-10-03 18:23 GMT+09:00 Aviem Zur <aviem...@gmail.com>:
>
> > Hi Dennis,
> >
> > You can take a look at the "Contribute to Beam" page on the website, and
> > most importantly the Contribution Guide https://beam.apache.org/
> > contribute/
> >
> > You can find open "starter" tasks on JIRA using the labels "starter" or
> > "newbie" like so:
> > https://issues.apache.org/jira/issues/?jql=project%20%
> > 3D%20BEAM%20AND%20status%20%3D%20open%20and%20labels%20in%
> > 20(starter%2C%20newbie
> > )
> >
> > Let us know your Apache JIRA username and we can add you as a contributor
> > so you can assign a task to yourself.
> >
> > On Tue, Oct 3, 2017 at 11:55 AM Dennis Jung <inylov...@gmail.com> wrote:
> >
> > > Hello,
> > > I'm Dennis Jung, working as a SW engineer (mostly Java, Python, JS) in
> > > Japan. I also hope to contribute to the Beam project. Though I'm used
> > > to GitHub, this is my first time joining an Apache project.
> > >
> > > Is there some kind of list of simple bug fixes, or a good guide for
> > > contributing? Hope this is not a silly question.
> > >
> > > Thanks.
> > >
> > > 2017-10-03 17:29 GMT+09:00 Yonatan Seneor <ysen...@gmail.com>:
> > >
> > > > thanks :)
> > > >
> > > > On Tue, Oct 3, 2017 at 11:03 AM, Reuven Lax <re...@google.com.invalid
> >
> > > > wrote:
> > > >
> > > > > Welcome!
> > > > >
> > > > > On Mon, Oct 2, 2017 at 4:26 AM, Yonatan Seneor <ysen...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > My name is Yoni Seneor. I am a DevOps engineer at PayPal and part
> > > > > > of a team that uses Beam, and I would like to contribute to the
> > > > > > project. Please add me to the contributor list on JIRA.
> > > > > >
> > > > > > Thanks
> > > > > > Yoni Seneor
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: Hi

2017-10-03 Thread Aviem Zur
Hi Dennis,

You can take a look at the "Contribute to Beam" page on the website, and
most importantly the Contribution Guide https://beam.apache.org/contribute/

You can find open "starter" tasks on JIRA using the labels "starter" or
"newbie" like so:
https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20status%20%3D%20open%20and%20labels%20in%20(starter%2C%20newbie
)

Let us know your Apache JIRA username and we can add you as a contributor
so you can assign a task to yourself.

On Tue, Oct 3, 2017 at 11:55 AM Dennis Jung  wrote:

> Hello,
> I'm Dennis Jung, working as a SW engineer (mostly Java, Python, JS) in Japan.
> I also hope to contribute to the Beam project. Though I'm used to GitHub,
> this is my first time joining an Apache project.
>
> Is there some kind of list of simple bug fixes, or a good guide for
> contributing? Hope this is not a silly question.
>
> Thanks.
>
> 2017-10-03 17:29 GMT+09:00 Yonatan Seneor :
>
> > thanks :)
> >
> > On Tue, Oct 3, 2017 at 11:03 AM, Reuven Lax 
> > wrote:
> >
> > > Welcome!
> > >
> > > On Mon, Oct 2, 2017 at 4:26 AM, Yonatan Seneor 
> > wrote:
> > >
> > > > My name is Yoni Seneor. I am a DevOps engineer at PayPal and part of
> > > > a team that uses Beam, and I would like to contribute to the project.
> > > > Please add me to the contributor list on JIRA.
> > > >
> > > > Thanks
> > > > Yoni Seneor
> > > >
> > >
> >
>


Re: Hi

2017-10-03 Thread Aviem Zur
Srinivas you have been added.
Welcome aboard!

On Tue, Oct 3, 2017 at 10:27 AM Srinivas Reddy 
wrote:

> Hi,
>
> I am Srinivas, Sr. Data Engineer at Kogentix. I would like to contribute to
> the Beam project.
>
> Could you add me to the contributor list? My JIRA username is: mrsrinivas
>
>
> On 3 October 2017 at 05:33, Lukasz Cwik  wrote:
>
> > You have been added. Welcome.
> >
> > On Mon, Oct 2, 2017 at 2:15 AM, Yonatan Seneor 
> wrote:
> >
> > > Hi
> > > My username  on Apache JIRA is: yseneor
> > > Thanks
> > > Yoni Seneor
> > >
> > >
> > > On Mon, Oct 2, 2017 at 11:26, Yonatan Seneor <ysen...@gmail.com> wrote:
> > >
> > > > My name is Yoni Seneor. I am a DevOps engineer at PayPal and part of
> > > > a team that uses Beam, and I would like to contribute to the project.
> > > > Please add me to the contributor list on JIRA.
> > > >
> > > > Thanks
> > > > Yoni Seneor
> > > >
> > >
> >
>


Re: Hi

2017-10-03 Thread Aviem Zur
Welcome Yoni

On Tue, Oct 3, 2017 at 3:03 AM Lukasz Cwik  wrote:

> You have been added. Welcome.
>
> On Mon, Oct 2, 2017 at 2:15 AM, Yonatan Seneor  wrote:
>
> > Hi
> > My username  on Apache JIRA is: yseneor
> > Thanks
> > Yoni Seneor
> >
> >
> > On Mon, Oct 2, 2017 at 11:26, Yonatan Seneor <ysen...@gmail.com> wrote:
> >
> > > My name is Yoni Seneor. I am a DevOps engineer at PayPal and part of
> > > a team that uses Beam, and I would like to contribute to the project.
> > > Please add me to the contributor list on JIRA.
> > >
> > > Thanks
> > > Yoni Seneor
> > >
> >
>


Re: Contributor introduction

2017-10-02 Thread Aviem Zur
Added, welcome Uri!

On Mon, Oct 2, 2017 at 3:32 PM Jean-Baptiste Onofré  wrote:

> Hi Uri,
>
> what's your Jira ID ?
>
> Thanks,
> Regards
> JB
>
> On 10/02/2017 02:31 PM, Uri Silberstein wrote:
> > Hi all,
> >
> > My name is Uri Silberstein and I am part of a PayPal team that works with
> > Beam.
> >
> > I would like to contribute to the Project.
> >
> > Please add me to the contributor list on JIRA, so I can assign to myself
> a
> > task that I've just opened.
> >
> > Thanks,
> >
> > Uri Silberstein
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: new guy

2017-08-31 Thread Aviem Zur
Welcome JB #2!

Glad to have you on board.

On Tue, Aug 29, 2017 at 5:38 PM Joey Baruch  wrote:

> my jira username is joeyfezster
>
> thanks
>
> On Tue, Aug 29, 2017 at 4:12 PM Jean-Baptiste Onofré 
> wrote:
>
> > Welcome !
> >
> > What's your apache id ?
> >
> > Regards
> > JB
> >
> > On 08/29/2017 02:57 PM, Joey Baruch wrote:
> > > Hey everyone,
> > >
> > > Apache Beam looks like a pretty exciting new project, and I'd love to
> > > contribute to it.
> > > I'm a relatively fresh developer, but I'm looking to learn by doing.
> > >
> > > I'd appreciate being added as a contributor on JIRA.
> > >
> > > Thanks!
> > > Joey
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>


Re: kafka docs

2017-08-29 Thread Aviem Zur
Hi Joey.

This would be great. Also, KafkaIO requires a specific dependency to be
added (beam-sdks-java-io-kafka), so we should probably include that as a
Maven snippet in the README as well. Feel free to create a PR with this
README on GitHub.
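For instance, the README snippet could look roughly like this (the version
shown here is illustrative; use whatever Beam release you depend on):

```xml
<!-- KafkaIO ships as a separate artifact from the core Java SDK -->
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-io-kafka</artifactId>
  <version>2.1.0</version>
</dependency>
```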

Regarding the long series of links you need to click on the site in order
to get to these docs, I agree; perhaps we can reduce the number of clicks.
Feel free to create a PR on the beam-site repo on GitHub.

Welcome to the community!

On Tue, Aug 29, 2017 at 11:33 AM Jean-Baptiste Onofré 
wrote:

> Agree to add the link to javadoc on the I/O list:
>
> https://beam.apache.org/documentation/io/built-in/
>
> Regards
> JB
>
> On 08/29/2017 10:28 AM, Joey Baruch wrote:
> > Hey all,
> >
> > As a new user trying to use a KafkaIO source/sink, I couldn't easily find
> > any documentation.
> > The documentation page <https://beam.apache.org/documentation/io/built-in/>
> > (which you get to from header -> documentation -> pipeline i/o -> built-in
> > i/o transforms) leads to the KafkaIO class, but there are no docs there.
> >
> > I will add a simple README that points to the class's javadocs.
> >
> > regards
> > Joey Baruch
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Policy for stale PRs

2017-08-16 Thread Aviem Zur
Makes sense to close after a long period of inactivity and no response; as
Kenn mentioned, they can always be re-opened.

On Wed, Aug 16, 2017 at 12:20 AM Jean-Baptiste Onofré 
wrote:

> If we consider the author, it makes sense.
>
> Regards
> JB
>
> On Aug 15, 2017, at 01:29, Ted Yu  wrote:
> >The proposal makes sense.
> >
> >If the author of the PR doesn't respond for 90 days, the PR is likely
> >out of sync with the current repo.
> >
> >Cheers
> >
> >On Mon, Aug 14, 2017 at 5:27 PM, Ahmet Altay 
> >wrote:
> >
> >> Hi all,
> >>
> >> Do we have an existing policy for handling stale PRs? If not could we
> >come
> >> up with one. We are getting close to 100 open PRs. Some of the open
> >PRs
> >> have not been touched for a while, and if we exclude the pings the
> >number
> >> will be higher.
> >>
> >> For example, we could close PRs that have not been updated by the
> >original
> >> author for 90 days even after multiple attempts to reach them (e.g.
> >[1],
> >> [2] are such PRs.)
> >>
> >> What do you think?
> >>
> >> Thank you,
> >> Ahmet
> >>
> >> [1] https://github.com/apache/beam/pull/1464
> >> [2] https://github.com/apache/beam/pull/2949
> >>
>
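The proposed 90-day rule would also be easy to automate. A minimal sketch of
the staleness check (the PR fields below are hypothetical, not the real
GitHub API shape):

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(days=90)

def stale_prs(prs, now=None):
    """Return PRs whose last author update is older than STALE_AFTER.

    Each PR is a dict with hypothetical fields 'number' and
    'last_author_update' (a datetime).
    """
    now = now or datetime.utcnow()
    return [pr for pr in prs if now - pr["last_author_update"] > STALE_AFTER]

prs = [
    {"number": 1464, "last_author_update": datetime(2017, 1, 10)},
    {"number": 2949, "last_author_update": datetime(2017, 8, 1)},
]
print([pr["number"] for pr in stale_prs(prs, now=datetime(2017, 8, 15))])
# -> [1464]
```

In practice, a bot with such a check would still need the "multiple attempts
to reach the author" step before closing anything.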


Re: [ANNOUNCEMENT] New committers, August 2017 edition!

2017-08-15 Thread Aviem Zur
Congrats!

On Mon, Aug 14, 2017 at 6:43 PM Tyler Akidau 
wrote:

> Congrats and thanks all around!
>
> On Sat, Aug 12, 2017 at 12:09 AM Aljoscha Krettek 
> wrote:
>
> > Congrats, everyone! It's well deserved.
> >
> > Best,
> > Aljoscha
> >
> > > On 12. Aug 2017, at 08:06, Pei HE  wrote:
> > >
> > > Congratulations to all!
> > > --
> > > Pei
> > >
> > > On Sat, Aug 12, 2017 at 10:50 AM, James  wrote:
> > >
> > >> Thank you, guys; glad to contribute to this great project.
> > >> Congratulations to all the new committers!
> > >>
> > >> On Sat, Aug 12, 2017 at 8:36 AM Manu Zhang 
> > >> wrote:
> > >>
> > >>> Thanks everyone !!! It's a great journey.
> > >>> Congrats to other new committers !
> > >>>
> > >>> Thanks,
> > >>> Manu
> > >>>
> > >>> On Sat, Aug 12, 2017 at 5:23 AM Jean-Baptiste Onofré <
> j...@nanthrax.net>
> > >>> wrote:
> > >>>
> >  Congrats and welcome !
> > 
> >  Regards
> >  JB
> > 
> >  On 08/11/2017 07:40 PM, Davor Bonaci wrote:
> > > Please join me and the rest of Beam PMC in welcoming the following
> > > contributors as our newest committers. They have significantly
> >  contributed
> > > to the project in different ways, and we look forward to many more
> > > contributions in the future.
> > >
> > > * Reuven Lax
> > > Reuven has been with the project since the very beginning,
> > >> contributing
> > > mostly to the core SDK and the GCP IO connectors. He accumulated 52
> >  commits
> > > (19,824 ++ / 12,039 --). Most recently, Reuven re-wrote several IO
> > > connectors that significantly expanded their functionality.
> > >>> Additionally,
> > > Reuven authored important new design documents relating to update
> and
> > > snapshot functionality.
> > >
> > > * Jingsong Lee
> > > Jingsong has been contributing to Apache Beam since the beginning
> of
> > >>> the
> > > year, particularly to the Flink runner. He has accumulated 34
> commits
> > > (11,214 ++ / 6,314 --) of deep, fundamental changes that
> > >> significantly
> > > improved the quality of the runner. Additionally, Jingsong has
> >  contributed
> > > to the project in other ways too -- reviewing contributions, and
> > > participating in discussions on the mailing list, design documents,
> > >> and
> > > JIRA issue tracker.
> > >
> > > * Mingmin Xu
> > > Mingmin started the SQL DSL effort, and has driven it to the point
> of
> > > merging to the master branch. In this effort, he extended the
> project
> > >>> to
> > > the significant new user community.
> > >
> > > * Mingming (James) Xu
> > > James joined the SQL DSL effort, contributing some of the trickier
> > >>> parts,
> > > such as the Join functionality. Additionally, he's consistently
> shown
> > > himself to be an insightful code reviewer, significantly impacting
> > >> the
> > > project’s code quality and ensuring the success of the new major
> >  component.
> > >
> > > * Manu Zhang
> > > Manu initiated and developed a runner for the Apache Gearpump
> >  (incubating)
> > > engine, and has driven it to the point of merging to the master
> > >> branch.
> >  In
> > > this effort, he accumulated 65 commits (7,812 ++ / 4,882 --) and
> > >>> extended
> > > the project to the new user community.
> > >
> > > Congratulations to all five! Welcome!
> > >
> > > Davor
> > >
> > 
> >  --
> >  Jean-Baptiste Onofré
> >  jbono...@apache.org
> >  http://blog.nanthrax.net
> >  Talend - http://www.talend.com
> > 
> > >>>
> > >>
> >
> >
>


Re: [CANCEL][VOTE] Release 2.1.0, release candidate #2

2017-07-24 Thread Aviem Zur
We also have two tests failing in the Spark runner, as detailed in the
following two tickets:
https://issues.apache.org/jira/browse/BEAM-2670
https://issues.apache.org/jira/browse/BEAM-2671

On Mon, Jul 24, 2017 at 11:44 AM Jean-Baptiste Onofré 
wrote:

> Hi all,
>
> due to https://issues.apache.org/jira/browse/BEAM-2662, I cancel this
> vote.
>
> We also have a build issue with the Spark runner that I would like to fix
> for RC3:
>
>
> https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_ValidatesRunner_Spark/2446/
>
> So, we are going to work on the Spark runner test fix for RC3 (BEAM-2662 is
> already fixed on release-2.1.0 branch).
>
> I will submit RC3 to vote as soon as Spark runner tests are fully OK.
>
> Regards
> JB
>
> On 07/18/2017 06:30 PM, Jean-Baptiste Onofré wrote:
> > Hi everyone,
> >
> > Please review and vote on the release candidate #2 for the version
> 2.1.0, as
> > follows:
> >
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> >
> > The complete staging area is available for your review, which includes:
> > * JIRA release notes [1],
> > * the official Apache source release to be deployed to dist.apache.org
> [2],
> > which is signed with the key with fingerprint C8282E76 [3],
> > * all artifacts to be deployed to the Maven Central Repository [4],
> > * source code tag "v2.1.0-RC2" [5],
> > * website pull request listing the release and publishing the API
> reference
> > manual [6].
> > * Python artifacts are deployed along with the source release to the
> > dist.apache.org [2].
> >
> > The vote will be open for at least 72 hours. It is adopted by majority
> approval,
> > with at least 3 PMC affirmative votes.
> >
> > Thanks,
> > JB
> >
> > [1]
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12340528
> >
> > [2] https://dist.apache.org/repos/dist/dev/beam/2.1.0/
> > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> > [4]
> https://repository.apache.org/content/repositories/orgapachebeam-1019/
> > [5] https://github.com/apache/beam/tree/v2.1.0-RC2
> > [6] https://github.com/apache/beam-site/pull/270
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [VOTE] Release 2.1.0, release candidate #2

2017-07-20 Thread Aviem Zur
Thanks Kenn for the info,
+1 that this should be included in a release verification guide.

On Thu, Jul 20, 2017 at 2:07 PM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Hi Aljoscha
>
> Do you have all python requirements installed on your machine ?
>
> Especially, pip, setuptools, tox, ... ?
>
> It sounds like a missing Python requirement on your machine to me.
>
> Regards
> JB
>
> On 07/20/2017 10:36 AM, Aljoscha Krettek wrote:
> > + 0.8
> >
> > I tried running "mvn package" on my machine, but it fails. This is the
> log output:
> https://gist.github.com/aljoscha/dc194303bede8bc635e2d8b691bb58f8 <
> https://gist.github.com/aljoscha/dc194303bede8bc635e2d8b691bb58f8>. It
> fails when trying to build the Python part. Unfortunately I know almost
> nothing about this, but one thing that caught my attention was this snippet:
> "RuntimeError: Not in apache git tree; unable to find proto definitions.".
> Could this be a problem?
> >
> > Otherwise I’d say the RC is good and I did this:
> >
> >   - verified the checksums and signature
> >   - checked that LICENSE and NOTICE are present
> >   - used the staged artefacts to create a Quickstart project
> >   - checked that the compiled Quickstart works with a Flink 1.2.1
> cluster in batch and streaming mode
> >
> >> On 19. Jul 2017, at 19:48, Kenneth Knowles <k...@google.com.INVALID>
> wrote:
> >>
> >> +1 to the RC
> >>
> >> Relating to Aviem's question, I think we need a release verification
> guide,
> >> at the least as a section of the Release Guide. But if we follow
> through on
> >> the prior thread of having a validation matrix with manual steps people
> >> sign up for, that is even better, and saves repeated work.
> >>
> >> As notes towards this, below this email I have included the steps to go
> >> through the Java quickstart with the RC on DirectRunner and
> DataflowRunner,
> >> which I have done. It should be also easy for other runners.
> >>
> >> I think the full ValidatesRunner suite run against the release-2.1.0
> branch
> >> is adequate rather than trying to run them all against the RC, since we
> are
> >> not trying (necessarily) to test the release plugin. I wouldn't
> _oppose_
> >> an easy way to run the full suite... (maybe just a fancy command line?)
> >>
> >> Kenn
> >>
> >>
> >> <settings>
> >>   <activeProfiles>
> >>     <activeProfile>beam-2.1.0</activeProfile>
> >>   </activeProfiles>
> >>   <profiles>
> >>     <profile>
> >>       <id>beam-2.1.0</id>
> >>       <repositories>
> >>         <repository>
> >>           <id>beam-2.1.0</id>
> >>           <url>https://repository.apache.org/content/repositories/orgapachebeam-1019/</url>
> >>         </repository>
> >>         <repository>
> >>           <id>archetype</id>
> >>           <url>https://repository.apache.org/content/repositories/orgapachebeam-1019/</url>
> >>         </repository>
> >>       </repositories>
> >>     </profile>
> >>   </profiles>
> >> </settings>
> >>
> >> mvn archetype:generate \
> >>   --settings settings.xml \
> >>   -P beam-2.1.0 \
> >>   -D archetypeGroupId=org.apache.beam \
> >>   -D archetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
> >>   -D archetypeVersion=2.1.0 \
> >>   -D groupId=org.example \
> >>   -D artifactId=wordcountbeam \
> >>   -D version="0.1" \
> >>   -D package=org.apache.beam.examples \
> >>   -D interactiveMode=false
> >>
> >> cd wordcountbeam
> >>
> >> mvn compile exec:java \
> >> --settings ../settings.xml \
> >> -Pdirect-runner \
> >> -D exec.mainClass=org.apache.beam.examples.WordCount \
> >> -D exec.args="--inputFile=pom.xml --output=counts"
> >>
> >>
> >> mvn compile exec:java \
> >> --settings ../settings.xml \
> >> -P dataflow-runner \
> >> -D exec.mainClass=org.apache.beam.examples.WordCount \
> >> -D exec.args="--runner=DataflowRunner --project=
> >> --gcpTempLocation=gs:///tmp
> >> --inputFile=gs://apache-beam-samples/shakespeare/*
> >> --output=gs:///counts"
> >>
> >> On Wed, Jul 19, 2017 at 10:19 AM, Jean-Baptiste Onofré <j...@nanthrax.net
> >
> >> wrote:
> >>
> >>> I don't understand as all jars are on the Nexus staging repository.
> >>> The zip are also on staging repository.
> >>>
> >>> Regards
> >>> J

Re: [VOTE] Release 2.1.0, release candidate #2

2017-07-19 Thread Aviem Zur
@JB

Hi, yes, I saw that link; however, those appear to be just the sources, not
jars.
Do we have built RC jars for us to validate, which are then deployed as-is
to dist (renamed to remove -RC and so forth), or do we each compile these
manually, trusting that the sources in the dist are the actual ones that
the final jars will be built from?

On Wed, Jul 19, 2017 at 7:16 PM Ahmet Altay <al...@google.com.invalid>
wrote:

> Yes, +1 on RC2.
>
> On Wed, Jul 19, 2017 at 5:10 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
> > Hi Aviem,
> >
> > as mentioned in the first e-mail:
> >
> > - Distributions are available here:
> > https://dist.apache.org/repos/dist/dev/beam/2.1.0/
> >
> > - Artifacts are on the staging repository:
> > https://repository.apache.org/content/repositories/orgapachebeam-1019/
> >
> > Regards
> > JB
> >
> >
> > On 07/19/2017 12:26 PM, Aviem Zur wrote:
> >
> >> Have the jars for RC2 been uploaded somewhere?
> >>
> >> On Wed, Jul 19, 2017 at 10:19 AM Jean-Baptiste Onofré <j...@nanthrax.net>
> >> wrote:
> >>
> >> So, I guess you are voting +1 on RC2, correct (just for the tracking) ?
> >>>
> >>> Thanks,
> >>> Regards
> >>> JB
> >>>
> >>> On 07/19/2017 08:00 AM, Ahmet Altay wrote:
> >>>
> >>>> Thank you JB.
> >>>>
> >>>> I validated python wordcount and mobile gaming examples on Linux.
> Found
> >>>>
> >>> one
> >>>
> >>>> issue (https://issues.apache.org/jira/browse/BEAM-2636). This does
> not
> >>>>
> >>> need
> >>>
> >>>> to be a blocking issue for RC2, but if we end up having a RC3 we
> should
> >>>> consider fixing this issue.
> >>>>
> >>>> Ahmet
> >>>>
> >>>> On Tue, Jul 18, 2017 at 4:18 PM, Mingmin Xu <mingm...@gmail.com>
> wrote:
> >>>>
> >>>> Thanks Kenn. SQL DSL should be ready in the next version 2.2.0, and
> >>>>>
> >>>> agree
> >>>
> >>>> to have an overall row "Add SQL DSL" instead of listing all the
> detailed
> >>>>> tasks.
> >>>>>
> >>>>> On Tue, Jul 18, 2017 at 3:54 PM, Kenneth Knowles
> >>>>> <k...@google.com.invalid
> >>>>>
> >>>>
> >>>> wrote:
> >>>>>
> >>>>> Done.
> >>>>>>
> >>>>>> Since it is all on a feature branch and the release notes when it
> goes
> >>>>>>
> >>>>> to
> >>>
> >>>> master will include "Add SQL DSL" I did not associate the little bits
> >>>>>>
> >>>>> with
> >>>>>
> >>>>>> a release.
> >>>>>>
> >>>>>> On Tue, Jul 18, 2017 at 2:51 PM, Mingmin Xu <mingm...@gmail.com>
> >>>>>>
> >>>>> wrote:
> >>>
> >>>>
> >>>>>> The tasks of SQL should not be labeled as 2.1.0, I've updated some
> >>>>>>>
> >>>>>> with
> >>>
> >>>> 2.2.0, fail to change the 'closed' ones. Can anyone with the
> >>>>>>>
> >>>>>> permission
> >>>
> >>>> update these tasks
> >>>>>>> https://issues.apache.org/jira/browse/BEAM-2171?jql=
> >>>>>>> project%20%3D%20BEAM%20AND%20fixVersion%20%3D%202.1.0%
> >>>>>>> 20AND%20component%20%3D%20dsl-sql?
> >>>>>>>
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>> Mingmin
> >>>>>>>
> >>>>>>> On Tue, Jul 18, 2017 at 2:23 PM, Jean-Baptiste Onofré <
> >>>>>>>
> >>>>>> j...@nanthrax.net
> >>>
> >>>>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> Yeah, indeed, the issue like BEAM-2171 should not have "Fix
> Version"
> >>>>>>>>
> >>>>>>> set
> >>>>>>
> >>>>>>> to 2.1.0.
> >>>>>>>>
> >>>>>>>> Regards
> >>>>>>>

Re: [VOTE] Release 2.1.0, release candidate #2

2017-07-19 Thread Aviem Zur
Have the jars for RC2 been uploaded somewhere?

On Wed, Jul 19, 2017 at 10:19 AM Jean-Baptiste Onofré 
wrote:

> So, I guess you are voting +1 on RC2, correct (just for the tracking) ?
>
> Thanks,
> Regards
> JB
>
> On 07/19/2017 08:00 AM, Ahmet Altay wrote:
> > Thank you JB.
> >
> > I validated python wordcount and mobile gaming examples on Linux. Found
> one
> > issue (https://issues.apache.org/jira/browse/BEAM-2636). This does not
> need
> > to be a blocking issue for RC2, but if we end up having a RC3 we should
> > consider fixing this issue.
> >
> > Ahmet
> >
> > On Tue, Jul 18, 2017 at 4:18 PM, Mingmin Xu  wrote:
> >
> >> Thanks Kenn. SQL DSL should be ready in the next version 2.2.0, and
> agree
> >> to have an overall row "Add SQL DSL" instead of listing all the detailed
> >> tasks.
> >>
> >> On Tue, Jul 18, 2017 at 3:54 PM, Kenneth Knowles  >
> >> wrote:
> >>
> >>> Done.
> >>>
> >>> Since it is all on a feature branch and the release notes when it goes
> to
> >>> master will include "Add SQL DSL" I did not associate the little bits
> >> with
> >>> a release.
> >>>
> >>> On Tue, Jul 18, 2017 at 2:51 PM, Mingmin Xu 
> wrote:
> >>>
>  The tasks of SQL should not be labeled as 2.1.0, I've updated some
> with
>  2.2.0, fail to change the 'closed' ones. Can anyone with the
> permission
>  update these tasks
>  https://issues.apache.org/jira/browse/BEAM-2171?jql=
>  project%20%3D%20BEAM%20AND%20fixVersion%20%3D%202.1.0%
>  20AND%20component%20%3D%20dsl-sql?
> 
> 
>  Thanks!
>  Mingmin
> 
>  On Tue, Jul 18, 2017 at 2:23 PM, Jean-Baptiste Onofré <
> j...@nanthrax.net
> >>>
>  wrote:
> 
> > Yeah, indeed, the issue like BEAM-2171 should not have "Fix Version"
> >>> set
> > to 2.1.0.
> >
> > Regards
> > JB
> >
> > On 07/18/2017 06:52 PM, James wrote:
> >
> >> Just noticed that some of the DSL_SQL issues are included in this
>  release?
> >> e.g. the first one, BEAM-2171; this is not expected, right?
> >> On Wed, 19 Jul 2017 at 12:30 AM Jean-Baptiste Onofré <
> >> j...@nanthrax.net
> 
> >> wrote:
> >>
> >> Hi everyone,
> >>>
> >>> Please review and vote on the release candidate #2 for the version
>  2.1.0,
> >>> as
> >>> follows:
> >>>
> >>> [ ] +1, Approve the release
> >>> [ ] -1, Do not approve the release (please provide specific
> >> comments)
> >>>
> >>>
> >>> The complete staging area is available for your review, which
> >>> includes:
> >>> * JIRA release notes [1],
> >>> * the official Apache source release to be deployed to
> >>> dist.apache.org
> >>> [2],
> >>> which is signed with the key with fingerprint C8282E76 [3],
> >>> * all artifacts to be deployed to the Maven Central Repository [4],
> >>> * source code tag "v2.1.0-RC2" [5],
> >>> * website pull request listing the release and publishing the API
> >>> reference
> >>> manual [6].
> >>> * Python artifacts are deployed along with the source release to
> >> the
> >>> dist.apache.org [2].
> >>>
> >>> The vote will be open for at least 72 hours. It is adopted by
> >>> majority
> >>> approval,
> >>> with at least 3 PMC affirmative votes.
> >>>
> >>> Thanks,
> >>> JB
> >>>
> >>> [1]
> >>>
> >>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12340528
> >>> [2] https://dist.apache.org/repos/dist/dev/beam/2.1.0/
> >>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> >>> [4] https://repository.apache.org/content/repositories/orgapache
> >>> beam-1019/
> >>> [5] https://github.com/apache/beam/tree/v2.1.0-RC2
> >>> [6] https://github.com/apache/beam-site/pull/270
> >>>
> >>>
> >>
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
> 
> 
> 
>  --
>  
>  Mingmin
> 
> >>>
> >>
> >>
> >>
> >> --
> >> 
> >> Mingmin
> >>
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [DISCUSS] Bridge beam metrics to underlying runners to support metrics reporters?

2017-06-22 Thread Aviem Zur
Hi Cody,

Some of the runners have their own metrics sink, for example Spark runner
uses Spark's metrics sink which you can configure to send the metrics to
backends such as Graphite.

There have been ideas floating around for a Beam metrics sink extension
which will allow users to send Beam metrics to various metrics backends, I
believe @JB is working on something along these lines.
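As an example of the first point, the Spark runner picks up whatever sinks
are configured in Spark's standard metrics.properties; a Graphite setup
looks roughly like this (host and port are placeholders):

```properties
# conf/metrics.properties on the Spark cluster
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
```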

On Thu, Jun 22, 2017 at 2:00 PM Cody Innowhere  wrote:

> Hi guys,
> Currently metrics are implemented in runners/core as CounterCell,
> GaugeCell, DistributionCell, etc. If we want to send metrics to external
> systems via metrics reporter, we would have to define another set of
> metrics, say, codahale metrics, and update codahale metrics periodically
> with beam sdk metrics, which is inconvenient and inefficient.
>
> Another problem is that Meter/Histogram cannot be updated directly in this
> way because their internal data decays over time.
>
> My opinion would be to bridge Beam SDK metrics to the underlying runners so
> that updates apply directly to the underlying runners (Flink, Spark, etc.)
> without conversion.
>
> Specifically, currently we already delegate
> Metrics.counter/gauge/distribution to DelegatingCounter/Gauge/Distribution,
> which uses MetricsContainer to store the actual metrics with the
> implementation of MetricsContainerImpl. If we can add an API in
> MetricsEnvironment to allow runners to override the default implementation,
> say, for flink, we have FlinkMetricsContainerImpl, then all metric updates
> will directly apply to metrics in FlinkMetricsContainerImpl without
> intermediate conversion and updates. And since the metrics are
> runner-specific, it would be a lot easier to support metrics reporters as
> well as Meters/Histograms.
>
> What do you think?
>
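For illustration only, the delegation Cody describes can be sketched as
follows (Python standing in for Beam's Java classes; FlinkMetricsContainer
and the method names are hypothetical, not Beam's actual API):

```python
class MetricsContainer:
    """Default container: stores counter values itself (an intermediate copy)."""
    def __init__(self):
        self.counters = {}

    def inc_counter(self, name, n=1):
        self.counters[name] = self.counters.get(name, 0) + n


class FlinkMetricsContainer(MetricsContainer):
    """Hypothetical runner-specific override: writes straight into the
    runner's own metric registry, skipping the intermediate copy."""
    def __init__(self, runner_metrics):
        super().__init__()
        self.runner_metrics = runner_metrics

    def inc_counter(self, name, n=1):
        self.runner_metrics[name] = self.runner_metrics.get(name, 0) + n


class MetricsEnvironment:
    """Process-wide hook the proposal would let a runner override."""
    container = MetricsContainer()

    @classmethod
    def set_container(cls, container):
        cls.container = container


class DelegatingCounter:
    """User-facing counter: every inc() goes to whichever container is active."""
    def __init__(self, name):
        self.name = name

    def inc(self, n=1):
        MetricsEnvironment.container.inc_counter(self.name, n)


# A runner swaps in its own container; user pipeline code is unchanged.
runner_metrics = {}
MetricsEnvironment.set_container(FlinkMetricsContainer(runner_metrics))
counter = DelegatingCounter("elements_processed")
counter.inc(3)
counter.inc()
print(runner_metrics)  # -> {'elements_processed': 4}
```

The point of the pattern is that user code only ever touches the delegating
counter, so a runner can redirect updates without any periodic conversion
step.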


Re: low availability in the coming 4 weeks

2017-05-30 Thread Aviem Zur
Congratulations!

On Fri, May 26, 2017 at 9:21 AM Kenneth Knowles 
wrote:

> Congrats!
>
> On Thu, May 25, 2017 at 2:00 PM, Raghu Angadi 
> wrote:
>
> > Congrats Mingmin. All the best!
> >
> > On Wed, May 24, 2017 at 8:33 PM, Mingmin Xu  wrote:
> >
> > > Hello everyone,
> > >
> > > I'll take 4 weeks off to take care of my new born baby. I'm very glad
> > that
> > > James Xu agrees to take my role in Beam SQL feature.
> > >
> > > Ps, I'll consolidate the PR for BEAM-2010 soon before that.
> > >
> > > Thank you!
> > > 
> > > Mingmin
> > >
> >
>


Re: First stable release completed!

2017-05-17 Thread Aviem Zur
Awesome! Now let's make Beam the standard in data processing.

On Thu, May 18, 2017 at 5:05 AM Jason Kuster 
wrote:

> Fantastic work everyone! I'm really excited to see what we've accomplished,
> and the future for Beam looks bright.
>
> On Wed, May 17, 2017 at 2:00 PM, Mark Liu 
> wrote:
>
> > Congratulations!
> >
> > On Wed, May 17, 2017 at 1:25 PM, Ismaël Mejía  wrote:
> >
> > > Amazing milestone, congrats everyone!
> > >
> > > On Wed, May 17, 2017 at 7:54 PM, Reuven Lax 
> > > wrote:
> > > > Sweet!
> > > >
> > > > On Wed, May 17, 2017 at 4:28 AM, Davor Bonaci 
> > wrote:
> > > >
> > > >> The first stable release is now complete!
> > > >>
> > > >> Release artifacts are available through various repositories,
> > including
> > > >> dist.apache.org, Maven Central, and PyPI. The website is updated,
> and
> > > >> announcements are published.
> > > >>
> > > >> Apache Software Foundation press release:
> > > >> http://globenewswire.com/news-release/2017/05/17/986839/0/
> > > >> en/The-Apache-Software-Foundation-Announces-Apache-Beam-v2-0-0.html
> > > >>
> > > >> Beam blog:
> > > >> https://beam.apache.org/blog/2017/05/17/beam-first-stable-
> > release.html
> > > >>
> > > >> Congratulations to everyone -- this is a really big milestone for
> the
> > > >> project, and I'm proud to be a part of this great community.
> > > >>
> > > >> Davor
> > > >>
> > >
> >
>
>
>
> --
> ---
> Jason Kuster
> Apache Beam / Google Cloud Dataflow
>


Re: Website homepage visual refresh

2017-05-16 Thread Aviem Zur
Cool!

On Wed, May 17, 2017 at 12:33 PM Mark Liu 
wrote:

> This is awesome! thanks Jeremy.
>
> On Tue, May 16, 2017 at 10:49 AM, Sourabh Bajaj <
> sourabhba...@google.com.invalid> wrote:
>
> > +1 this is great.
> >
> > On Tue, May 16, 2017 at 10:18 AM Jesse Anderson <
> je...@bigdatainstitute.io
> > >
> > wrote:
> >
> > > Nice work!
> > >
> > > On Tue, May 16, 2017 at 10:09 AM Davor Bonaci 
> wrote:
> > >
> > > > I think it is great too -- since it is an obvious improvement, let's
> > > merge
> > > > and iterate!
> > > >
> > > > On Tue, May 16, 2017 at 6:06 AM, Jean-Baptiste Onofré <
> j...@nanthrax.net
> > >
> > > > wrote:
> > > >
> > > > > Hi Jeremy,
> > > > >
> > > > > great job ! I like the new look'n feel.
> > > > >
> > > > > Thanks !
> > > > > Regards
> > > > > JB
> > > > >
> > > > >
> > > > > On 05/16/2017 07:44 AM, Jeremy Weinstein wrote:
> > > > >
> > > > >> Hi Beam community! fran...@apache.org and I have been working on
> a
> > > > >> project
> > > > >> to refresh the visual design of the Beam website. We have the
> > > following
> > > > >> few
> > > > >> goals:
> > > > >>
> > > > >> a) Breathe some life into the website homepage
> > > > >> b) Simplify and clean up the project's CSS and various supporting
> > > files
> > > > >> c) Make it a little more fun and engaging for new developers to
> > start
> > > > >> learning about Beam and enter into the content
> > > > >> d) Help explain Beam to passive and interested non-users
> > > > >>
> > > > >> I'd like the community's help on a few things.
> > > > >>
> > > > >> 1) First and foremost, any feedback on the design update is
> welcome.
> > > > >> 2) Secondly, there is a section on the homepage for
> > > testimonials/quotes
> > > > >> from Beam users and/or organizations about their usage of Beam. We
> > > could
> > > > >> set this up on a rotational basis to cycle through quotes, but to
> > > start,
> > > > >> if
> > > > >> anyone knows of any good quotes, posts, or tweets about Beam, I'd
> > like
> > > > to
> > > > >> source those and place them into the "A collaborative effort"
> > section.
> > > > >> Please send them over to me and I can flow them into the build.
> > > > >>
> > > > >> We're hoping to refresh the site before or soon after the first
> > stable
> > > > >> release. For this first pass we've focused on the main landing
> page,
> > > but
> > > > >> next up we'd like to improve several of the inside pages, as well
> as
> > > > >> update
> > > > >> the code toggles, and simplify a bit of the navigational
> structure.
> > > > >>
> > > > >> Sending this PR [1] out now as an FYI and to solicit feedback.
> We'll
> > > > make
> > > > >> a
> > > > >> few more improvements based on suggestions, as well as a few
> tweaks
> > to
> > > > >> TODOs in the header and footer. Feedback is welcome - thanks
> > everyone!
> > > > >>
> > > > >> [1] https://github.com/apache/beam-site/pull/244 +
> > > > >> http://apache-beam-website-pull-requests.storage.googleapis.
> > > > >> com/244/index.html
> > > > >>
> > > > >>
> > > > > --
> > > > > Jean-Baptiste Onofré
> > > > > jbono...@apache.org
> > > > > http://blog.nanthrax.net
> > > > > Talend - http://www.talend.com
> > > > >
> > > >
> > > --
> > > Thanks,
> > >
> > > Jesse
> > >
> >
>


Re: Process for getting the first stable release out

2017-05-05 Thread Aviem Zur
+1.

A document similar to the one we had for the Hackathon could serve us here
again.
A section for acceptance criteria compiled by the community and a matrix of
tests per runner to be filled for each RC version could help us synchronize
and get there.

On Fri, May 5, 2017 at 10:42 PM Dan Halperin  wrote:

> I am +1 on cutting the branch, and the sentiment that we expect the first
> pancake
> 
> will
> be not ready to serve customers.
>
> On Fri, May 5, 2017 at 11:40 AM, Kenneth Knowles 
> wrote:
>
> > On Thu, May 4, 2017 at 12:07 PM, Davor Bonaci  wrote:
> >
> > > I'd like to propose the following (tweaked) process for this special
> > > release:
> > >
> > > * Create a release branch, and start building release candidates *now*
> > > This would accelerate branch creation compared to the normal process,
> but
> > > would separate the first stable release from other development on the
> > > master branch. This yields to stability and avoids unnecessary churn.
> > >
> >
> > +1 to cutting a release branch now.
> >
> > This sounds compatible with the release process [1] to me, actually. This
> > thread seems like the dev@ thread where we "decide to release" and I
> agree
> > that we should decide to release. Certainly `master` is not ready nor is
> > the web site - there are ~29 issues as I write this though many are not
> > really significant code changes. But we should never wait until `master`
> is
> > "ready".
> >
> > We know what we want to get done, and there are no radical changes, so I
> > think that makes this the right time to branch. We can easily cherry pick
> > fixes for our burndown list to ensure we don't introduce additional
> > blockers.
> >
> > Some of the burndown list are of the form "investigate if this suspected
> > bug still repros" and a release candidate is the perfect thing to use for
> > that.
> >
> > [1] https://beam.apache.org/contribute/release-guide/#decide-to-release
> >
>


Re: [INFO] Build is broken on the archetypes

2017-05-05 Thread Aviem Zur
Looks like this is due to a bug in generate-sources.sh.
Until we fix that bug, you can fix your local directory by running the
following:

rm -rf
sdks/java/maven-archetypes/examples/src/main/resources/archetype-resources/src
rm -rf
sdks/java/maven-archetypes/examples-java8/src/main/resources/archetype-resources/src/

On Fri, May 5, 2017 at 3:45 PM Jean-Baptiste Onofré  wrote:

> Sorry for the noise: we have to do a git clean -d -f to actually remove the
> "old" WriteWindowedFileDoFn.
>
> The build is OK on Jenkins and also on my machine now.
>
> Regards
> JB
>
> On 05/05/2017 07:41 AM, Jean-Baptiste Onofré wrote:
> > Hi guys,
> >
> > due to the last changes on the IOChannelFactory and the Beam
> filesystems, the
> > build is broken on the archetypes:
> >
> > [INFO] [ERROR]
> >
> /home/jbonofre/Workspace/beam/sdks/java/maven-archetypes/examples/target/test-classes/projects/basic/project/basic/src/main/java/it/pkg/common/WriteWindowedFilesDoFn.java:[28,32]
> > cannot find symbol
> > [INFO] [ERROR] symbol:   class IOChannelFactory
> > [INFO] [ERROR] location: package org.apache.beam.sdk.util
> > [INFO] [ERROR] -> [Help 1]
> > [INFO] org.apache.maven.lifecycle.LifecycleExecutionException: Failed to
> execute
> > goal org.apache.maven.plugins:maven-compiler-plugin:3.5.1:compile
> > (default-compile) on project basic: Compilation failure
> > [INFO]
> >
> /home/jbonofre/Workspace/beam/sdks/java/maven-archetypes/examples/target/test-classes/projects/basic/project/basic/src/main/java/it/pkg/common/WriteWindowedFilesDoFn.java:[28,32]
> > cannot find symbol
> > [INFO]   symbol:   class IOChannelFactory
> > [INFO]   location: package org.apache.beam.sdk.util
> >
> > I have a pull request that I will submit in a couple of minutes.
> >
> > Regards
> > JB
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Congratulations Davor!

2017-05-04 Thread Aviem Zur
Congrats Davor! :)

On Thu, May 4, 2017 at 10:42 AM Jean-Baptiste Onofré 
wrote:

> Congrats ! Well deserved ;)
>
> Regards
> JB
>
> On 05/04/2017 09:30 AM, Jason Kuster wrote:
> > Hi all,
> >
> > The ASF has just published a blog post[1] welcoming new members of the
> > Apache Software Foundation, and our own Davor Bonaci is among them!
> > Congratulations and thank you to Davor for all of your work for the Beam
> > community, and the ASF at large. Well deserved.
> >
> > Best,
> >
> > Jason
> >
> > [1] https://blogs.apache.org/foundation/entry/the-apache-sof
> > tware-foundation-welcomes
> >
> > P.S. I dug through the list to make sure I wasn't missing any other Beam
> > community members; if I have, my sincerest apologies and please recognize
> > them on this or a new thread.
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: An Update on Jenkins

2017-04-25 Thread Aviem Zur
Thanks for the update, Jason!

On Wed, Apr 26, 2017 at 6:51 AM Jason Kuster 
wrote:

> Hey folks,
>
> There have been a couple of different issues over the last couple of days
> related to some necessary updates Infra has been working on. We've tracked
> down the last couple of issues, and the latest one seems to be that we're
> being hit by the rate limiter as a result of everything starting back up
> again. They expect that waiting a couple of hours should solve the problem,
> so hopefully by tomorrow things will be back to normal. If not, feel free
> to reply to this thread, and I'll try to keep things up to date with
> status.
>
> Best,
>
> Jason
>
> --
> ---
> Jason Kuster
> Apache Beam / Google Cloud Dataflow
>


Re: Community hackathon

2017-04-25 Thread Aviem Zur
No problem, Sean. Invite sent.

On Tue, Apr 25, 2017 at 6:14 PM Sean Story 
wrote:

> I'd also love to be added to the slack channel
>
>
> Thanks,
>
> Sean Story
>
>
> > On Apr 25, 2017, at 12:54 AM, Davor Bonaci  wrote:
> >
> > Thanks everyone for the enthusiasm!
> >
> > Let's go with this Wednesday, 4/26, starting at 10 AM Pacific time, and
> > running for the following 24 hours. I'll try to seed the
> > instructions/starting point, and then let's take it from there.
> >
> > (Michael, invite sent.)
> >
> > Davor
> >
> > On Mon, Apr 24, 2017 at 7:47 PM, Michael Huston 
> > wrote:
> >
> >> Could you please add me to the Slack channel also? My apologizes for the
> >> noise on this mailing list and if there is a better way to request
> access.
> >>
> >> Cheers,
> >> Michael
> >>
> >> On Mon, Apr 24, 2017 at 6:15 PM, Lukasz Cwik 
> >> wrote:
> >>
> >>> Dylan, sent you invite to slack channel.
> >>>
> >>> On Mon, Apr 24, 2017 at 5:18 PM, Dylan Raithel  >
> >>> wrote:
> >>>
>  Can you please add me to the Slack channel?
> 
>  On Apr 24, 2017 12:51 AM, "Jean-Baptiste Onofré" 
> >>> wrote:
> 
> > That's a wonderful idea !
> >
> > I think the easiest way to organize this event is using the Slack
>  channels
> > to discuss, help each other, and sync together.
> >
> > Regards
> > JB
> >
> > On 04/24/2017 09:48 AM, Davor Bonaci wrote:
> >
> >> We've been working as a community towards the first stable release
> >>> for a
> >> while now, and I think we made a ton of progress across the board
> >> over
>  the
> >> last few weeks.
> >>
> >> We could try to organize a community-wide hackathon to identify and
> >>> fix
> >> those last few issues, as well as to get a better sense of the
> >> overall
> >> project quality as it stands right now.
> >>
> >> This could be a self-organized event, and coordinated via the Slack
> >> channel. For example, we (as a community and participants) can try
> >> out
>  the
> >> project in various ways -- quickstart, examples, different runners,
> >> different platforms -- immediately fixing issues as we run into
> >> them.
> >>> It
> >> could last, say, 24 hours, with people from different time zones
> >> participating at the time of their choosing.
> >>
> >> Thoughts?
> >>
> >> Davor
> >>
> >>
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
> 
> >>>
> >>
>
>


Re: Hanging Jenkins builds.

2017-04-24 Thread Aviem Zur
They did kill a couple of builds that were running, but the builds that
started immediately after suffered the same hanging.
Now it seems builds aren't being started at all for new pushes to PRs. And
invoking 'Retest this please' seems to do nothing as well.

On Mon, Apr 24, 2017 at 8:53 AM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Agree, it's probably due to the update and reconfiguration happening.
>
> Regards
> JB
>
> On 04/24/2017 07:41 AM, Davor Bonaci wrote:
> > Intermittent hanging has existed for several days, maybe up to a week. I think
> it
> > correlates with updates/restarts/reconfiguration/etc. of the Jenkins
> > instance on the infrastructure side. Terminating the build and restarting
> > did work around the problem a few times.
> >
> > On Sat, Apr 22, 2017 at 9:42 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> >
> >> Looks like this might be the cause for the failed build (
> >> https://builds.apache.org/view/Beam/job/beam_PreCommit_
> >> Java_MavenInstall/9927/console
> >> ):
> >>
> >> /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_
> >> MavenInstall/sdks/java/core/src/main/java/org/apache/beam/
> >> sdk/transforms/CombineFns.java:145:
> >> error: reference notfound
> >>* See {@link #compose()} or {@link #composeKeyed()}) for details.
> >>  ^
> >> /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_
> >> MavenInstall/sdks/java/core/src/main/java/org/apache/beam/
> >> sdk/transforms/CombineFns.java:147:
> >> warning - Tag @link:can't find composeKeyed() in
> >> org.apache.beam.sdk.transforms.CombineFns.CoCombineResult
> >>
> >> FYI
> >>
> >> On Fri, Apr 21, 2017 at 11:17 PM, Aviem Zur <aviem...@gmail.com> wrote:
> >>
> >>> Hi all,
> >>>
> >>> Please be aware that Beam builds (precommit + postcommit validations)
> are
> >>> hanging since a few hours ago.
> >>>
> >>> This seems to be a problem in builds of other projects as well (for
> >>> example, Kafka).
> >>>
> >>> I've opened an INFRA ticket:
> >>> https://issues.apache.org/jira/browse/INFRA-13949
> >>>
> >>
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Hanging Jenkins builds.

2017-04-22 Thread Aviem Zur
Hi all,

Please be aware that Beam builds (precommit + postcommit validations) are
hanging since a few hours ago.

This seems to be a problem in builds of other projects as well (for
example, Kafka).

I've opened an INFRA ticket:
https://issues.apache.org/jira/browse/INFRA-13949


Re: [DISCUSSION] PAssert success/failure count validation for all runners

2017-04-18 Thread Aviem Zur
in a dummy value
> to
> >>> > reduce the risk that the verifier transform never runs.
> >>> > 3. Stable URNs for the assertion and verifier transforms so a runner
> >>> has a
> >>> > good chance to wire custom implementations, if it helps.
> >>> >
> >>> > I think someone mentioned it earlier, but these also work better with
> >>> > metrics that overcount, since it is now about covering the verifier
> >>> > transforms rather than an absolute number of successes.
> >>> >
> >>> > Kenn
> >>> >
> >>> >
> >>> >>> On 7. Apr 2017, at 12:42, Kenneth Knowles <k...@google.com.INVALID>
> >>> >> wrote:
> >>> >>>
> >>> >>> We also have a design that improves the signal even without
> metrics,
> >>> so
> >>> >> I'm
> >>> >>> pretty happy with this.
> >>> >>>
> >>> >>> On Fri, Apr 7, 2017 at 12:12 PM, Lukasz Cwik
> >>> <lc...@google.com.invalid>
> >>> >>> wrote:
> >>> >>>
> >>> >>>> I like the usage of metrics since it doesn't depend on external
> >>> >> resources.
> >>> >>>> I believe there could be some small amount of code shared between
> >>> >> runners
> >>> >>>> for the PAssert metric verification.
> >>> >>>>
> >>> >>>> I would say that PAssert by itself and PAssert with metrics are
> two
> >>> >> levels
> >>> >>>> of testing available. For runners that don't support metrics than
> >>> >> PAssert
> >>> >>>> gives a signal (albeit weaker one) and ones that do support
> metrics
> >>> will
> >>> >>>> have a stronger signal for execution correctness.
> >>> >>>>
> >>> >>>> On Fri, Apr 7, 2017 at 11:59 AM, Aviem Zur <aviem...@gmail.com>
> >>> wrote:
> >>> >>>>
> >>> >>>>> Currently, PAssert assertions may not happen and tests will pass
> >>> while
> >>> >>>>> silently hiding issues.
> >>> >>>>>
> >>> >>>>> Up until now, several runners have implemented an assertion that
> >>> the
> >>> >>>> number
> >>> >>>>> of expected successful assertions have actually happened, and
> that
> >>> no
> >>> >>>>> failed assertions have happened. (runners which check this are
> >>> Dataflow
> >>> >>>>> runner and Spark runner).
> >>> >>>>>
> >>> >>>>> This has been valuable in the past to find bugs which were hidden
> >>> by
> >>> >>>>> passing tests.
> >>> >>>>>
> >>> >>>>> The work to repeat this in https://issues.apache.org/
> >>> >>>> jira/browse/BEAM-1726
> >>> >>>>> has
> >>> >>>>> surfaced bugs in the Flink runner that were also hidden by
> passing
> >>> >> tests.
> >>> >>>>> However, with the removal of aggregators in
> >>> >>>>> https://issues.apache.org/jira/browse/BEAM-1148 this ticket will
> >>> be
> >>> >>>> harder
> >>> >>>>> to implement, since Flink runner does not support metrics.
> >>> >>>>>
> >>> >>>>> I believe that validating that runners do in fact support Beam
> >>> model
> >>> >> is a
> >>> >>>>> blocker for first stable release. (BEAM-1726 was also marked as a
> >>> >> blocker
> >>> >>>>> for Flink runner).
> >>> >>>>>
> >>> >>>>> I think we have one of 2 choices here:
> >>> >>>>> 1. Keep implementing this for each runner separately.
> >>> >>>>> 2. Implement this in a runner agnostic way (For runners which
> >>> support
> >>> >>>>> metrics - use metrics, for those that do not use a fallback
> >>> >>>> implementation,
> >>> >>>>> perhaps using files or some other method). This should be covered
> >>> by
> >>> >> the
> >>> >>>>> following ticket:
> https://issues.apache.org/jira/browse/BEAM-1763
> >>> >>>>>
> >>> >>>>> Thoughts?
> >>> >>>>>
> >>> >>>>
> >>> >>
> >>> >>
> >>>
> >>>
> >>
> >
>


Re: Pipeline termination in the unified Beam model

2017-04-16 Thread Aviem Zur
+1

To help integrate this we can start by adding `ValidatesRunner` tests with
a new category, running them only with runners which adhere to the rules
mentioned, and eventually with all runners.

On Fri, Mar 3, 2017 at 12:46 AM Amit Sela  wrote:

> +1 on Eugene's words - this shows how batch is conceptually a subset of a
> streaming problem.
> I also believe that Stas has a very good point on education - we have to
> try and understand developer's current perspective and try to make the
> transition to the Beam model as natural as possible for new users.
> In addition to good documentation and examples, I think that
> https://issues.apache.org/jira/browse/BEAM-849 is critical, as this is the
> user's end-point to the behaviours discussed here, and so it should be:
> * clear and concise - pipeline state at any point should be informative.
> * well documented - documentation, examples, and use-cases (e.g., Eugene's
> "poison pill").
> * strict API for runners - joining Stas' not on unified implementation for
> portability.
>
> On Thu, Mar 2, 2017 at 8:49 PM Eugene Kirpichov
>  wrote:
>
> > OK, I'm glad everybody is in agreement on this. I raised this point
> because
> > we've been discussing implementing this behavior in the Dataflow
> streaming
> > runner, and I wanted to make sure that people are okay with it from a
> > conceptual point of view before proceeding.
> >
> > On Thu, Mar 2, 2017 at 10:27 AM Kenneth Knowles 
> > wrote:
> >
> > Isn't this already the case? I think semantically it is an unavoidable
> > conclusion, so certainly +1 to that.
> >
> > The DirectRunner and TestDataflowRunner both have this behavior already.
> > I've always considered that a streaming job running forever is just
> [very]
> > suboptimal shutdown latency :-)
> >
> > Some bits of the discussion on the ticket seem to surround whether or how
> > to communicate this property in a generic way. Since a runner owns its
> > PipelineResult it doesn't seem necessary.
> >
> > So is the bottom line just that you want to more strongly insist that
> > runners really terminate in a timely manner? I'm +1 to that, too, for
> > basically the reason Stas gives: In order to easily programmatically
> > orchestrate Beam pipelines in a portable way, you do need to know whether
> > the pipeline will finish without thinking about the specific runner and
> its
> > options (as with our RunnableOnService tests).
> >
> > Kenn
> >
> > On Thu, Mar 2, 2017 at 9:09 AM, Dan Halperin  >
> > wrote:
> >
> > > Note that even "unbounded pipeline in a streaming
> > runner".waitUntilFinish()
> > > can return, e.g., if you cancel it or terminate it. It's totally
> > reasonable
> > > for users to want to understand and handle these cases.
> > >
> > > +1
> > >
> > > Dan
> > >
> > > On Thu, Mar 2, 2017 at 2:53 AM, Jean-Baptiste Onofré 
> > > wrote:
> > >
> > > > +1
> > > >
> > > > Good idea !!
> > > >
> > > > Regards
> > > > JB
> > > >
> > > >
> > > > On 03/02/2017 02:54 AM, Eugene Kirpichov wrote:
> > > >
> > > >> Raising this onto the mailing list from
> > > >> https://issues.apache.org/jira/browse/BEAM-849
> > > >>
> > > >> The issue came up: what does it mean for a pipeline to finish, in
> the
> > > Beam
> > > >> model?
> > > >>
> > > >> Note that I am deliberately not talking about "batch" and
> "streaming"
> > > >> pipelines, because this distinction does not exist in the model.
> > Several
> > > >> runners have batch/streaming *modes*, which implement the same
> > semantics
> > > >> (potentially different subsets: in batch mode typically a runner
> will
> > > >> reject pipelines that have at least one unbounded PCollection) but
> in
> > an
> > > >> operationally different way. However we should define pipeline
> > > termination
> > > >> at the level of the unified model, and then make sure that all
> runners
> > > in
> > > >> all modes implement that properly.
> > > >>
> > > >> One natural way is to say "a pipeline terminates when the output
> > > >> watermarks
> > > >> of all of its PCollection's progress to +infinity". (Note: this can
> be
> > > >> generalized, I guess, to having partial executions of a pipeline: if
> > > >> you're
> > > >> interested in the full contents of only some collections, then you
> > wait
> > > >> until only the watermarks of those collections progress to infinity)
> > > >>
> > > >> A typical "batch" runner mode does not implement watermarks - we can
> > > think
> > > >> of it as assigning watermark -infinity to an output of a transform
> > that
> > > >> hasn't started executing yet, and +infinity to output of a transform
> > > that
> > > >> has finished executing. This is consistent with how such runners
> > > implement
> > > >> termination in practice.
> > > >>
> > > >> Dataflow streaming runner additionally implements such termination
> for
> > > >> pipeline drain operation: it has 2 parts: 1) stop consuming 
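
The termination rule proposed above — a pipeline terminates when the output watermarks of all its PCollections progress to +infinity — can be sketched as a toy model. Names here are illustrative, not Beam API; it also models the batch convention of -infinity for not-yet-run transforms and +infinity for finished ones:

```java
import java.util.Arrays;
import java.util.List;

// Toy model of the termination rule: a pipeline is "done" exactly when the
// output watermark of every PCollection has advanced to +infinity.
final class WatermarkTermination {
  static final long POSITIVE_INFINITY = Long.MAX_VALUE;
  static final long NEGATIVE_INFINITY = Long.MIN_VALUE;

  static boolean isTerminated(List<Long> outputWatermarks) {
    return outputWatermarks.stream().allMatch(w -> w == POSITIVE_INFINITY);
  }

  public static void main(String[] args) {
    // A "batch" runner: transforms not yet run sit at -infinity, finished
    // ones at +infinity.
    List<Long> running = Arrays.asList(NEGATIVE_INFINITY, POSITIVE_INFINITY);
    List<Long> finished = Arrays.asList(POSITIVE_INFINITY, POSITIVE_INFINITY);
    System.out.println(isTerminated(running));  // false
    System.out.println(isTerminated(finished)); // true
  }
}
```

Under this model, waitUntilFinish() returning is simply the observation that every watermark has reached +infinity, whether the runner is in batch or streaming mode.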

Re: Renaming SideOutput

2017-04-11 Thread Aviem Zur
+1

On Wed, Apr 12, 2017 at 6:06 AM JingsongLee  wrote:

> strong +1
> best,
> JingsongLee
> --
> From: Tang Jijun (Shanghai, Tech Dept, Data Platform)
> Time: 2017 Apr 12 (Wed) 10:39
> To: dev@beam.apache.org
> Subject: Re: Renaming SideOutput
> +1 more clearer
>
>
> -----Original Message-----
> From: Ankur Chauhan [mailto:an...@malloc64.com]
> Sent: April 12, 2017 10:36
> To: dev@beam.apache.org
> Subject: Re: Renaming SideOutput
>
>
> +1 this is pretty much the topmost thing that I found odd when starting with
> the Beam model. It would definitely be more intuitive to have a consistent
> name.
>
> Sent from my iPhone
>
> > On Apr 11, 2017, at 18:29, Aljoscha Krettek  wrote:
> >
> > +1
> >
> >> On Wed, Apr 12, 2017, at 02:34, Thomas Groh wrote:
> >> I think that's a good idea. I would call the outputs of a ParDo the
> >> "Main Output" and "Additional Outputs" - it seems like an easy way to
> >> make it clear that there's one output that is always expected, and
> >> there may be more.
> >>
> >> On Tue, Apr 11, 2017 at 5:29 PM, Robert Bradshaw <
> >> rober...@google.com.invalid> wrote:
> >>
> >>> We should do some renaming in Python too. Right now we have
> >>> SideOutputValue which I'd propose naming TaggedOutput or something
> >>> like that.
> >>>
> >>> Should the docs change too?
> >>> https://beam.apache.org/documentation/programming-guide/#transforms-
> >>> sideio
> >>>
> >>> On Tue, Apr 11, 2017 at 5:25 PM, Kenneth Knowles
> >>> 
> >>> wrote:
>  +1 ditto about sideInput and sideOutput not actually being related
> 
>  On Tue, Apr 11, 2017 at 3:52 PM, Robert Bradshaw <
>  rober...@google.com.invalid> wrote:
> 
> > +1, I think this is a lot clearer.
> >
> > On Tue, Apr 11, 2017 at 2:24 PM, Stephen Sisk
> > 
> > wrote:
> >> strong +1 for changing the name away from sideOutput - the fact
> >> that sideInput and sideOutput are not really related was
> >> definitely a
> >>> source
> > of
> >> confusion for me when learning beam.
> >>
> >> S
> >>
> >> On Tue, Apr 11, 2017 at 1:56 PM Thomas Groh
> >>  
> >> wrote:
> >>
> >>> Hey everyone:
> >>>
> >>> I'd like to rename DoFn.Context#sideOutput to #output (in the
> >>> Java
> >>> SDK).
> >>>
> >>> Having two methods, both named output, one which takes the "main
> >>> output
> >>> type" and one that takes a tag to specify the type more clearly
> >>> communicates the actual behavior - sideOutput isn't a "special"
> >>> way
> >>> to
>
> >>> output, it's the same as output(T), just to a specified PCollection.
> > This
> >>> will help pipeline authors understand the actual behavior of
> >>> outputting
> > to
> >>> a tag, and detangle it from "sideInput", which is a special way
> >>> to
> > receive
> >>> input. Giving them the same name means that it's not even
> >>> strange to
> > call
> >>> output and provide the main output type, which is what we want -
> >>> it's a
> >>> more specific way to output, but does not have different
> >>> restrictions or
> >>> capabilities.
> >>>
> >>> This is also a pretty small change within the SDK - it touches
> >>> about
> >>> 20
> >>> files, and the changes are pretty automatic.
> >>>
> >>> Thanks,
> >>>
> >>> Thomas
> >>>
> >
> >>>
>


[PROPOSAL] Standard IO Metrics

2017-04-08 Thread Aviem Zur
Hi all,

We are currently in the process of introducing IO metrics to Beam.

Questions have been raised as to what the metrics names should be, and if
they should be standard across different IOs.

I've written this up as a proposal found here:
https://s.apache.org/standard-io-metrics

As usual, this document is commentable, please go over it and make comments
where appropriate.


Re: Combine.Global

2017-04-07 Thread Aviem Zur
I wasn't able to reproduce the issue you're experiencing.
I've created a gist with an example that works and is similar to what you
have described.
Please help us make tweaks to the gist reproduce your problem:
https://gist.github.com/aviemzur/ba213d98b4484492099b3cf709ddded0

On Fri, Apr 7, 2017 at 7:25 PM Paul Gerver <pfger...@gmail.com> wrote:

> Yes, the pipeline is quite small:
>
> pipeline.apply("source",
> Read.from(new CustomSource())).setCoder(CustomSource.coder)
> .apply("GlobalCombine", Combine.globally(new
> CustomCombineFn())).setCoder(CustomTuple.coder);
>
>
> The InputT is not the same as OutputT, so the input coder can't be used.
>
> On 2017-04-07 08:58 (-0500), Aviem Zur <aviem...@gmail.com> wrote:
> > Have you set the coder for your input PCollection? The one on which you
> > perform the Combine?
> >
> > On Fri, Apr 7, 2017 at 4:24 PM Paul Gerver <pfger...@gmail.com> wrote:
> >
> > > Hello All,
> > >
> > > I'm trying to test out a Combine.Globally transform which takes in a
> small
> > > custom class (CustomA) and outputs a secondary custom class (CustomB).
> I
> > > have set the coder for the resulting PCollection, but Beam is
> > > arguing that a coder for a KV type is missing (see output at bottom).
> > >
> > > Since this a global combine, the input nor the output is of KV type,
> so I
> > > decided to take a look at the Combine code. Since
> Combine.Globally.expand()
> > > performs a perKeys and groupedValues underneath the covers, but
> requires
> > > making an intermediate PCollection KV<Void, OutputT> which--according
> to
> > > the docs--is inferred from the CombineFn.
> > >
> > > I believe I could workaround this by registering a KvCoder with the
> > > CoderRegistry, but that's not intuitive. Is there a better way to
> address
> > > this currently, or should something be added to the CombineFn area for
> > > setting an output coder similar to PCollection.
> > >
> > >
> > > Output:
> > > Exception in thread "main" java.lang.IllegalStateException: Unable to
> > > return a default Coder for
> > >
> > >
> GlobalCombine/Combine.perKey(CustomTuple)/Combine.GroupedValues/ParDo(Anonymous).out
> > > [Class]. Correct one of the following root causes:
> > >   No Coder has been manually specified;  you may do so using
> .setCoder().
> > >   Inferring a Coder from the CoderRegistry failed: Unable to provide a
> > > default Coder for org.apache.beam.sdk.values.KV<K, OutputT>. Correct
> one of
> > > the following root causes:
> > >   Building a Coder using a registered CoderFactory failed: Cannot
> provide
> > > coder for parameterized type org.apache.beam.sdk.values.KV<K, OutputT>:
> > > Unable to provide a default Coder for java.lang.Object. Correct one of
> the
> > > following root causes:
> > >
> > >
> > > Stack:
> > > at
> > >
> > >
> org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkState(Preconditions.java:174)
> > > at
> > > org.apache.beam.sdk.values.TypedPValue.getCoder(TypedPValue.java:51)
> > > at
> > > org.apache.beam.sdk.values.PCollection.getCoder(PCollection.java:130)
> > > at
> > >
> > >
> org.apache.beam.sdk.values.TypedPValue.finishSpecifying(TypedPValue.java:90)
> > > at
> > >
> > >
> org.apache.beam.sdk.runners.TransformHierarchy.finishSpecifyingInput(TransformHierarchy.java:95)
> > > at
> org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:386)
> > > at
> org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:302)
> > > at
> > > org.apache.beam.sdk.values.PCollection.apply(PCollection.java:154)
> > > at
> > >
> org.apache.beam.sdk.transforms.Combine$Globally.expand(Combine.java:1460)
> > > at
> > >
> org.apache.beam.sdk.transforms.Combine$Globally.expand(Combine.java:1337)
> > > at
> > >
> org.apache.beam.sdk.runners.PipelineRunner.apply(PipelineRunner.java:76)
> > > at
> > >
> org.apache.beam.runners.direct.DirectRunner.apply(DirectRunner.java:296)
> > > at
> org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:388)
> > > at
> org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:318)
> > > at
> > > org.apache.beam.sdk.values.PCollection.apply(PCollection.java:167)
> > > at
> > > org.iastate.edu.CombineTestPipeline.main(CombineTestPipeline.java:110)
> > >
> > >
> > > Let me know. Thanks!
> > > -Paul G
> > >
> > > --
> > > *Paul Gerver*
> > > pfger...@gmail.com
> > >
> >
>


[DISCUSSION] PAssert success/failure count validation for all runners

2017-04-07 Thread Aviem Zur
Currently, PAssert assertions may not happen and tests will pass while
silently hiding issues.

Up until now, several runners have implemented a check that the expected
number of successful assertions has actually happened, and that no failed
assertions have occurred (the runners which check this are the Dataflow
runner and the Spark runner).

This has been valuable in the past to find bugs which were hidden by
passing tests.

The work to repeat this in https://issues.apache.org/jira/browse/BEAM-1726 has
surfaced bugs in the Flink runner that were also hidden by passing tests.
However, with the removal of aggregators in
https://issues.apache.org/jira/browse/BEAM-1148 this ticket will be harder
to implement, since Flink runner does not support metrics.

I believe that validating that runners do in fact support Beam model is a
blocker for first stable release. (BEAM-1726 was also marked as a blocker
for Flink runner).

I think we have one of 2 choices here:
1. Keep implementing this for each runner separately.
2. Implement this in a runner-agnostic way (for runners which support
metrics, use metrics; for those that do not, use a fallback implementation,
perhaps using files or some other method). This should be covered by the
following ticket: https://issues.apache.org/jira/browse/BEAM-1763

Thoughts?
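To make option 2 concrete, here is a rough, non-compilable sketch of how a
runner-agnostic check could look with the metrics API (the counter names and
surrounding code are hypothetical illustrations, not the actual PAssert
implementation; runners without metrics support would still need a fallback):

```java
// Inside PAssert's verification DoFn (sketch only):
Counter successes = Metrics.counter(PAssert.class, "PAssertSuccess");
Counter failures  = Metrics.counter(PAssert.class, "PAssertFailure");
// ... increment successes/failures as each assertion executes ...

// After pipeline.run() in the test harness (sketch only):
MetricQueryResults metrics = result.metrics().queryMetrics(
    MetricsFilter.builder()
        .addNameFilter(MetricNameFilter.inNamespace(PAssert.class))
        .build());
// Fail the test if the success count is lower than the number of PAssert
// transforms in the pipeline, or if any failure was counted.
```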


Re: Combine.Global

2017-04-07 Thread Aviem Zur
Have you set the coder for your input PCollection? The one on which you
perform the Combine?
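For reference, a minimal non-compilable sketch of setting coders explicitly
around the Combine (class and CombineFn names are placeholders from this
thread, and SerializableCoder assumes the custom classes implement
Serializable):

```java
PCollection<CustomA> input =
    pipeline
        .apply(Create.of(customAValues))
        .setCoder(SerializableCoder.of(CustomA.class));  // coder for the input

PCollection<CustomB> combined =
    input
        .apply(Combine.globally(new CustomCombineFn()))  // CombineFn<CustomA, ?, CustomB>
        .setCoder(SerializableCoder.of(CustomB.class));  // coder for the output
```

If the intermediate KV coder still cannot be inferred, registering a coder in
the CoderRegistry (as noted in the original message) remains the fallback.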

On Fri, Apr 7, 2017 at 4:24 PM Paul Gerver  wrote:

> Hello All,
>
> I'm trying to test out a Combine.Globally transform which takes in a small
> custom class (CustomA) and outputs a secondary custom class (CustomB). I
> have set the coder for the resulting PCollection<CustomB>, but Beam is
> complaining that a coder for a KV type is missing (see output at bottom).
>
> Since this is a global combine, neither the input nor the output is of KV
> type, so I decided to take a look at the Combine code.
> Combine.Globally.expand() performs a perKey and groupedValues underneath the
> covers, which requires making an intermediate PCollection<KV<K, OutputT>>
> whose coder--according to the docs--is inferred from the CombineFn.
>
> I believe I could work around this by registering a KvCoder with the
> CoderRegistry, but that's not intuitive. Is there a better way to address
> this currently, or should something be added to the CombineFn area for
> setting an output coder, similar to PCollection?
>
>
> Output:
> Exception in thread "main" java.lang.IllegalStateException: Unable to
> return a default Coder for
>
> GlobalCombine/Combine.perKey(CustomTuple)/Combine.GroupedValues/ParDo(Anonymous).out
> [Class]. Correct one of the following root causes:
>   No Coder has been manually specified;  you may do so using .setCoder().
>   Inferring a Coder from the CoderRegistry failed: Unable to provide a
> default Coder for org.apache.beam.sdk.values.KV<K, OutputT>. Correct one of
> the following root causes:
>   Building a Coder using a registered CoderFactory failed: Cannot provide
> coder for parameterized type org.apache.beam.sdk.values.KV<K, OutputT>:
> Unable to provide a default Coder for java.lang.Object. Correct one of the
> following root causes:
>
>
> Stack:
> at
>
> org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkState(Preconditions.java:174)
> at
> org.apache.beam.sdk.values.TypedPValue.getCoder(TypedPValue.java:51)
> at
> org.apache.beam.sdk.values.PCollection.getCoder(PCollection.java:130)
> at
>
> org.apache.beam.sdk.values.TypedPValue.finishSpecifying(TypedPValue.java:90)
> at
>
> org.apache.beam.sdk.runners.TransformHierarchy.finishSpecifyingInput(TransformHierarchy.java:95)
> at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:386)
> at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:302)
> at
> org.apache.beam.sdk.values.PCollection.apply(PCollection.java:154)
> at
> org.apache.beam.sdk.transforms.Combine$Globally.expand(Combine.java:1460)
> at
> org.apache.beam.sdk.transforms.Combine$Globally.expand(Combine.java:1337)
> at
> org.apache.beam.sdk.runners.PipelineRunner.apply(PipelineRunner.java:76)
> at
> org.apache.beam.runners.direct.DirectRunner.apply(DirectRunner.java:296)
> at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:388)
> at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:318)
> at
> org.apache.beam.sdk.values.PCollection.apply(PCollection.java:167)
> at
> org.iastate.edu.CombineTestPipeline.main(CombineTestPipeline.java:110)
>
>
> Let me know. Thanks!
> -Paul G
>
> --
> *Paul Gerver*
> pfger...@gmail.com
>


Re: [DISCUSSION] Consistent use of loggers

2017-04-06 Thread Aviem Zur
> IMO I don't think the DirectRunner should depend directly on any specific
> logging backend (at least, not in the compile or runtime scopes). I think it
> should depend on JUL in the test scope, so that there are logs when
> executing DirectRunner tests.
> My reasoning: I can see in any binary version of Beam that the SDK, the
> DirectRunner, and 1 or more other runners will all be on the classpath.
> Ideally this should work regardless of whatever other runner is used;
> presumably the DirectRunner would "automagically" pick up the logging config
> of the other runner.
That sounds like a very plausible scenario, and this would "protect" the
runner's binding from an intruding binding from the direct runner, since the
direct runner would have no binding of its own.
However, there is also the scenario where a user runs the examples using the
direct runner as their first interaction with Beam and sees no logs
whatsoever; they would have to add a binding themselves.
We could solve this by adding a binding in the 'direct-runner' profile in the
examples module and the maven archetypes (and allow only one runner profile
to be specified at a time, in case their logger bindings clash).

> I like the use of slf4j as it enables lots of publishers of logs, but I
> don't want to supply a default/required consumer of logs because that will
> restrict use cases in the future...
I agree. Forcing a log4j binding might give the user a false sense that "all
runners use log4j", which isn't true today (e.g. for the Dataflow runner),
and we can't assure that future runners will support it.

So it seems we're left with:
1) Add documentation around logging in each runner.
2) Consider enabling a binding (JUL) for direct runner profile in examples
module and maven archetypes.
3) Allow only one runner profile to be active at a time in examples and
maven archetypes as their logger binding might clash.

Thoughts?
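For context on point 2, here is a minimal JDK-only sketch of what enabling a
JUL (java.util.logging) binding could mean on the user side: programmatically
routing all JUL output to the console so direct-runner logs are visible. The
class name and logger names are illustrative, not part of Beam:

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class JulConfigExample {
    /** Route all JUL output at INFO and above to a single console handler. */
    public static Logger configureRootLogger() {
        Logger root = Logger.getLogger("");      // "" is the JUL root logger
        for (Handler h : root.getHandlers()) {
            root.removeHandler(h);               // drop the default handlers
        }
        ConsoleHandler console = new ConsoleHandler();
        console.setLevel(Level.INFO);
        root.addHandler(console);
        root.setLevel(Level.INFO);
        return root;
    }

    public static void main(String[] args) {
        configureRootLogger();
        // With an slf4j-jdk14 binding on the classpath, SLF4J logs from the
        // pipeline would flow through this configuration.
        Logger.getLogger("org.apache.beam.example").info("pipeline starting");
    }
}
```

With an slf4j-jdk14 binding this kind of setup gives first-time users visible
logs without any extra configuration files.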

On Tue, Apr 4, 2017 at 8:51 AM Dan Halperin <dhalp...@google.com.invalid>
wrote:

> At this point, I'm a little unclear on what is the proposal. Can you
> refresh a simplified/aggregated view after this conversation?
>
> IMO I don't think the DirectRunner should depend directly on any specific
> logging backend (at least, not in the compile or runtime scopes). I think
> it should depend on JUL in the test scope, so that there are logs when
> executing DirectRunner tests.
>
> My reasoning: I can see in any binary version of Beam that the SDK, the
> DirectRunner, and 1 or more other runners will all be on the classpath.
> Ideally this should work regardless of whatever other runner is used;
> presumably the DirectRunner would "automagically" pick up the logging
> config of the other runner.
>
> I like the use of slf4j as it enables lots of publishers of logs, but I
> don't want to supply a default/required consumer of logs because that will
> restrict use cases in the future...
>
> On Mon, Apr 3, 2017 at 8:14 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
> > Fair enough. +1 especially for the documentation.
> >
> > Regards
> > JB
> >
> >
> > On 04/03/2017 08:48 PM, Aviem Zur wrote:
> >
> >> Upon further inspection there seems to be an issue we may have
> overlooked:
> >> In cluster mode, some of the runners will have dependencies added
> directly
> >> to the classpath by the cluster, and since SLF4J can only work with one
> >> binding, the first one in the classpath will be used.
> >>
> >> So while what we suggested would work in local mode, the user's chosen
> >> binding and configuration might be ignored in cluster mode, which is
> >> detrimental to what we wanted to accomplish.
> >>
> >> So I believe what we should do instead is:
> >>
> >>1. Add better documentation regarding logging in each runner, which
> >>binding is used, perhaps examples of how to configure logging for
> that
> >>runner.
> >>2. Have direct runner use the most common binding among runners (this
> >>appears to be log4j which is used by Spark runner, Flink runner and
> >> Apex
> >>runner).
> >>
> >>
> >> On Mon, Apr 3, 2017 at 7:02 PM Aljoscha Krettek <aljos...@apache.org>
> >> wrote:
> >>
> >> Yes, I think we can exclude log4j from the Flink dependencies. It’s
> >>> somewhat annoying that they are there in the first place.
> >>>
> >>> The Flink doc has this to say about the topic:
> >>> https://ci.apache.org/projects/flink/flink-docs-release-1.3/
> >>> monitoring/logging.html
> >>>
> >>>> On 3. Apr 2017, at 17:56, Aviem Zur <aviem...@gmail.c

Re: [DISCUSSION] Consistent use of loggers

2017-04-03 Thread Aviem Zur
Upon further inspection there seems to be an issue we may have overlooked:
In cluster mode, some of the runners will have dependencies added directly
to the classpath by the cluster, and since SLF4J can only work with one
binding, the first one in the classpath will be used.

So while what we suggested would work in local mode, the user's chosen
binding and configuration might be ignored in cluster mode, which is
detrimental to what we wanted to accomplish.

So I believe what we should do instead is:

   1. Add better documentation regarding logging in each runner, which
   binding is used, perhaps examples of how to configure logging for that
   runner.
   2. Have direct runner use the most common binding among runners (this
   appears to be log4j which is used by Spark runner, Flink runner and Apex
   runner).


On Mon, Apr 3, 2017 at 7:02 PM Aljoscha Krettek <aljos...@apache.org> wrote:

> Yes, I think we can exclude log4j from the Flink dependencies. It’s
> somewhat annoying that they are there in the first place.
>
> The Flink doc has this to say about the topic:
> https://ci.apache.org/projects/flink/flink-docs-release-1.3/monitoring/logging.html
> > On 3. Apr 2017, at 17:56, Aviem Zur <aviem...@gmail.com> wrote:
> >
> >> * java.util.logging could be a good choice for the Direct Runner
> > Yes, this will be great for users (Instead of having no logging when
> using
> > direct runner).
> >
> >> * Logging backend could be runner-specific, particularly if it needs to
> >> integrate into some other experience
> > Good point, let's take a look at the current state of runners:
> > Direct runner - will use JUL as suggested.
> > Dataflow runner - looks like there is already no binding (There is a
> > binding in tests only).
> > Spark runner - currently uses slf4j-log4j12; it does not require any
> > specific logger, so we can change this to no binding.
> > Flink runner - uses slf4j-log4j12 transitively from Flink dependencies.
> I'm
> > assuming this is not a must and we can default to no binding here.
> > @aljoscha please confirm.
> > Apex runner - uses slf4j-log4j12 transitively from Apex dependencies. I'm
> > assuming this is not a must and we can default to no binding here. @thw
> > please confirm.
> >
> > It might be a good idea to use a consistent binding in tests (Since we'll
> > use JUL for direct runner, let this be JUL).
> >
> > On Wed, Mar 29, 2017 at 7:23 PM Davor Bonaci <da...@apache.org> wrote:
> >
> > +1 on consistency across Beam modules on the logging facade
> > +1 on enforcing consistency
> > +1 on clearly documenting how to do logging
> >
> > Mixed feelings:
> > * Logging backend could be runner-specific, particularly if it needs to
> > integrate into some other experience
> > * java.util.logging could be a good choice for the Direct Runner
> >
> > On Tue, Mar 28, 2017 at 6:50 PM, Ahmet Altay <al...@google.com.invalid>
> > wrote:
> >
> >> On Wed, Mar 22, 2017 at 10:38 AM, Tibor Kiss <tk...@hortonworks.com>
> >> wrote:
> >>
> >>> This is a great idea!
> >>>
> >>> I believe Python-SDK's logging could also be enhanced (a bit
> >> differently):
> >>> Currently we are not instantiating the logger, just using the class
> >>> that the logging package provides.
> >>> Shortcoming of this approach is that the user cannot set the log level
> > on
> >>> a per module basis as all log messages
> >>> end up in the root level.
> >>>
> >>
> >> +1 to this. Python SDK needs to expand its logging capabilities. Filed
> > [1]
> >> for this.
> >>
> >> Ahmet
> >>
> >> [1] https://issues.apache.org/jira/browse/BEAM-1825
> >>
> >>
> >>>
> >>> On 3/22/17, 5:46 AM, "Aviem Zur" <aviem...@gmail.com> wrote:
> >>>
> >>>+1 to what JB said.
> >>>
> >>>Will just have to be documented well as if we provide no binding
> >> there
> >>> will
> >>>be no logging out of the box unless the user adds a binding.
> >>>
> >>>On Wed, Mar 22, 2017 at 6:24 AM Jean-Baptiste Onofré <
> >> j...@nanthrax.net>
> >>>wrote:
> >>>
> >>>> Hi Aviem,
> >>>>
> >>>> Good point.
> >>>>
> >>>> I think, in our dependencies set, we should just depend to
> >> slf4j-api
> >>> and
> >>>> let the
> >>>> user provi

Re: [DISCUSSION] Consistent use of loggers

2017-04-03 Thread Aviem Zur
>* java.util.logging could be a good choice for the Direct Runner
Yes, this will be great for users (Instead of having no logging when using
direct runner).

>* Logging backend could be runner-specific, particularly if it needs to
>integrate into some other experience
Good point, let's take a look at the current state of runners:
Direct runner - will use JUL as suggested.
Dataflow runner - looks like there is already no binding (There is a
binding in tests only).
Spark runner - currently uses slf4j-log4j12; it does not require any
specific logger, so we can change this to no binding.
Flink runner - uses slf4j-log4j12 transitively from Flink dependencies. I'm
assuming this is not a must and we can default to no binding here.
@aljoscha please confirm.
Apex runner - uses slf4j-log4j12 transitively from Apex dependencies. I'm
assuming this is not a must and we can default to no binding here. @thw
please confirm.

It might be a good idea to use a consistent binding in tests (Since we'll
use JUL for direct runner, let this be JUL).

On Wed, Mar 29, 2017 at 7:23 PM Davor Bonaci <da...@apache.org> wrote:

+1 on consistency across Beam modules on the logging facade
+1 on enforcing consistency
+1 on clearly documenting how to do logging

Mixed feelings:
* Logging backend could be runner-specific, particularly if it needs to
integrate into some other experience
* java.util.logging could be a good choice for the Direct Runner

On Tue, Mar 28, 2017 at 6:50 PM, Ahmet Altay <al...@google.com.invalid>
wrote:

> On Wed, Mar 22, 2017 at 10:38 AM, Tibor Kiss <tk...@hortonworks.com>
> wrote:
>
> > This is a great idea!
> >
> > I believe Python-SDK's logging could also be enhanced (a bit
> differently):
> > Currently we are not instantiating the logger, just using the class that
> > the logging package provides.
> > Shortcoming of this approach is that the user cannot set the log level
on
> > a per module basis as all log messages
> > end up in the root level.
> >
>
> +1 to this. Python SDK needs to expand its logging capabilities. Filed
[1]
> for this.
>
> Ahmet
>
> [1] https://issues.apache.org/jira/browse/BEAM-1825
>
>
> >
> > On 3/22/17, 5:46 AM, "Aviem Zur" <aviem...@gmail.com> wrote:
> >
> > +1 to what JB said.
> >
> > Will just have to be documented well as if we provide no binding
> there
> > will
> > be no logging out of the box unless the user adds a binding.
> >
> > On Wed, Mar 22, 2017 at 6:24 AM Jean-Baptiste Onofré <
> j...@nanthrax.net>
> > wrote:
> >
> > > Hi Aviem,
> > >
> > > Good point.
> > >
> > > I think, in our dependencies set, we should just depend to
> slf4j-api
> > and
> > > let the
> > > user provides the binding he wants (slf4j-log4j12, slf4j-simple,
> > whatever).
> > >
> > > We define a binding only with test scope in our modules.
> > >
> > > Regards
> > > JB
> > >
> > > On 03/22/2017 04:58 AM, Aviem Zur wrote:
> > > > Hi all,
> > > >
> > > > There have been a few reports lately (On JIRA [1] and on Slack)
> > from
> > > users
> > > > regarding inconsistent loggers used across Beam's modules.
> > > >
> > > > While we use SLF4J, different modules use a different logger
> > behind it
> > > > (JUL, log4j, etc)
> > > > So when people add a log4j.properties file to their classpath
for
> > > instance,
> > > > they expect this to affect all of their dependencies on Beam
> > modules, but
> > > > it doesn’t and they miss out on some logs they thought they
would
> > see.
> > > >
> > > > I think we should strive for consistency in which logger is used
> > behind
> > > > SLF4J, and try to enforce this in our modules.
> > > > I for one think it should be slf4j-log4j. However, if
performance
> > of
> > > > logging is critical we might want to consider logback.
> > > >
> > > > Note: SLF4J will still be the facade for logging across the
> > project. The
> > > > only change would be the logger SLF4J delegates to.
> > > >
> > > > Once we have something like this it would also be useful to add
> > > > documentation on logging in Beam to the website.
> > > >
> > > > [1] https://issues.apache.org/jira/browse/BEAM-1757
> > > >
> > >
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
> >
> >
> >
>


Re: [DISCUSSION] Consistent use of loggers

2017-03-21 Thread Aviem Zur
+1 to what JB said.

Will just have to be documented well, as if we provide no binding there will
be no logging out of the box unless the user adds a binding.

On Wed, Mar 22, 2017 at 6:24 AM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Hi Aviem,
>
> Good point.
>
> I think, in our dependencies set, we should just depend to slf4j-api and
> let the
> user provides the binding he wants (slf4j-log4j12, slf4j-simple, whatever).
>
> We define a binding only with test scope in our modules.
>
> Regards
> JB
>
> On 03/22/2017 04:58 AM, Aviem Zur wrote:
> > Hi all,
> >
> > There have been a few reports lately (On JIRA [1] and on Slack) from
> users
> > regarding inconsistent loggers used across Beam's modules.
> >
> > While we use SLF4J, different modules use a different logger behind it
> > (JUL, log4j, etc)
> > So when people add a log4j.properties file to their classpath for
> instance,
> > they expect this to affect all of their dependencies on Beam modules, but
> > it doesn’t and they miss out on some logs they thought they would see.
> >
> > I think we should strive for consistency in which logger is used behind
> > SLF4J, and try to enforce this in our modules.
> > I for one think it should be slf4j-log4j. However, if performance of
> > logging is critical we might want to consider logback.
> >
> > Note: SLF4J will still be the facade for logging across the project. The
> > only change would be the logger SLF4J delegates to.
> >
> > Once we have something like this it would also be useful to add
> > documentation on logging in Beam to the website.
> >
> > [1] https://issues.apache.org/jira/browse/BEAM-1757
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [ANNOUNCEMENT] New committers, March 2017 edition!

2017-03-18 Thread Aviem Zur
Thanks all! Very excited to join.
Congratulations to other new committers!

On Sat, Mar 18, 2017 at 2:17 AM Thomas Weise <thomas.we...@gmail.com> wrote:

> Congrats!
>
>
> On Fri, Mar 17, 2017 at 4:28 PM, Chamikara Jayalath <chamik...@apache.org>
> wrote:
>
> > Thanks all. Congrats to other new committers !!
> >
> > I'm very excited to join.
> >
> > - Cham
> >
> > On Fri, Mar 17, 2017 at 3:02 PM Mark Liu <mark...@google.com.invalid>
> > wrote:
> >
> > > Congrats to all of them!
> > >
> > > On Fri, Mar 17, 2017 at 2:24 PM, Neelesh Salian <
> > neeleshssal...@gmail.com>
> > > wrote:
> > >
> > > > Congratulations!
> > > >
> > > > On Fri, Mar 17, 2017 at 2:16 PM, Kenneth Knowles
> > <k...@google.com.invalid
> > > >
> > > > wrote:
> > > >
> > > > > Congrats all!
> > > > >
> > > > > On Fri, Mar 17, 2017 at 2:13 PM, Davor Bonaci <da...@apache.org>
> > > wrote:
> > > > >
> > > > > > Please join me and the rest of Beam PMC in welcoming the
> following
> > > > > > contributors as our newest committers. They have significantly
> > > > > contributed
> > > > > > to the project in different ways, and we look forward to many
> more
> > > > > > contributions in the future.
> > > > > >
> > > > > > * Chamikara Jayalath
> > > > > > Chamikara has been contributing to Beam since inception, and
> > > previously
> > > > > to
> > > > > > Google Cloud Dataflow, accumulating a total of 51 commits (8,301
> > ++ /
> > > > > 3,892
> > > > > > --) since February 2016 [1]. He contributed broadly to the
> project,
> > > but
> > > > > > most significantly to the Python SDK, building the IO framework
> in
> > > this
> > > > > SDK
> > > > > > [2], [3].
> > > > > >
> > > > > > * Eugene Kirpichov
> > > > > > Eugene has been contributing to Beam since inception, and
> > previously
> > > to
> > > > > > Google Cloud Dataflow, accumulating a total of 95 commits (22,122
> > ++
> > > /
> > > > > > 18,407 --) since February 2016 [1]. In recent months, he’s been
> > > driving
> > > > > the
> > > > > > Splittable DoFn effort [4]. A true expert on IO subsystem, Eugene
> > has
> > > > > > reviewed nearly every IO contributed to Beam. Finally, Eugene
> > > > contributed
> > > > > > the Beam Style Guide, and is championing it across the project.
> > > > > >
> > > > > > * Ismaël Mejia
> > > > > > Ismaël has been contributing to Beam since mid-2016,
> accumulating a
> > > > total
> > > > > > of 35 commits (3,137 ++ / 1,328 --) [1]. He authored the HBaseIO
> > > > > connector,
> > > > > > helped on the Spark runner, and contributed in other areas as
> well,
> > > > > > including cross-project collaboration with Apache Zeppelin.
> Ismaël
> > > > > reported
> > > > > > 24 Jira issues.
> > > > > >
> > > > > > * Aviem Zur
> > > > > > Aviem has been contributing to Beam since early fall,
> accumulating
> > a
> > > > > total
> > > > > > of 49 commits (6,471 ++ / 3,185 --) [1]. He reported 43 Jira
> > issues,
> > > > and
> > > > > > resolved ~30 issues. Aviem improved the stability of the Spark
> > > runner a
> > > > > > lot, and introduced support for metrics. Finally, Aviem is
> > > championing
> > > > > > dependency management across the project.
> > > > > >
> > > > > > Congratulations to all four! Welcome!
> > > > > >
> > > > > > Davor
> > > > > >
> > > > > > [1]
> > > > > > https://github.com/apache/beam/graphs/contributors?from=2016-02-01&to=2017-03-17&type=c
> > > > > > [2]
> > > > > > https://github.com/apache/beam/blob/v0.6.0/sdks/python/
> > > > > > apache_beam/io/iobase.py#L70
> > > > > > [3]
> > > > > > https://github.com/apache/beam/blob/v0.6.0/sdks/python/
> > > > > > apache_beam/io/iobase.py#L561
> > > > > > [4] https://s.apache.org/splittable-do-fn
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > > Neelesh S. Salian
> > > >
> > >
> >
>


Default shading configuration and opting out

2017-03-14 Thread Aviem Zur
Hi all,

https://github.com/apache/beam/pull/2096 introduced a common shading
configuration for all of the modules in the project.

The reason for this is that modules which depend on Guava may leak this
dependency to the user, and this could conflict with the version of Guava
the user requires.
A common, default shading configuration ensures that if a module adds a
dependency on Guava, it will be shaded and relocated to avoid these
conflicts.

This change does make the file sizes of the modules' jars go up. If there
are modules which depend on Guava but do not require shading and relocating
it (i.e. the module does not pose a risk of leaking Guava to the user), we
can opt that module out of the default shading configuration.
To do this, add the following to the module's build plugins:
https://github.com/apache/beam/pull/2096#issuecomment-286393622
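For reference, the Guava relocation that such a default configuration applies
looks roughly like the following maven-shade-plugin sketch. The
`org.apache.beam.sdk.repackaged` prefix matches the relocated Preconditions
class visible in stack traces on this list, but the exact configuration in
the PR may differ:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <!-- Move Guava classes under a Beam-private package so user
                 pipelines can depend on any Guava version they like. -->
            <pattern>com.google.common</pattern>
            <shadedPattern>org.apache.beam.sdk.repackaged.com.google.common</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```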


P.S.

Our shading of Guava currently only covers direct (non-transitive)
dependencies on Guava. If a module has a transitive dependency on Guava
(and does not have an explicit direct dependency on it), it will not be
shaded (the Maven dependency plugin protects us from this, but not in all
cases). We plan to fix this in
https://issues.apache.org/jira/browse/BEAM-1706

We can consider configuring minimization for our maven shade plugin to
reduce the jar sizes, but there are some issues with this. See:
https://issues.apache.org/jira/browse/BEAM-1720


Add GitHub topics to Beam repository

2017-03-09 Thread Aviem Zur
About a month ago GitHub introduced topics, which let GitHub users query
for repositories by topics (domains that the repos deal with).
We can leverage these to increase Beam's exposure on GitHub.

Example topics we could add: big-data, google-cloud-dataflow, spark, flink,
apex, gearpump
We can also add the topics which Dataflow added: data-science,
data-analysis, data-mining, data-processing

[1] https://github.com/blog/2309-introducing-topics
[2] https://github.com/GoogleCloudPlatform/DataflowJavaSDK


Re: Interest in a (virtual) contributor meeting?

2017-02-21 Thread Aviem Zur
+1

On Wed, Feb 22, 2017 at 5:45 AM Jesse Anderson 
wrote:

> Sounds good.
>
> On Tue, Feb 21, 2017, 7:19 PM Davor Bonaci  wrote:
>
> > In the early days of the project, we have held a few meetings for the
> > initial community to get to know each other. Since then, the community
> has
> > grown a huge amount, but we haven't organized any get-togethers.
> >
> > I wanted to gauge interest in a potential video conference call in the
> near
> > future. No specific agenda -- simply a chance for everyone to meet others
> > and see the faces of people we share a common passion with. Of course, an
> > open discussion on any topic of interest to the contributor community is
> > welcome. This would be strictly informal -- any decisions are reserved
> for
> > the mailing list discussions.
> >
> > If you'd be interested in attending, please reply back. If there's
> > sufficient interest, I'd be happy to try to organize something in the
> near
> > future.
> >
> > Thanks!
> >
> > Davor
> >
>


Re: Metrics for Beam IOs.

2017-02-18 Thread Aviem Zur
Is there a way to leverage runners' existing metrics sinks?
As stated by Amit & Stas, the Spark runner uses Spark's metrics sink to
report Beam's aggregators and metrics.
Other runners may also have a similar capability; I'm not sure. This could
remove the need for a plugin, and for dealing with push/pull.
I'm assuming we should compile a table of what can be supported in each
runner in this area and then decide on a way to move forward?
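On the push/pull point: a pull-style metrics source can be bridged to a
push-style sink with a small scheduled adapter. A JDK-only sketch of the idea
(PullToPushAdapter is a hypothetical name for illustration, not a Beam
class):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Supplier;

/**
 * Wraps a pull-style metrics source so it can feed a push-style reporting
 * sink on a fixed schedule, similar to what a scheduled reporter does.
 */
public class PullToPushAdapter implements AutoCloseable {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    public PullToPushAdapter(Supplier<Long> pullSource,
                             Consumer<Long> pushSink,
                             long periodMillis) {
        scheduler.scheduleAtFixedRate(
            () -> pushSink.accept(pullSource.get()),  // poll, then push
            0, periodMillis, TimeUnit.MILLISECONDS);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();  // stop the reporting loop
    }
}
```

A runner whose backend only exposes pull-style queries could wire such an
adapter to whatever reporting sink the user configures.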

On Sat, Feb 18, 2017 at 6:35 PM Jean-Baptiste Onofré 
wrote:

> Good point.
>
> In Decanter, it's what I named a "scheduled collector". So, yes, the
> adapter will periodically harvest metric to push.
>
> Regards
> JB
>
> On 02/18/2017 05:30 PM, Amit Sela wrote:
> > First issue with "push" metrics plugin - what if the runner's underlying
> > reporting mechanism is "pull" ? Codahale ScheduledReporter will sample
> the
> > values every X and send to ...
> > So any runner using a "pull-like" would use an adapter ?
> >
> > On Sat, Feb 18, 2017 at 6:27 PM Jean-Baptiste Onofré 
> > wrote:
> >
> >> Hi Ben,
> >>
> >> ok it's what I thought. Thanks for the clarification.
> >>
> >> +1 for the plugin-like "push" API (it's what I have in mind too ;)).
> >> I will start a PoC for discussion next week.
> >>
> >> Regards
> >> JB
> >>
> >> On 02/18/2017 05:17 PM, Ben Chambers wrote:
> >>> The runner can already report metrics during pipeline execution so it
> is
> >>> usable for monitoring.
> >>>
> >>> The pipeline result can be used to query metrics during pipeline
> >> execution,
> >>> so a first version of reporting to other systems is to periodically
> pulls
> >>> metrics from the runner with that API.
> >>>
> >>> We may eventually want to provide a plugin-like API to get the runner
> to
> >>> push metrics more directly to other metrics stores. This layer needs
> some
> >>> thought since it has to handle the complexity of attempted/committed
> >>> metrics to be consistent with the model.
> >>>
> >>>
> >>>
> >>> On Sat, Feb 18, 2017, 5:44 AM Jean-Baptiste Onofré 
> >> wrote:
> >>>
> >>> Hi Amit,
> >>>
> >>> before Beam, I didn't mind about portability ;) So I used the Spark
> >>> approach.
> >>>
> >>> But, now, as a Beam user, I would expect a generic way to deal with
> >>> metric whatever the runner would be.
> >>>
> >>> Today, you are right: I'm using the solution provided by the execution
> >>> engine. That's the current approach and it works fine. And it's up to
> me
> >>> to leverage (for intance Accumulators) it with my own system.
> >>>
> >>> My thought is more to provide a generic way. It's only a discussion for
> >>> now ;)
> >>>
> >>> Regards
> >>> JB
> >>>
> >>> On 02/18/2017 02:38 PM, Amit Sela wrote:
>  On Sat, Feb 18, 2017 at 10:16 AM Jean-Baptiste Onofré <
> j...@nanthrax.net>
>  wrote:
> 
> > Hi Amit,
> >
> > my point is: how do we provide metric today to end user and how can
> >> they
> > use it to monitor a running pipeline ?
> >
> > Clearly the runner is involved, but, it should behave the same way
> for
> > all runners. Let me take an example.
> > On my ecosystem, I'm using both Flink and Spark with Beam, some
> > pipelines on each. I would like to get the metrics for all pipelines
> to
> > my monitoring backend. If I can "poll" from the execution engine
> metric
> > backend to my system that's acceptable, but it's an overhead of work.
> > Having a generic metric reporting layer would allow us to have a more
> > common way. If the user doesn't provide any reporting sink, then we
> use
> > the execution backend metric layer. If provided, we use the reporting
> >>> sink.
> >
>  How did you do it before Beam? I gather that for Spark you reported its
>  native metrics via a Codahale Reporter and Accumulators were visible in
>  the UI, and the Spark runner took it a step further to make it all visible
>  via Codahale. Assuming Flink does something similar, it all belongs to
>  runner setup/configuration.
> 
> >
> > About your question: you are right, it's possible to update a
> collector
> > or appender without impacting anything else.
> >
> > Regards
> > JB
> >
> > On 02/17/2017 10:38 PM, Amit Sela wrote:
> >> @JB I think what you're suggesting is that Beam should provide a
> >>> "Metrics
> >> Reporting" API as well, and I used to think like you, but the more I
> >> thought of that the more I tend to disagree now.
> >>
> >> The SDK is for users to author pipelines, so Metrics are for
> >>> user-defined
> >> metrics (in contrast to runner metrics).
> >>
> >> The Runner API is supposed to help different backends to integrate
> >> with
> >> Beam to allow users to execute those pipeline on their favourite
> > backend. I
> >> believe the Runner API has to provide restrictions/demands that are
> >> just
> >> enough so a runner could execute a Beam 

Re: Metrics for Beam IOs.

2017-02-14 Thread Aviem Zur
Hi Ismaël,

You've raised some great points.
Please see my comments inline.

On Tue, Feb 14, 2017 at 3:37 PM Ismaël Mejía  wrote:

> ​Hello,
>
> The new metrics API allows us to integrate some basic metrics into the Beam
> IOs. I have been following some discussions about this on JIRAs/PRs, and I
> think it is important to discuss the subject here so we can have more
> awareness and obtain ideas from the community.
>
> First I want to thank Ben for his work on the metrics API, and Aviem for
> his ongoing work on metrics for IOs, e.g. KafkaIO) that made me aware of
> this subject.
>
> There are some basic ideas to discuss e.g.
>
> - What are the responsibilities of Beam IOs in terms of Metrics
> (considering the fact that the actual IOs, server + client, usually provide
> their own)?
>

While it is true that many IOs provide their own metrics, I think that Beam
should expose IO metrics because:

   1. Metrics which help in understanding the performance of a pipeline
   which uses an IO may not be covered by the IO itself.
   2. Users may not be able to set up integrations with the IO's metrics to
   view them effectively (and correlate them to a specific Beam pipeline),
   but still want to investigate their pipeline's performance.


> - What metrics are relevant to the pipeline (or some particular IOs)? Kafka
> backlog for one could point that a pipeline is behind ingestion rate.


I think it depends on the IO, but there is probably overlap in some of the
metrics so a guideline might be written for this.
I listed what I thought should be reported for KafkaIO in the following
JIRA: https://issues.apache.org/jira/browse/BEAM-1398
Feel free to add more metrics you think are important to report.
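To give a rough feel for the kind of per-IO counters being proposed, here is a toy sketch in plain Java. This is deliberately NOT Beam's implementation, and the metric names (`bytesRead`, `backlogElements`) are hypothetical examples; Beam's actual user-facing entry point is `Metrics.counter(namespace, name)`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Toy counter registry illustrating namespaced per-IO metrics.
// Not Beam's API -- just the shape of the idea.
public class IoMetricsSketch {
    private static final Map<String, AtomicLong> COUNTERS = new ConcurrentHashMap<>();

    // One counter per (namespace, name) pair, created on first use.
    static AtomicLong counter(String namespace, String name) {
        return COUNTERS.computeIfAbsent(namespace + "/" + name, k -> new AtomicLong());
    }

    public static void main(String[] args) {
        AtomicLong bytesRead = counter("KafkaIO", "bytesRead");
        AtomicLong backlog = counter("KafkaIO", "backlogElements");
        bytesRead.addAndGet(1024); // updated as records are consumed
        backlog.set(17);           // gauge-like: latest observed backlog
        System.out.println(counter("KafkaIO", "bytesRead").get()); // 1024
    }
}
```

The registry look-up by (namespace, name) mirrors how a runner can aggregate the same logical counter across workers.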


>
>
- Should metrics be calculated on IOs by default or no?
> - If metrics are defined by default does it make sense to allow users to
> disable them?
>

IIUC, your concern is that metrics will add overhead to the pipeline, and
pipelines which are highly sensitive to this will be hampered?
In any case I think that yes, metrics calculation should be configurable
(Enable/disable).
In the Spark runner, for example, the metrics sink feature (not the metrics
calculation itself, but the sinks to send them to) is configurable via the
pipeline options.


> Well these are just some questions around the subject so we can create a
> common set of practices to include metrics in the IOs and eventually
> improve the transform guide with this. What do you think about this? Do you
> have other questions/ideas?
>
> Thanks,
> Ismaël
>


Re: Projects for Google Summer of Code 2017

2017-02-11 Thread Aviem Zur
Kenn that scholarly documents project sounds awesome.

On Fri, Feb 3, 2017 at 11:48 PM Kenneth Knowles 
wrote:

> In fact, I have just learned that our deadline to file project _is_
> February 9th. Having good ideas is part of the ASF's application process.
>
> Here's a TL;DR of the instructions:
>
> 0. ASF members and committers can be mentors; find one for any project
> idea.
> 1. Mentors: understand what it means to be a mentor [1].
> 2. Mentors: create a JIRA issue for each idea:
> 2A. assign to the mentor.
> 2B. label with "gsoc2017" and "mentor"; a JIRA search for these is at [2].
> 2C. label with _prerequisites_ such as programming language, tools, area.
>
> More info about ASF+GSOC is at [3].
>
> Kenn
>
> [1] http://community.apache.org/guide-to-being-a-mentor.html
> [2] http://s.apache.org/gsoc2017ideas
> [3] http://community.apache.org/gsoc.html
>
> On Wed, Feb 1, 2017 at 12:03 PM, Pablo Estrada  >
> wrote:
>
> > I believe that Beam falls within the umbrella of the Apache Software
> > Foundation. All we'd need to do is register mentors for projects [1][4],
> > and create JIRA issues with the appropriate labels [2]. So, instead our
> > deadline for the project proposal is on the day when mentoring
> > organizations are announced (Feb 27) [3].
> >
> > [1].
> > https://community.apache.org/gsoc.html#prospective-asf-mentors-read-this
> > [2].
> > https://community.apache.org/use-the-comdev-issue-tracker-
> > for-gsoc-tasks.html
> > [3]. https://developers.google.com/open-source/gsoc/timeline
> > [4]. https://community.apache.org/guide-to-being-a-mentor.html
> >
> > On Tue, Jan 31, 2017 at 7:03 AM Kenneth Knowles 
> > wrote:
> >
> > > I think this is a great idea. I also participated in GSOC once.
> > >
> > > I've been particularly interested in coming up with great new
> > applications
> > > of Beam to new domains. In chatting with professors at the University
> of
> > > Washington, I've learned that scholars of many fields would really like
> > to
> > > explore new and highly customized ways of processing the growing body
> of
> > > publicly-available scholarly documents. This seems like a great
> project,
> > > since we love doing this to Shakespeare's works, and there are
> thousands
> > of
> > > times as many public articles so there's non-toy scale issues. And yet,
> > it
> > > does seem like it can be scoped appropriately.
> > >
> > > The deadline for a mentoring organization is Feb 9 so let's put
> together
> > a
> > > proposal!
> > >
> > > Kenn
> > >
> > > On Fri, Jan 13, 2017 at 3:25 PM, Pablo Estrada
> >  > > >
> > > wrote:
> > >
> > > > Hi there,
> > > > The GSOC 2017 [1] is coming soon. I figured it would be nice if we
> > could
> > > > find small projects that a student could implement this summer.
> Apache
> > > > already takes part in this, and all we'd need to do is label Jira
> > issues
> > > as
> > > > GSOC projects. Any ideas for projects?
> > > >
> > > > As a note, during my grad school I participated in GSOC a couple of
> > times
> > > > and I'd say they were some of my most rewarding development
> > experiences.
> > > >
> > > > [1] - https://developers.google.com/open-source/gsoc/
> > > >
> > >
> >
>


Re: Better developer instructions for using Maven?

2017-02-10 Thread Aviem Zur
Opened JIRA ticket: https://issues.apache.org/jira/browse/BEAM-1457

On Fri, Feb 10, 2017 at 4:54 PM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Yeah. Agree. Time extend is not huge and it's worth to add it in verify
> phase.
>
> Regards
> JB
>
> On Feb 10, 2017, 10:13, at 10:13, Aviem Zur <aviem...@gmail.com> wrote:
> >This goes back to the original discussion in this thread - reduce the
> >amount of things pull requesters should know and keep the maven command
> >in
> >the PR checklist as: 'mvn clean verify'.
> >
> >So if rat and findbugs do not take that long to run I think they should
> >be
> >run by 'mvn clean verify'
> >
> >I ran a quick test on my laptop to see how much time they add to the
> >build
> >(of the entire project):
> >
> >'mvn clean install -DskipTests' => Total time: 03:51 min
> >'mvn clean install apache-rat:check findbugs:check -DskipTests'  =>
> >Total
> >time: 05:29 min (Added 01:38 min)
> >'mvn clean install' => Total time: 09:37 min
> >'mvn clean install apache-rat:check findbugs:check' => Total time:
> >11:13
> >min (Added 01:36 min)
> >
> >Are these times reasonable enough to add rat and findbugs to the
> >default
> >build?
> >
> >On Fri, Feb 10, 2017 at 1:55 PM Jean-Baptiste Onofré <j...@nanthrax.net>
> >wrote:
> >
> >> Hi
> >>
> >> We discussed about that at the beginning of the project. We agreed to
> >> execute rat and findbugs in a specific profile to reduce the build
> >time for
> >> dev.
> >>
> >> That's why I do mvn clean install -Prelease before submitting a PR
> >and
> >> just clean install when I'm developing.
> >>
> >> No problem to change that.
> >>
> >> Regards
> >> JB
> >>
> >> On Feb 10, 2017, 07:51, at 07:51, Aviem Zur <aviem...@gmail.com>
> >wrote:
> >> >Can we consider adding rat-plugin and findbugs to the default verify
> >> >phase?
> >> >Currently they only run when the `release` profile is enabled.
> >> >
> >> >On Thu, Jan 26, 2017 at 11:42 AM Aljoscha Krettek
> ><aljos...@apache.org>
> >> >wrote:
> >> >
> >> >> +1 to what Dan said
> >> >>
> >> >> On Wed, 25 Jan 2017 at 21:40 Kenneth Knowles
> ><k...@google.com.invalid>
> >> >> wrote:
> >> >>
> >> >> > +1
> >> >> >
> >> >> > On Jan 25, 2017 11:15, "Jean-Baptiste Onofré" <j...@nanthrax.net>
> >> >wrote:
> >> >> >
> >> >> > > +1
> >> >> > >
> >> >> > > It sounds good to me.
> >> >> > >
> >> >> > > Thanks Dan !
> >> >> > >
> >> >> > > Regards
> >> >> > > JB⁣​
> >> >> > >
> >> >> > > On Jan 25, 2017, 19:39, at 19:39, Dan Halperin
> >> >> > <dhalp...@google.com.INVALID>
> >> >> > > wrote:
> >> >> > > >Here is my summary of the threads:
> >> >> > > >
> >> >> > > >Overwhelming agreement:
> >> >> > > >
> >> >> > > >- rename `release` to something more appropriate.
> >> >> > > >- add `checkstyle` to the default build (it's basically a
> >> >compile
> >> >> > > >error)
> >> >> > > >- add more information to contributor guide
> >> >> > > >
> >> >> > > >Reasonable agreement
> >> >> > > >
> >> >> > > >- don't update the github instructions to make passing `mvn
> >> >> > > >verify -P<checks>` mandatory. Maybe add a hint that this is a
> >> >> > > >good proxy for what Jenkins will run.
> >> >> > > >
> >> >> > > >Unresolved:
> >> >> > > >
> >> >> > > >- whether all checks should be in `mvn verify`
> >> >> > > >- whether `mvn test` is useful for most workflows
> >> >> > > >
> >> >> > > >I'll propose to proceed with the overwhelmingly agreed-upon
> >> >changes,
> >> >> >

Issue with Coder documentation regarding context

2017-02-09 Thread Aviem Zur
Hi,

I think improvements can be made to the documentation of `encode` and
`decode` methods in `Coder`.

A coder may be used to encode/decode several objects using a single stream;
you cannot assume that the stream the coder encodes to/decodes from
contains bytes representing only a single object. This happens, for
example, when the coder is used in an `IterableCoder`, as in `GroupByKey`.

When implementing a coder this needs to be taken into account.

The `context` argument in `encode` and `decode` methods provides the
necessary information.

The existing documentation for these methods does not seem to cover this.
If users are not aware of this when implementing these methods it can cause
errors or skewed results.

See:
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java#L126
and:
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java#L137

This is partially addressed in the documentation of the static `Context`
values:
`OUTER`:
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java#L72
and `NESTED`:
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java#L79

However, I think that the documentation of `encode` and `decode` should
explain this concept clearly, to avoid confusing users implementing coders.
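To make the pitfall concrete, here is a minimal sketch using plain `java.io` (deliberately not Beam's actual `Coder` API) of why an element's encoding must be self-delimiting in the nested case: when several elements share one stream, each element must carry enough information (here, a length prefix) for the reader to know where it ends.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch of nested-context encoding: length-prefix each element so a
// reader can decode several elements back-to-back from one stream.
// In the outer context the stream boundary itself delimits the single
// element, so no prefix would be needed.
public class NestedEncodingDemo {

    static void encodeNested(byte[] value, DataOutputStream out) throws IOException {
        out.writeInt(value.length); // self-delimiting: reader knows how far to read
        out.write(value);
    }

    static byte[] decodeNested(DataInputStream in) throws IOException {
        byte[] value = new byte[in.readInt()];
        in.readFully(value);
        return value;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        // Two elements encoded into a single stream, as IterableCoder would.
        encodeNested("foo".getBytes("UTF-8"), out);
        encodeNested("barbaz".getBytes("UTF-8"), out);

        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        System.out.println(new String(decodeNested(in), "UTF-8")); // foo
        System.out.println(new String(decodeNested(in), "UTF-8")); // barbaz
    }
}
```

A coder that ignores this and assumes it owns the whole stream decodes the first element correctly and then corrupts or swallows the rest, which is exactly the kind of skewed result described above.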


Re: PTransform style guide PR

2017-02-07 Thread Aviem Zur
Very well written.
Examples for every concept make it very easily relatable and understandable.

On Tue, Jan 31, 2017 at 3:52 AM Eugene Kirpichov
 wrote:

> I don't think I'll have capacity to review every PR that brings particular
> Beam transforms in accordance with the style guide - but I'm happy to
> review some of them and participate in discussions of potentially more
> controversial changes. In my ideal world, this task would be crowdsourced
> (I sent out a separate thread about that).
>
> On Mon, Jan 30, 2017 at 5:43 PM Jesse Anderson 
> wrote:
>
> > Thanks for putting that together. Does this mean you've volunteered to
> > referee bikeshedding?
> >
> > On Mon, Jan 30, 2017 at 5:21 PM Eugene Kirpichov
> >  wrote:
> >
> > > The initial PR has been merged and the style guide is live
> > > https://beam.apache.org/contribute/ptransform-style-guide/ - let us
> > > continue discussing and tweaking on this thread and via smaller PRs
> > > modifying the document.
> > >
> > > On Mon, Jan 30, 2017 at 7:50 AM Aljoscha Krettek 
> > > wrote:
> > >
> > > > Wow, that's a long read. But quite informative +1
> > > >
> > > > On Sat, 28 Jan 2017 at 06:54 Jean-Baptiste Onofré 
> > > wrote:
> > > >
> > > > > Hi Eugene,
> > > > >
> > > > > As said in the PR: great work and thanks a lot !
> > > > >
> > > > > I will take a complete look during the week end. I'm pretty sure
> > it's a
> > > > > great guide as it's basically the result of our discussions and
> > reviews
> > > > ;)
> > > > >
> > > > > Thanks again !
> > > > > Regards
> > > > > JB
> > > > >
> > > > > On 01/28/2017 06:21 AM, Eugene Kirpichov wrote:
> > > > > > Hello all,
> > > > > >
> > > > > > I just sent a pull request with a style guide for developers of
> new
> > > > > > PTransforms - intended for library writers, e.g. people who
> > > contribute
> > > > > new
> > > > > > connectors and other transforms to Beam. The guide is mainly
> based
> > on
> > > > > > experience from reviewing connectors contributed by JB and
> others,
> > > but
> > > > > it's
> > > > > > intended to be generally applicable.
> > > > > >
> > > > > > It covers a variety of points - from code organization, to
> overall
> > > API
> > > > > > design, to error handling and so on. I expect most of it to be
> > > > > > non-controversial and just reflect the style of existing
> transforms
> > > in
> > > > > Beam
> > > > > > - however all of it is, of course, up to debate.
> > > > > >
> > > > > > https://github.com/apache/beam-site/pull/134/
> > > > > >
> > > > > > I'm hoping that this documentation will help guide new transform
> > > > authors
> > > > > in
> > > > > > the right direction from the start, as well as make the job of
> > > > reviewers
> > > > > > easier by providing a source they can link to and helping focus
> the
> > > > > review
> > > > > > on resolving more ambiguous points.
> > > > > >
> > > > > > (Note that, like all other documentation, this will evolve, so
> the
> > > goal
> > > > > of
> > > > > > the current PR is not to be complete, but to be a starting point)
> > > > > >
> > > > > > When the guide is ratified, I think it'll make sense to file
> JIRAs
> > to
> > > > > bring
> > > > > > Beam in accordance with it - there are a few transforms that were
> > > > written
> > > > > > before the best practices shaped up.
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > >
> > > > > --
> > > > > Jean-Baptiste Onofré
> > > > > jbono...@apache.org
> > > > > http://blog.nanthrax.net
> > > > > Talend - http://www.talend.com
> > > > >
> > > >
> > >
> >
>


Re: TextIO binary file

2017-02-05 Thread Aviem Zur
AvroIO is great for POJOs. But for use cases with more complex serializable
objects, or objects which are compatible with some coder, it falls short.

Also, for less savvy users, knowing they need to use AvroIO might be a
stretch.
A simpler API along the lines of ObjectFile might be more user friendly
(even if for optimization it uses Avro under the hood for POJOs).
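As a toy illustration only (the class name and API shape here are hypothetical, not a Beam proposal), the "object file" contract being discussed resembles plain Java serialization of a sequence of objects, which is roughly what Spark's `saveAsObjectFile` provides:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy sketch of an "object file": a count header followed by N
// serialized objects. The producing and consuming pipelines only need
// to agree on this one contract.
public class ObjectFileDemo {

    static byte[] writeObjects(List<? extends Serializable> values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeInt(values.size());
            for (Serializable v : values) {
                out.writeObject(v); // serialization stream is self-delimiting
            }
        }
        return bytes.toByteArray();
    }

    static List<Object> readObjects(byte[] data) throws IOException, ClassNotFoundException {
        List<Object> result = new ArrayList<>();
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            int n = in.readInt();
            for (int i = 0; i < n; i++) {
                result.add(in.readObject());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        byte[] data = writeObjects(Arrays.<Serializable>asList("a", 42, 3.14));
        System.out.println(readObjects(data)); // [a, 42, 3.14]
    }
}
```

A real version would swap Java serialization for a coder and write through Beam's filesystem layer; the point is only that no per-protocol IO code is needed.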

On Sun, Feb 5, 2017, 22:00 Eugene Kirpichov <kirpic...@google.com.invalid>
wrote:

> OK, I see what you mean; however I still think this can be solved without
> introducing a new "Beam object file" (or whatever) file format, and without
> thereby introducing additional use cases and compatibility constraints on
> coders.
>
> I asked before in the thread why not just use AvroIO (it can serialize
> arbitrary POJOs using reflection); I skimmed the thread it doesn't seem
> like that got answered properly. I also like Dan's suggestion to use AvroIO
> to serialize byte[] arrays and you can do whatever you want with them (e.g.
> use another serialization library, say, Kryo, or Java serialization, etc.)
>
> On Sun, Feb 5, 2017 at 11:37 AM Aviem Zur <aviem...@gmail.com> wrote:
>
> > I agree that these files will serve no use outside of Beam pipelines.
> >
> > The rationale was that you might want to have one pipeline write output
> to
> > files and then have a different pipeline that uses those files as inputs.
> >
> > Say one team in your organization creates a pipeline and a different team
> > utilizes those files as input for a different pipeline. The contract
> > between them is the file, in a Beam-readable format.
> > This is similar to Spark's `saveAsObjectFile`:
> > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> >
> > The merit for something like this in my eyes is to not burden the user
> with
> > writing a custom IO
> >
> > On Tue, Jan 31, 2017 at 10:23 PM Eugene Kirpichov
> > <kirpic...@google.com.invalid> wrote:
> >
> > +1 to Robert. Either this will be a Beam-specific file format (and then
> > nothing except Beam will be able to read it - which I doubt is what you
> > want), or it is an existing well-known file format and then we should
> just
> > develop an IO for it.
> > Note that any file format that involves encoding elements with a Coder is
> > Beam-specific, because wire format of coders is Beam-specific.
> >
> > On Tue, Jan 31, 2017 at 12:20 PM Robert Bradshaw
> > <rober...@google.com.invalid> wrote:
> >
> > > On Tue, Jan 31, 2017 at 12:04 PM, Aviem Zur <aviem...@gmail.com>
> wrote:
> > > > +1 on what Stas said.
> > > > I think there is value in not having the user write a custom IO for a
> > > > protocol they use which is not covered by Beam IOs. Plus having them
> > deal
> > > > with not only the encoding but also the IO part is not ideal.
> > > > I think having a basic FileIO that can write to the Filesystems
> > supported
> > > > by Beam (GS/HDFS/Local/...) which you can use any coder with,
> including
> > > > your own custom coder, can be beneficial.
> > >
> > > What would the format of the file be? Just the concatenation of the
> > > elements encoded according to the coder? Or is there a delimiter
> > > needed to separate records. In which case how does one ensure the
> > > delimiter does not also appear in the middle of an encoded element? At
> > > this point you're developing a file format, and might as well stick
> > > with one of the standard ones. https://xkcd.com/927
> > >
> > > > On Tue, Jan 31, 2017 at 7:56 PM Stas Levin <stasle...@gmail.com>
> > wrote:
> > > >
> > > > I believe the motivation is to have an abstraction that allows one to
> > > write
> > > > stuff to a file in a way that is agnostic to the coder.
> > > > If one needs to write a non-Avro protocol to a file, and this
> > particular
> > > > protocol does not meet the assumption made by TextIO, one might need
> to
> > > > duplicate the file IO related code from AvroIO.
> > > >
> > > > On Tue, Jan 31, 2017 at 6:50 PM Eugene Kirpichov
> > > > <kirpic...@google.com.invalid> wrote:
> > > >
> > > >> Could you clarify why it would be useful to write objects to files
> > using
> > > >> Beam coders, as opposed to just using e.g. AvroIO

Re: TextIO binary file

2017-02-05 Thread Aviem Zur
I agree that these files will serve no use outside of Beam pipelines.

The rationale was that you might want to have one pipeline write output to
files and then have a different pipeline that uses those files as inputs.

Say one team in your organization creates a pipeline and a different team
utilizes those files as input for a different pipeline. The contract
between them is the file, in a Beam-readable format.
This is similar to Spark's `saveAsObjectFile`:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512

The merit for something like this in my eyes is to not burden the user with
writing a custom IO

On Tue, Jan 31, 2017 at 10:23 PM Eugene Kirpichov
<kirpic...@google.com.invalid> wrote:

+1 to Robert. Either this will be a Beam-specific file format (and then
nothing except Beam will be able to read it - which I doubt is what you
want), or it is an existing well-known file format and then we should just
develop an IO for it.
Note that any file format that involves encoding elements with a Coder is
Beam-specific, because wire format of coders is Beam-specific.

On Tue, Jan 31, 2017 at 12:20 PM Robert Bradshaw
<rober...@google.com.invalid> wrote:

> On Tue, Jan 31, 2017 at 12:04 PM, Aviem Zur <aviem...@gmail.com> wrote:
> > +1 on what Stas said.
> > I think there is value in not having the user write a custom IO for a
> > protocol they use which is not covered by Beam IOs. Plus having them
deal
> > with not only the encoding but also the IO part is not ideal.
> > I think having a basic FileIO that can write to the Filesystems
supported
> > by Beam (GS/HDFS/Local/...) which you can use any coder with, including
> > your own custom coder, can be beneficial.
>
> What would the format of the file be? Just the concatenation of the
> elements encoded according to the coder? Or is there a delimiter
> needed to separate records. In which case how does one ensure the
> delimiter does not also appear in the middle of an encoded element? At
> this point you're developing a file format, and might as well stick
> with one of the standard ones. https://xkcd.com/927
>
> > On Tue, Jan 31, 2017 at 7:56 PM Stas Levin <stasle...@gmail.com> wrote:
> >
> > I believe the motivation is to have an abstraction that allows one to
> write
> > stuff to a file in a way that is agnostic to the coder.
> > If one needs to write a non-Avro protocol to a file, and this particular
> > protocol does not meet the assumption made by TextIO, one might need to
> > duplicate the file IO related code from AvroIO.
> >
> > On Tue, Jan 31, 2017 at 6:50 PM Eugene Kirpichov
> > <kirpic...@google.com.invalid> wrote:
> >
> >> Could you clarify why it would be useful to write objects to files
using
> >> Beam coders, as opposed to just using e.g. AvroIO?
> >>
> >> Coders (should) make no promise as to what their wire format is, so
such
> >> files could be read back only by other Beam pipelines using the same
IO.
> >>
> >> On Tue, Jan 31, 2017 at 2:48 AM Aviem Zur <aviem...@gmail.com> wrote:
> >>
> >> > So If I understand the general agreement is that TextIO should not
> >> support
> >> > anything but lines from files as strings.
> >> > I'll go ahead and file a ticket that says the Javadoc should be
> changed
> >> to
> >> > reflect this and `withCoder` method should be removed.
> >> >
> >> > Is there merit for Beam to supply an IO which does allow writing
> objects
> >> to
> >> > a file using Beam coders and Beam FS (To write these files to
> >> > GS/Hadoop/Local)?
> >> >
> >> > On Tue, Jan 31, 2017 at 2:28 AM Eugene Kirpichov
> >> > <kirpic...@google.com.invalid> wrote:
> >> >
> >> > P.S. Note that this point (about coders) is also mentioned in the
> >> > now-being-reviewed PTransform Style Guide
> >> > https://github.com/apache/beam-site/pull/134
> >> > currently staged at
> >> >
> >> >
> >>
> >
>
http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders
> >> >
> >> >
> >> > On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath <
> chamik...@apache.org
> >> >
> >> > wrote:
> >> >
> >> > > +1 to what Eugene said.
> >> > >
> >> > > I've seen a number of Python SDK users incorrectly assuming that
> >> > > coder.decode() is needed when developing their own file-based
> sources
> >> > >

Re: TextIO binary file

2017-01-31 Thread Aviem Zur
+1 on what Stas said.
I think there is value in not having the user write a custom IO for a
protocol they use which is not covered by Beam IOs. Plus having them deal
with not only the encoding but also the IO part is not ideal.
I think having a basic FileIO that can write to the Filesystems supported
by Beam (GS/HDFS/Local/...) which you can use any coder with, including
your own custom coder, can be beneficial.

On Tue, Jan 31, 2017 at 7:56 PM Stas Levin <stasle...@gmail.com> wrote:

I believe the motivation is to have an abstraction that allows one to write
stuff to a file in a way that is agnostic to the coder.
If one needs to write a non-Avro protocol to a file, and this particular
protocol does not meet the assumption made by TextIO, one might need to
duplicate the file IO related code from AvroIO.

On Tue, Jan 31, 2017 at 6:50 PM Eugene Kirpichov
<kirpic...@google.com.invalid> wrote:

> Could you clarify why it would be useful to write objects to files using
> Beam coders, as opposed to just using e.g. AvroIO?
>
> Coders (should) make no promise as to what their wire format is, so such
> files could be read back only by other Beam pipelines using the same IO.
>
> On Tue, Jan 31, 2017 at 2:48 AM Aviem Zur <aviem...@gmail.com> wrote:
>
> > So If I understand the general agreement is that TextIO should not
> support
> > anything but lines from files as strings.
> > I'll go ahead and file a ticket that says the Javadoc should be changed
> to
> > reflect this and `withCoder` method should be removed.
> >
> > Is there merit for Beam to supply an IO which does allow writing objects
> to
> > a file using Beam coders and Beam FS (To write these files to
> > GS/Hadoop/Local)?
> >
> > On Tue, Jan 31, 2017 at 2:28 AM Eugene Kirpichov
> > <kirpic...@google.com.invalid> wrote:
> >
> > P.S. Note that this point (about coders) is also mentioned in the
> > now-being-reviewed PTransform Style Guide
> > https://github.com/apache/beam-site/pull/134
> > currently staged at
> >
> >
>
http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders
> >
> >
> > On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath <chamik...@apache.org
> >
> > wrote:
> >
> > > +1 to what Eugene said.
> > >
> > > I've seen a number of Python SDK users incorrectly assuming that
> > > coder.decode() is needed when developing their own file-based sources
> > > (since many users usually refer to text source first). Probably coder
> > > parameter should not be configurable for text source/sink and they
> should
> > > be updated to only read/write UTF-8 encoded strings.
> > >
> > > - Cham
> > >
> > > On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov
> > > <kirpic...@google.com.invalid> wrote:
> > >
> > > > The use of Coder in TextIO is a long standing design issue because
> > coders
> > > > are not intended to be used for general purpose converting things
> from
> > > and
> > > > to bytes, their only proper use is letting the runner materialize
and
> > > > restore objects if the runner thinks it's necessary. IMO it should
> have
> > > > been called LineIO, document that it reads lines of text as String,
> and
> > > not
> > > > have a withCoder parameter at all.
> > > >
> > > > The proper way to address your use case is to write a custom
> > > > FileBasedSource.
> > > > On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur <aviem...@gmail.com>
> wrote:
> > > >
> > > > > The Javadoc of TextIO states:
> > > > >
> > > > > * By default, {@link TextIO.Read} returns a {@link PCollection}
> of
> > > > > {@link String Strings},
> > > > >  * each corresponding to one line of an input UTF-8 text file. To
> > > convert
> > > > > directly from the raw
> > > > >  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to
> > > another
> > > > > object of type {@code T},
> > > > >  * supply a {@code Coder} using {@link
> > > TextIO.Read#withCoder(Coder)}.
> > > > >
> > > > > However, as I stated, `withCoder` doesn't seem to have tests, and
> > > > probably
> > > > > won't work given the hard-coded '\n' delimiter.
> > > > >
> > > > > On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <
> > j...@nanthrax.net
> > > >
> > > > > wrote:

Re: TextIO binary file

2017-01-31 Thread Aviem Zur
So if I understand correctly, the general agreement is that TextIO should
not support anything but lines from files as strings.
I'll go ahead and file a ticket that says the Javadoc should be changed to
reflect this and `withCoder` method should be removed.

Is there merit for Beam to supply an IO which does allow writing objects to
a file using Beam coders and Beam FS (To write these files to
GS/Hadoop/Local)?

On Tue, Jan 31, 2017 at 2:28 AM Eugene Kirpichov
<kirpic...@google.com.invalid> wrote:

P.S. Note that this point (about coders) is also mentioned in the
now-being-reviewed PTransform Style Guide
https://github.com/apache/beam-site/pull/134
currently staged at
http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders


On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath <chamik...@apache.org>
wrote:

> +1 to what Eugene said.
>
> I've seen a number of Python SDK users incorrectly assuming that
> coder.decode() is needed when developing their own file-based sources
> (since many users usually refer to text source first). Probably coder
> parameter should not be configurable for text source/sink and they should
> be updated to only read/write UTF-8 encoded strings.
>
> - Cham
>
> On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov
> <kirpic...@google.com.invalid> wrote:
>
> > The use of Coder in TextIO is a long standing design issue because
coders
> > are not intended to be used for general purpose converting things from
> and
> > to bytes, their only proper use is letting the runner materialize and
> > restore objects if the runner thinks it's necessary. IMO it should have
> > been called LineIO, document that it reads lines of text as String, and
> not
> > have a withCoder parameter at all.
> >
> > The proper way to address your use case is to write a custom
> > FileBasedSource.
> > On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur <aviem...@gmail.com> wrote:
> >
> > > The Javadoc of TextIO states:
> > >
> > > * By default, {@link TextIO.Read} returns a {@link PCollection} of
> > > {@link String Strings},
> > >  * each corresponding to one line of an input UTF-8 text file. To
> convert
> > > directly from the raw
> > >  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to
> another
> > > object of type {@code T},
> > >  * supply a {@code Coder} using {@link
> TextIO.Read#withCoder(Coder)}.
> > >
> > > However, as I stated, `withCoder` doesn't seem to have tests, and
> > probably
> > > won't work given the hard-coded '\n' delimiter.
> > >
> > > On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <j...@nanthrax.net
> >
> > > wrote:
> > >
> > > > Hi Aviem,
> > > >
> > > > TextIO is not designed to write/read binary file: it's pure Text, so
> > > > String.
> > > >
> > > > Regards
> > > > JB
> > > >
> > > > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > > > > Hi,
> > > > >
> > > > > While trying to use TextIO to write/read a binary file rather than
> > > String
> > > > > lines from a textual file I ran into an issue - the delimiter
> TextIO
> > > uses
> > > > > seems to be hardcoded '\n'.
> > > > > See `findSeparatorBounds` -
> > > > >
> > > >
> > >
> >
>
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > > > >
> > > > > The use case is to have a file of objects, encoded into bytes
> using a
> > > > > coder. However, '\n' is not a good delimiter here, as you can
> > imagine.
> > > > > A similar pattern is found in Spark's `saveAsObjectFile`
> > > > >
> > > >
> > >
> >
>
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > > > where
> > > > > they use a more appropriate delimiter, to avoid such issues.
> > > > >
> > > > > I did not find any unit tests which use TextIO to read anything
> other
> > > > than
> > > > > Strings.
> > > > >
> > > >
> > > > --
> > > > Jean-Baptiste Onofré
> > > > jbono...@apache.org
> > > > http://blog.nanthrax.net
> > > > Talend - http://www.talend.com
> > > >
> > >
> >
>


Re: TextIO binary file

2017-01-30 Thread Aviem Zur
The Javadoc of TextIO states:

* By default, {@link TextIO.Read} returns a {@link PCollection} of
{@link String Strings},
 * each corresponding to one line of an input UTF-8 text file. To convert
directly from the raw
 * bytes (split into lines delimited by '\n', '\r', or '\r\n') to another
object of type {@code T},
 * supply a {@code Coder} using {@link TextIO.Read#withCoder(Coder)}.

However, as I stated, `withCoder` doesn't seem to have tests, and probably
won't work given the hard-coded '\n' delimiter.

On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Hi Aviem,
>
> TextIO is not designed to write/read binary file: it's pure Text, so
> String.
>
> Regards
> JB
>
> On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > Hi,
> >
> > While trying to use TextIO to write/read a binary file rather than String
> > lines from a textual file I ran into an issue - the delimiter TextIO uses
> > seems to be hardcoded '\n'.
> > See `findSeparatorBounds` -
> >
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> >
> > The use case is to have a file of objects, encoded into bytes using a
> > coder. However, '\n' is not a good delimiter here, as you can imagine.
> > A similar pattern is found in Spark's `saveAsObjectFile`
> >
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > where
> > they use a more appropriate delimiter, to avoid such issues.
> >
> > I did not find any unit tests which use TextIO to read anything other
> than
> > Strings.
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


TextIO binary file

2017-01-30 Thread Aviem Zur
Hi,

While trying to use TextIO to write/read a binary file rather than String
lines from a textual file, I ran into an issue - the delimiter TextIO uses
seems to be hard-coded to '\n'.
See `findSeparatorBounds` -
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024

The use case is to have a file of objects, encoded into bytes using a
coder. However, '\n' is not a good delimiter here, as you can imagine.
A similar pattern is found in Spark's `saveAsObjectFile`
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
where
they use a more appropriate delimiter, to avoid such issues.

I did not find any unit tests which use TextIO to read anything other than
Strings.
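
A common alternative to sentinel delimiters like '\n' is length-prefixed
framing, where each record is written as its byte length followed by its raw
bytes, so payloads may contain any byte value. The sketch below is plain
java.io for illustration only - it is not Beam's TextIO nor Spark's
saveAsObjectFile (which uses SequenceFiles with sync markers), and the class
and method names are assumptions:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LengthPrefixedRecords {

  // Encode each record as a 4-byte big-endian length followed by the raw
  // bytes, so payloads may freely contain '\n', '\r', or any other byte.
  static byte[] encode(List<byte[]> records) {
    try {
      ByteArrayOutputStream bos = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bos);
      for (byte[] record : records) {
        out.writeInt(record.length);
        out.write(record);
      }
      return bos.toByteArray();
    } catch (IOException e) {
      throw new RuntimeException(e); // cannot happen for in-memory streams
    }
  }

  // Decode by reading a length, then exactly that many bytes, until exhausted.
  static List<byte[]> decode(byte[] data) {
    try {
      List<byte[]> records = new ArrayList<>();
      DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
      while (in.available() > 0) {
        byte[] record = new byte[in.readInt()];
        in.readFully(record);
        records.add(record);
      }
      return records;
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    // A payload containing '\n' and '\r\n', which would break newline framing.
    byte[] binary = "a\nb\r\nc".getBytes(StandardCharsets.UTF_8);
    List<byte[]> roundTripped = decode(encode(Arrays.asList(binary)));
    if (roundTripped.size() != 1 || !Arrays.equals(binary, roundTripped.get(0))) {
      throw new AssertionError("round trip failed");
    }
    System.out.println("round-tripped 1 record intact");
  }
}
```

The point of the sketch is only that framing by length rather than by a
sentinel byte makes the record boundary unambiguous regardless of content.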


Pipeline graph reflection

2017-01-29 Thread Aviem Zur
Hi all,

While working on implementing metrics support in the Spark Runner a need
arose for composing a unique identifier of a transform, to differentiate it
from other transforms with the same name.

With the help of @bjchambers I understood that something similar to this
exists in the Dataflow Runner which creates a string that is something
along the lines of
"PBegin/SomeInputTransform/SomeParDo/...MyTransform.#Running_number_for_collisions".

I'm trying to figure out:
A) How this is done in Dataflow runner.
B) Can this be pulled up as a util for other runners? The conversation
regarding the metrics API and querying hints this will be needed.
C) From my own forays into the code I came across
`org.apache.beam.sdk.values.PValue#getProducingTransformInternal` which can
be recursed on but is marked as deprecated. Are there efforts being made
elsewhere for this sort of pipeline graph reflection?
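
The collision-numbered path scheme described above can be sketched with a
small helper. This is an illustrative assumption about the approach, not the
actual Dataflow runner implementation - the class name and separator choices
are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

public class UniqueStepNames {
  // Tracks how many times each full path has been produced so far.
  private final Map<String, Integer> seen = new HashMap<>();

  // Returns "parentPath/name" the first time that path is requested, and
  // appends a running number ("#2", "#3", ...) on collisions, so two
  // applications of the same transform get distinguishable identifiers.
  String uniqueName(String parentPath, String name) {
    String base = parentPath.isEmpty() ? name : parentPath + "/" + name;
    int count = seen.merge(base, 1, Integer::sum);
    return count == 1 ? base : base + "#" + count;
  }

  public static void main(String[] args) {
    UniqueStepNames names = new UniqueStepNames();
    // Applying the same named transform twice under the same parent:
    System.out.println(names.uniqueName("PBegin/Read", "MyParDo"));
    System.out.println(names.uniqueName("PBegin/Read", "MyParDo"));
  }
}
```

A scheme like this is what makes per-step metric queries unambiguous even
when the same ParDo is applied in several places in the pipeline.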


Re: [ANNOUNCEMENT] New committers, January 2017 edition!

2017-01-26 Thread Aviem Zur
Congrats!

On Fri, Jan 27, 2017, 06:25 Thomas Weise  wrote:

> Congrats!
>
>
> On Thu, Jan 26, 2017 at 7:49 PM, María García Herrero <
> mari...@google.com.invalid> wrote:
>
> > Congratulations and thank you for your contributions thus far!
> >
> > On Thu, Jan 26, 2017 at 6:00 PM, Robert Bradshaw <
> > rober...@google.com.invalid> wrote:
> >
> > > Welcome and congratulations!
> > >
> > > On Thu, Jan 26, 2017 at 5:05 PM, Sourabh Bajaj
> > >  wrote:
> > > > Congrats!!
> > > >
> > > > On Thu, Jan 26, 2017 at 5:02 PM Jason Kuster 
> > > > wrote:
> > > >
> > > >> Congrats all! Very exciting. :)
> > > >>
> > > >> On Thu, Jan 26, 2017 at 4:48 PM, Jesse Anderson <
> > je...@smokinghand.com>
> > > >> wrote:
> > > >>
> > > >> > Welcome!
> > > >> >
> > > >> > On Thu, Jan 26, 2017, 7:27 PM Davor Bonaci 
> > wrote:
> > > >> >
> > > >> > > Please join me and the rest of Beam PMC in welcoming the
> following
> > > >> > > contributors as our newest committers. They have significantly
> > > >> > contributed
> > > >> > > to the project in different ways, and we look forward to many
> more
> > > >> > > contributions in the future.
> > > >> > >
> > > >> > > * Stas Levin
> > > >> > > Stas has contributed across the breadth of the project, from the
> > > Spark
> > > >> > > runner to the core pieces and Java SDK. Looking at code
> > > contributions
> > > >> > > alone, he authored 43 commits and reported 25 issues. Stas is
> very
> > > >> active
> > > >> > > on the mailing lists too, contributing to good discussions and
> > > >> proposing
> > > >> > > improvements to the Beam model.
> > > >> > >
> > > >> > > * Ahmet Altay
> > > >> > > Ahmet is a major contributor to the Python SDK, both in terms of
> > > design
> > > >> > and
> > > >> > > code contribution. Looking at code contributions alone, he
> > authored
> > > 98
> > > >> > > commits and reviewed dozens of pull requests. With Python SDK’s
> > > >> imminent
> > > >> > > merge to the master branch, Ahmet contributed towards
> > establishing a
> > > >> new
> > > >> > > major component in Beam.
> > > >> > >
> > > >> > > * Pei He
> > > >> > > Pei has been contributing to Beam since its inception,
> > accumulating
> > > a
> > > >> > total
> > > >> > > of 118 commits since February. He has made several major
> > > contributions,
> > > >> > > most recently by redesigning IOChannelFactory / FileSystem APIs
> > (in
> > > >> > > progress), which would extend Beam’s portability to many
> > additional
> > > >> file
> > > >> > > systems and cloud providers.
> > > >> > >
> > > >> > > Congratulations to all three! Welcome!
> > > >> > >
> > > >> > > Davor
> > > >> > >
> > > >> >
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> ---
> > > >> Jason Kuster
> > > >> Apache Beam (Incubating) / Google Cloud Dataflow
> > > >>
> > >
> >
>


Re: Committed vs. attempted metrics results

2017-01-26 Thread Aviem Zur
'd be interesting to follow applicatively (I'd expect
> > the
> > > > runner/cluster to properly monitor up time of processes/nodes
> > > separately).
> > > > And even if it is useful, I can't think of other use cases.
> > > >
> > > > I thought the idea was to "declare" the Metrics guarantee level in
> the
> > > > query API, but the more I think about it the more I tend to let it go
> > for
> > > > the following reasons:
> > > >
> > > >- Setting aside Luke's example, I think users would prefer the
> best
> > > >guarantee a runner can provide. And on that note, I'd expect a
> > > > "getMetrics"
> > > >API and not have to figure out guarantees.
> > > >- Programmatic querying would "break"
> > (UnsupportedOperationException)
> > > >portability if a program that was running with a runner that
> > supports
> > > >committed() would try to execute on a runner that only supports
> > > > attempted()
> > > >- I know that portability is for the Pipeline and this is
> > > post-execution
> > > >but still, call it 25% portability issue ;-) .
> > > >- According to the Capability Matrix, all runners fail to provide
> > > >"commit" guarantee for Aggregators. I can only speak for Spark
> > saying
> > > > that
> > > >supporting the Metrics API relies on the same underlying mechanism
> > and
> > > > so
> > > >nothing will change. I wonder about other runners, anyone plans to
> > > > support
> > > >"commit" guarantees for Metrics soon ? having said that, not sure
> > this
> > > > is a
> > > >good reason not to have this as a placeholder.
> > > >
> > > > Another question for querying Metrics - querying by step could be a
> > > > bit tricky since a runner is expected to keep unique names/ids for
> > > > steps, but users are expected to be aware of this, and I suspect they
> > > > might not be; if they use the same ParDo in a couple of places and
> > > > query it, it might be confusing for them to see "double counts" they
> > > > didn't intend.
> > > >
> > > > Amit.
> > > >
> > > > On Thu, Jan 19, 2017 at 7:36 PM Ben Chambers
> > > <bchamb...@google.com.invalid
> > > > >
> > > > wrote:
> > > >
> > > > > Thanks for starting the discussion! I'm going to hold off saying
> > what I
> > > > > think and instead just provide some background and additional
> > > questions,
> > > > > because I want to see where the discussion goes.
> > > > >
> > > > > When I first suggested the API for querying metrics I was adding it
> > for
> > > > > parity with aggregators. A good first question might be does the
> > > pipeline
> > > > > result even need query methods? Runners could add them as necessary
> > > based
> > > > > on the levels of querying the support.
> > > > >
> > > > > The other desire was to make the accuracy clear. One implementation
> > > path
> > > > > was reporting metrics directly from the workers while attempting
> > work.
> > > > This
> > > > > can overcount when retrying and may be under the actual attempts if
> > the
> > > > > worker lost connectivity before reporting.
> > > > >
> > > > > Another implementation was something like a side output where the
> > > counts
> > > > > are committed as part of each bundles results, and then aggregated.
> > > This
> > > > > committed value is more accurate and represents the value that
> > occurred
> > > > > along the success path of the pipeline.
> > > > >
> > > > > I suspect there are other possible implementations so trying to
> make
> > an
> > > > API
> > > > > that expresses all of them is difficult. So:
> > > > >
> > > > > 1. Does pipeline result need to support querying (which is useful
> for
> > > > > programmatic consumption) or are metrics intended only to get
> values
> > > out
> > > > of
> > > &
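
The portability concern raised in this thread - code written against a
committed() query breaking on a runner that only supports attempted values -
can be made concrete with a small self-contained sketch. The class name and
shape below are assumptions for illustration, not Beam's actual MetricResult
interface:

```java
// Illustrative model of the two guarantee levels under discussion: attempted
// values (reported from workers while attempting bundles, so they may
// overcount on retries) and committed values (aggregated only from
// successfully committed bundles), where a runner without committed support
// surfaces UnsupportedOperationException.
public class MetricResultSketch {
  private final long attempted;
  private final Long committed; // null when the runner only tracks attempted

  public MetricResultSketch(long attempted, Long committed) {
    this.attempted = attempted;
    this.committed = committed;
  }

  /** Best-effort value; always available. */
  public long attempted() {
    return attempted;
  }

  /** Success-path value; unsupported on some runners. */
  public long committed() {
    if (committed == null) {
      throw new UnsupportedOperationException(
          "This runner only supports attempted metric values");
    }
    return committed;
  }

  public static void main(String[] args) {
    MetricResultSketch withCommit = new MetricResultSketch(12, 10L);
    System.out.println(
        "attempted=" + withCommit.attempted() + " committed=" + withCommit.committed());

    MetricResultSketch attemptedOnly = new MetricResultSketch(12, null);
    try {
      attemptedOnly.committed();
    } catch (UnsupportedOperationException e) {
      // Exactly the "25% portability issue" described above: a program that
      // queries committed() fails when moved to an attempted-only runner.
      System.out.println("committed unsupported: " + e.getMessage());
    }
  }
}
```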