Re: [VOTE] Policies for managing Beam dependencies

2018-06-11 Thread Bashir Sadjad
FWIW, I also think that this has relevance for users. I am a user of Beam,
not a contributor, and only monitor this list at a high level. But the
dependency issue is something that many users have to deal with. It has
bitten us at least twice over the last few months, because we depend on
other libraries too and sometimes get version conflicts (which is one of
the issues highlighted in the doc

Cham shared). I usually go through file histories on GitHub to try to
figure out why a certain version requirement is there. It would be nice if
the reasons were maintained at a higher level that is easier for users to
consume.
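The conflict situation described above, where two components (or Beam and a user's own dependencies) pin different versions of the same library, can be sketched in a few lines. This is a toy illustration; the function and the component names below are made up for the example, not Beam tooling:

```python
def find_version_conflicts(component_deps):
    """Return {library: {(component, version), ...}} for every library
    that two or more components pin at different versions.

    `component_deps` maps a component name to its {library: version} pins.
    Purely illustrative: real build tools resolve conflicts with more
    nuanced rules (nearest-wins in Maven, highest-version in Gradle).
    """
    pins = {}  # library -> set of (component, version) pairs seen so far
    for component, deps in component_deps.items():
        for lib, version in deps.items():
            pins.setdefault(lib, set()).add((component, version))
    # keep only libraries pinned at more than one distinct version
    return {lib: who for lib, who in pins.items()
            if len({v for _, v in who}) > 1}

conflicts = find_version_conflicts({
    "beam-sdks-java-core": {"guava": "20.0"},
    "beam-sdks-java-io-gcp": {"guava": "19.0"},
    "my-pipeline": {"guava": "24.0"},
})
print(sorted(conflicts))  # ['guava']
```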

Cheers

-B

On Tue, Jun 12, 2018 at 12:19 AM Ahmet Altay  wrote:

> I think this is relevant for users. It makes sense for users to know how
> Beam works with its dependencies, and to understand how conflicts will be
> addressed and when dependencies will be upgraded.
>
> On Mon, Jun 11, 2018 at 9:09 PM, Kenneth Knowles  wrote:
>
>> Do you think this has relevance for users?
>>
>> If not, it might be a good use of the new Confluence space. I'm not too
>> familiar with the way permissions work, but perhaps we can have a more
>> locked-down area for policy decisions like this.
>>
>> Kenn
>>
>> On Mon, Jun 11, 2018 at 3:58 PM Chamikara Jayalath 
>> wrote:
>>
>>> Hi All,
>>>
>>> Based on the vote (3 PMC +1s and no -1s) and based on the discussions in
>>> the doc (seems to be mostly positive), I think we can go ahead and
>>> implement some of the policies discussed so far.
>>>
>>> I have given some of the potential action items below.
>>>
>>> * Automatically generate human-readable reports on the status of Beam
>>> dependencies weekly and share them on the dev list.
>>> * Create JIRAs for significantly outdated dependencies based on the above
>>> reports.
>>> * Copy some of the component-level dependency version declarations to the
>>> top level.
>>> * Try to identify owners for dependencies and specify them in comments
>>> close to the dependency declarations.
>>> * Vendor any dependencies that can cause issues if leaked to other
>>> components.
>>> * Add the policies discussed so far to the website, along with the
>>> reasoning (from the doc).
>>>
>>> Of course, I'm happy to refine or add to these policies as needed.
>>>
>>> Thanks,
>>> Cham
>>>
>>>
>>> On Thu, Jun 7, 2018 at 9:40 AM Lukasz Cwik  wrote:
>>>
 +1

 On Thu, Jun 7, 2018 at 5:18 AM Kenneth Knowles  wrote:

> +1 to these. Thanks for clarifying!
>
> Kenn
>
> On Wed, Jun 6, 2018 at 10:40 PM Chamikara Jayalath <
> chamik...@google.com> wrote:
>
>> Hi Kenn,
>>
>> On Wed, Jun 6, 2018 at 8:14 PM Kenneth Knowles 
>> wrote:
>>
>>> +0.5
>>>
>>> I like the spirit of these policies. I think they need a little
>>> wording work. Comments inline.
>>>
>>> On Wed, Jun 6, 2018 at 4:53 PM, Chamikara Jayalath <
 chamik...@google.com> wrote:
>
>
> (1) Human readable reports on status of Beam dependencies are
> generated weekly and shared with the Beam community through the dev 
> list.
>

>>> Who is responsible for generating these? The mechanism or
>>> responsibility should be made clear.
>>>
>>> I clicked through a doc -> thread -> doc to find even some details.
>>> It looks like manual run of a gradle command was adopted. So the
>>> responsibility needs an owner, even if it is "unspecified volunteer on 
>>> dev@
>>> and feel free to complain or do it yourself if you don't see it"
>>>
>>
>> This is described in the following doc (referenced by my doc).
>>
>> https://docs.google.com/document/d/1rqr_8a9NYZCgeiXpTIwWLCL7X8amPAVfRXsO72BpBwA/edit#
>>
>> The proposal is to run an automated weekly Jenkins job, so there is no
>> need for anyone to generate these reports manually.
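Once the dependency data is collected, the human-readable report such a weekly job would mail out is mostly a small formatting step. A minimal sketch; the function name and the report layout are illustrative assumptions, not the proposal's actual format:

```python
def render_dependency_report(rows):
    """Render a plain-text dependency status report of the kind a
    weekly job could post to dev@. `rows` is a list of
    (dependency, current_version, latest_version) tuples."""
    header = f"{'Dependency':<30}{'Current':<12}{'Latest':<12}"
    lines = [header, "-" * len(header)]
    for dep, current, latest in rows:
        # mark rows where a newer version than the pinned one exists
        marker = "  <-- outdated" if current != latest else ""
        lines.append(f"{dep:<30}{current:<12}{latest:<12}{marker}")
    return "\n".join(lines)

print(render_dependency_report([
    ("com.google.guava:guava", "20.0", "25.1-jre"),
    ("org.apache.avro:avro", "1.8.2", "1.8.2"),
]))
```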
>>
>>
>>>
>>> (2) Beam components should define dependencies and their versions at
> the top level.
>

>>> I think the big "should" works better with some guidance about when
>>> something might be an exception, or at least explicit mention that there
>>> can be rare exceptions. Unless you think that is never the case. If 
>>> there
>>> are no exceptions, then say "must" and if we hit a roadblock we can 
>>> revisit
>>> the policy.
>>>
>>
>> The idea was to allow exceptions. Added more details to the doc.
>>
>>
>>>
>>>
>>> (3) A significantly outdated dependency (identified manually or
> through tooling) should result in a JIRA that is a blocker for the 
> next
> release. Release manager may choose to push the blocker to the 
> subsequent
> release or downgrade from a blocker.
>

>>> How is "significantly outdated" defined? By dev@ discussion? Seems

Re: [VOTE] Apache Beam, version 2.5.0, release candidate #1

2018-06-11 Thread Ahmet Altay
Thank you JB.

For the wheel artifacts, Boyuan was trying to get the instructions from
Robert and reproduce the artifacts. She can help you with this if you need it.

Ahmet

On Mon, Jun 11, 2018 at 10:29 PM, Jean-Baptiste Onofré 
wrote:

> Hi,
>
> sorry, I missed the wheel artifacts. Something to add to the release guide ;)
>
> I will add them this morning, I think I know how to generate them ;)
>
> Regards
> JB
>
> On 12/06/2018 02:45, Pablo Estrada wrote:
> > Thanks everyone who has pitched in to validate the release!
> >
> > Boyuan Zhang and I have also run a few pipelines, and verified that they
> > work properly (see release validation spreadsheet[1]).
> >
> > We have also found that the Game Stats pipeline is failing in Python
> > Streaming Dataflow. I have filed BEAM-4534[2]. This is not a blocker,
> > since Python streaming is not yet fully supported.
> >
> > It seems that the uploaded artifacts look good.
> >
> > We have noticed that the Python artifacts are still missing Python wheel
> > files (compare [3] and [4]). JB, could you please add the wheel files?
> > Boyuan and I can try to help you prepare them / upload them if
> > necessary. Please let us know.
> >
> > Thanks again!
> > -P.
> >
> > [1] https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=152451807
> > [2] https://issues.apache.org/jira/browse/BEAM-4534
> > [3] https://dist.apache.org/repos/dist/dev/beam/2.4.0/
> > [4] https://dist.apache.org/repos/dist/dev/beam/2.5.0/
> >
> > On Mon, Jun 11, 2018 at 12:37 PM Alan Myrvold wrote:
> >
> > +1 (non-binding)
> >
> > tested some of the quickstarts
> >
> > On Sun, Jun 10, 2018 at 1:39 AM Tim wrote:
> >
> > Tested by our team:
> > - mvn inclusion
> > - Avro, ES, Hadoop IF IO
> > - Pipelines run on Spark (Cloudera 5.12.0 YARN cluster)
> > - Reviewed release notes
> >
> > +1
> >
> > Thanks also to everyone who helped get over the gradle hurdle
> > and in particular to JB.
> >
> > Tim
> >
> > > On 9 Jun 2018, at 05:56, Jean-Baptiste Onofré wrote:
> > >
> > > No problem Pablo.
> > >
> > > The vote period is a minimum, it can be extended as requested
> > or if we
> > > don't have the minimum of 3 binding votes.
> > >
> > > Regards
> > > JB
> > >
> > >> On 09/06/2018 01:54, Pablo Estrada wrote:
> > >> Hello all,
> > >> I'd like to request an extension of the voting period until
> > Monday
> > >> evening (US time, so later in other geographical regions).
> > This is
> > >> because we were only now able to publish Dataflow Workers,
> > and have not
> > >> had the chance to run release validation tests on them. The
> > extension
> > >> will allow us to validate and vote by Monday.
> > >> Is this acceptable to the community?
> > >>
> > >> Best
> > >> -P.
> > >>
> > >> On Fri, Jun 8, 2018 at 6:20 AM Alexey Romanenko wrote:
> > >>
> > >>Thank you JB for your work!
> > >>
> > >>I tested running simple streaming (/KafkaIO/) and batch
> > (/TextIO /
> > >>HDFS/) pipelines with SparkRunner on YARN cluster - it
> > works fine.
> > >>
> > >>WBR,
> > >>Alexey
> > >>
> > >>
> > >>>    On 8 Jun 2018, at 10:00, Etienne Chauchot wrote:
> > >>>
> > >>>I forgot to vote:
> > >>>+1 (non binding).
> > >>>What I tested:
> > >>>- no functional or performance regression comparing to
> v2.4
> > >>>- dependencies in the poms are ok
> > >>>
> > >>>Etienne
> > Le vendredi 08 juin 2018 à 08:27 +0200, Romain
> > Manni-Bucau a écrit :
> > +1 (non-binding), mainstream usage is not broken by the
> pom
> > changes and runtime has no known regression compared to
> > the 2.4.0
> > 
> > (side note: kudo to JB for this build tool change
> > release, I know
> > how it can hurt ;))
> > 
> > Romain Manni-Bucau
> > @rmannibucau  |  Blog
> >  | Old Blog
> >  | Github
> >  | LinkedIn

Re: [VOTE] Apache Beam, version 2.5.0, release candidate #1

2018-06-11 Thread Jean-Baptiste Onofré
Hi,

sorry, I missed the wheel artifacts. Something to add to the release guide ;)

I will add them this morning, I think I know how to generate them ;)

Regards
JB

Re: [VOTE] Apache Beam, version 2.5.0, release candidate #1

2018-06-11 Thread Jean-Baptiste Onofré
Hi,

no problem, I can cut RC2 as soon as the cherry-pick is done.

Thanks for catching this!

Please let me know when the cherry-pick is done, or you can open the PR
and I will do the cherry-pick, up to you.

Regards
JB

On 12/06/2018 04:02, Udi Meiri wrote:
> Another bug: reading from PubSub with_attributes=True is broken on
> Python with Dataflow.
> https://issues.apache.org/jira/browse/BEAM-4536
> 
> JB, I'm making a PR that removes this keyword and I'd like to propose it
> as a cherrypick to 2.5.0.
> (feature should be fixed in the next release)
> 
> On Mon, Jun 11, 2018 at 6:19 PM Chamikara Jayalath wrote:
> 
> FYI: looks like Python tests are failing for Windows. JIRA
> is https://issues.apache.org/jira/browse/BEAM-4535.
> 
> I don't think this is a release blocker but this should probably go
> in release notes (for any user that tries to run tests on Python
> source build). And we should try to incorporate a fix if we happen
> to cut another release candidate for some reason.
> 
> Thanks,
> Cham
> 

Re: [VOTE] Policies for managing Beam dependencies

2018-06-11 Thread Ahmet Altay
I think this is relevant for users. It makes sense for users to know how
Beam works with its dependencies, and to understand how conflicts will be
addressed and when dependencies will be upgraded.

>> (3) A significantly outdated dependency (identified manually or
 through tooling) should result in a JIRA that is a blocker for the next
 release. Release manager may choose to push the blocker to the 
 subsequent
 release or downgrade from a blocker.

>>>
>> How is "significantly outdated" defined? By dev@ discussion? Seems
>> like the right way. Anyhow that's what will happen in practice as people
>> debate the blocker bug.
>>
>
> This will be either through the automated Jenkins job (see the doc
> above, where the proposal is to flag new major versions and new minor
> versions that are more than six months old) or manually (for any critical
> updates that will not be captured by the Jenkins job) (more details in the
> doc). Manually identified critical dependency updates may involve a
> discussion in the dev list.
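The flagging rule described above (a new major version, or a new minor version that has been out for more than six months) is simple enough to express directly. A minimal sketch, assuming versions are modeled as (major, minor) tuples; the names and the 183-day cutoff are illustrative, not the actual tooling:

```python
from datetime import date, timedelta

def significantly_outdated(current, latest, latest_release_date, today):
    """Apply the proposed staleness rule: flag when a new major version
    exists, or when a newer minor version has been available for more
    than roughly six months."""
    if latest[0] > current[0]:
        return True  # any new major version is flagged immediately
    if latest[0] == current[0] and latest[1] > current[1]:
        # new minor version: flag only once it is over six months old
        return today - latest_release_date > timedelta(days=183)
    return False

# A minor version released in Nov 2017 is over six months old by June 2018
print(significantly_outdated((2, 3), (2, 5), date(2017, 11, 1),
                             date(2018, 6, 11)))  # True
```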
>
>
>>
>>
>> (4) Dependency declarations may identify owners that are responsible
 for upgrading the respective dependencies.

 (5) Dependencies of Java SDK components that may cause issues to
 other components if leaked should be shaded.

>>>
>> We previously agreed upon our intent to migrate to "pre-shaded" aka
>> 

Re: [VOTE] Policies for managing Beam dependencies

2018-06-11 Thread Kenneth Knowles
Do you think this has relevance for users?

If not, it might be a good use of the new Confluence space. I'm not too
familiar with the way permissions work, but perhaps we can have a more
locked-down area for policy decisions like this.

Kenn

>>> (5) Dependencies of Java SDK components that may cause issues to
>>> other components if leaked should be shaded.
>>>
>>
> We previously agreed upon our intent to migrate to "pre-shaded" aka
> "vendored" packages:
> https://lists.apache.org/thread.html/12383d2e5d70026427df43294e30d6524334e16f03d86c9a5860792f@%3Cdev.beam.apache.org%3E
>
> With Maven, this involved a lot of boilerplate so I never did it. With
> Gradle, we can easily build a re-usable rule to create such a package in a
> couple of lines. I just opened the first WIP PR here:
> https://github.com/apache/beam/pull/5570 it is blocked 

Re: [VOTE] Apache Beam, version 2.5.0, release candidate #1

2018-06-11 Thread Udi Meiri
Another bug: reading from PubSub with_attributes=True is broken on Python
with Dataflow.
https://issues.apache.org/jira/browse/BEAM-4536

JB, I'm making a PR that removes this keyword, and I'd like to propose it as
a cherry-pick to 2.5.0.
(The feature should be fixed in the next release.)


Re: [VOTE] Apache Beam, version 2.5.0, release candidate #1

2018-06-11 Thread Chamikara Jayalath
FYI: it looks like the Python tests are failing on Windows. The JIRA is
https://issues.apache.org/jira/browse/BEAM-4535.

I don't think this is a release blocker, but it should probably go in the
release notes (for any user who tries to run tests on a Python source
build). And we should try to incorporate a fix if we happen to cut another
release candidate for some reason.

Thanks,
Cham

On Mon, Jun 11, 2018 at 5:46 PM Pablo Estrada  wrote:

> Thanks everyone who has pitched in to validate the release!
>
> Boyuan Zhang and I have also run a few pipelines, and verified that they
> work properly (see release validation spreadsheet[1]).
>
> We have also found that the Game Stats pipeline is failing in Python
> Streaming Dataflow. I have filed BEAM-4534[2]. This is not a blocker, since
> Python streaming is not yet fully supported.
>
> It seems that the uploaded artifacts look good.
>
> We have noticed that the Python artifacts are still missing Python wheel
> files (compare [3] and [4]). JB, could you please add the wheel files?
> Boyuan and I can try to help you prepare them / upload them if necessary.
> Please let us know.
>
> Thanks again!
> -P.
>
> [1]
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=152451807
> [2] https://issues.apache.org/jira/browse/BEAM-4534
> [3] https://dist.apache.org/repos/dist/dev/beam/2.4.0/
> [4] https://dist.apache.org/repos/dist/dev/beam/2.5.0/
>
> On Mon, Jun 11, 2018 at 12:37 PM Alan Myrvold  wrote:
>
>> +1 (non-binding)
>>
>> tested some of the quickstarts
>>
>> On Sun, Jun 10, 2018 at 1:39 AM Tim  wrote:
>>
>>> Tested by our team:
>>> - mvn inclusion
>>> - Avro, ES, Hadoop IF IO
>>> - Pipelines run on Spark (Cloudera 5.12.0 YARN cluster)
>>> - Reviewed release notes
>>>
>>> +1
>>>
>>> Thanks also to everyone who helped get over the gradle hurdle and in
>>> particular to JB.
>>>
>>> Tim
>>>
>>> > On 9 Jun 2018, at 05:56, Jean-Baptiste Onofré  wrote:
>>> >
>>> > No problem Pablo.
>>> >
>>> > The vote period is a minimum, it can be extended as requested or if we
>>> > don't have the minimum of 3 binding votes.
>>> >
>>> > Regards
>>> > JB
>>> >
>>> >> On 09/06/2018 01:54, Pablo Estrada wrote:
>>> >> Hello all,
>>> >> I'd like to request an extension of the voting period until Monday
>>> >> evening (US time, so later in other geographical regions). This is
>>> >> because we were only now able to publish Dataflow Workers, and have
>>> not
>>> >> had the chance to run release validation tests on them. The extension
>>> >> will allow us to validate and vote by Monday.
>>> >> Is this acceptable to the community?
>>> >>
>>> >> Best
>>> >> -P.
>>> >>
>>> >> On Fri, Jun 8, 2018 at 6:20 AM Alexey Romanenko
>>> >> mailto:aromanenko@gmail.com>> wrote:
>>> >>
>>> >>Thank you JB for your work!
>>> >>
>>> >>I tested running simple streaming (KafkaIO) and batch (TextIO /
>>> >>HDFS) pipelines with SparkRunner on a YARN cluster - it works fine.
>>> >>
>>> >>WBR,
>>> >>Alexey
>>> >>
>>> >>
>>> >>>On 8 Jun 2018, at 10:00, Etienne Chauchot >> >>>> wrote:
>>> >>>
>>> >>>I forgot to vote:
>>> >>>+1 (non binding).
>>> >>>What I tested:
>>> >>>- no functional or performance regression compared to v2.4
>>> >>>- dependencies in the poms are ok
>>> >>>
>>> >>>Etienne
>>> Le vendredi 08 juin 2018 à 08:27 +0200, Romain Manni-Bucau a
>>> écrit :
>>> +1 (non-binding), mainstream usage is not broken by the pom
>>> changes and runtime has no known regression compared to the 2.4.0
>>> 
>>> (side note: kudos to JB for this build tool change release, I know
>>> how it can hurt ;))
>>> 
>>> Romain Manni-Bucau
>>> @rmannibucau  |  Blog
>>>  | Old Blog
>>>  | Github
>>>  | LinkedIn
>>>  | Book
>>> <
>>> https://www.packtpub.com/application-development/java-ee-8-high-performance
>>> >
>>> 
>>> 
>>> Le jeu. 7 juin 2018 à 16:17, Jean-Baptiste Onofré
>>> mailto:j...@nanthrax.net>> a écrit :
>>> >Thanks for the details Etienne !
>>> >
>>> >The good news is that the artifacts seem OK and the overall
>>> Nexmark
>>> >results are consistent with the 2.4.0 release ones.
>>> >
>>> >I'm starting a complete review using the beam-samples as well.
>>> >
>>> >Regards
>>> >JB
>>> >
>>> >>On 07/06/2018 16:14, Etienne Chauchot wrote:
>>> >> Hi,
>>> >> I've just run the nexmark queries on v2.5.0-RC1 tag
>>> >> What we can notice:
>>> >> - query 3 (exercises CoGroupByKey, state and timer) shows
>>> >different
>>> >> output with DR between batch and streaming and with the other
>>> >runners =>
>>> >> I 

Re: [VOTE] Apache Beam, version 2.5.0, release candidate #1

2018-06-11 Thread Pablo Estrada
Thanks everyone who has pitched in to validate the release!

Boyuan Zhang and I have also run a few pipelines, and verified that they
work properly (see release validation spreadsheet[1]).

We have also found that the Game Stats pipeline is failing in Python
Streaming Dataflow. I have filed BEAM-4534[2]. This is not a blocker, since
Python streaming is not yet fully supported.

It seems that the uploaded artifacts look good.

We have noticed that the Python artifacts are still missing Python wheel
files (compare [3] and [4]). JB, could you please add the wheel files?
Boyuan and I can try to help you prepare them / upload them if necessary.
Please let us know.

Thanks again!
-P.

[1]
https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=152451807
[2] https://issues.apache.org/jira/browse/BEAM-4534
[3] https://dist.apache.org/repos/dist/dev/beam/2.4.0/
[4] https://dist.apache.org/repos/dist/dev/beam/2.5.0/

On Mon, Jun 11, 2018 at 12:37 PM Alan Myrvold  wrote:

> +1 (non-binding)
>
> tested some of the quickstarts
>
> On Sun, Jun 10, 2018 at 1:39 AM Tim  wrote:
>
>> Tested by our team:
>> - mvn inclusion
>> - Avro, ES, Hadoop IF IO
>> - Pipelines run on Spark (Cloudera 5.12.0 YARN cluster)
>> - Reviewed release notes
>>
>> +1
>>
>> Thanks also to everyone who helped get over the gradle hurdle and in
>> particular to JB.
>>
>> Tim
>>
>> > On 9 Jun 2018, at 05:56, Jean-Baptiste Onofré  wrote:
>> >
>> > No problem Pablo.
>> >
>> > The vote period is a minimum; it can be extended on request or if we
>> > don't have the minimum of 3 binding votes.
>> >
>> > Regards
>> > JB
>> >
>> >> On 09/06/2018 01:54, Pablo Estrada wrote:
>> >> Hello all,
>> >> I'd like to request an extension of the voting period until Monday
>> >> evening (US time, so later in other geographical regions). This is
>> >> because we were only now able to publish Dataflow Workers, and have not
>> >> had the chance to run release validation tests on them. The extension
>> >> will allow us to validate and vote by Monday.
>> >> Is this acceptable to the community?
>> >>
>> >> Best
>> >> -P.
>> >>
>> >> On Fri, Jun 8, 2018 at 6:20 AM Alexey Romanenko
>> >> mailto:aromanenko@gmail.com>> wrote:
>> >>
>> >>Thank you JB for your work!
>> >>
>> >>I tested running simple streaming (KafkaIO) and batch (TextIO /
>> >>HDFS) pipelines with SparkRunner on a YARN cluster - it works fine.
>> >>
>> >>WBR,
>> >>Alexey
>> >>
>> >>
>> >>>On 8 Jun 2018, at 10:00, Etienne Chauchot > >>>> wrote:
>> >>>
>> >>>I forgot to vote:
>> >>>+1 (non binding).
>> >>>What I tested:
>> >>>- no functional or performance regression compared to v2.4
>> >>>- dependencies in the poms are ok
>> >>>
>> >>>Etienne
>> Le vendredi 08 juin 2018 à 08:27 +0200, Romain Manni-Bucau a
>> écrit :
>> +1 (non-binding), mainstream usage is not broken by the pom
>> changes and runtime has no known regression compared to the 2.4.0
>> 
>> (side note: kudos to JB for this build tool change release, I know
>> how it can hurt ;))
>> 
>> Romain Manni-Bucau
>> @rmannibucau  |  Blog
>>  | Old Blog
>>  | Github
>>  | LinkedIn
>>  | Book
>> <
>> https://www.packtpub.com/application-development/java-ee-8-high-performance
>> >
>> 
>> 
>> Le jeu. 7 juin 2018 à 16:17, Jean-Baptiste Onofré
>> mailto:j...@nanthrax.net>> a écrit :
>> >Thanks for the details Etienne !
>> >
>> >The good news is that the artifacts seem OK and the overall
>> Nexmark
>> >results are consistent with the 2.4.0 release ones.
>> >
>> >I'm starting a complete review using the beam-samples as well.
>> >
>> >Regards
>> >JB
>> >
>> >>On 07/06/2018 16:14, Etienne Chauchot wrote:
>> >> Hi,
>> >> I've just run the nexmark queries on v2.5.0-RC1 tag
>> >> What we can notice:
>> >> - query 3 (exercises CoGroupByKey, state and timer) shows
>> >different
>> >> output with DR between batch and streaming and with the other
>> >runners =>
>> >> I compared with v2.4 there were still these differences but with
>> >> different output size numbers
>> >>
>> >> - query 6 (exercises specialized combiner) shows different output
>> >> between the runners => the correct output is 401. Strange that
>> >in batch
>> >> mode some runners output less Sellers. I compared with v2.4;
>> >same output
>> >>
>> >> - response time of query 7 (exercises Max transform, fanout
>> >and side
>> >> input) is very slow on DR => I compared with v2.4 , comparable
>> >execution
>> >> times

Re: [VOTE] Policies for managing Beam dependencies

2018-06-11 Thread Chamikara Jayalath
Hi All,

Based on the vote (3 PMC +1s and no -1s) and based on the discussions in
the doc (seems to be mostly positive), I think we can go ahead and
implement some of the policies discussed so far.

I have listed some of the potential action items below.

* Automatically generate human-readable reports on the status of Beam
dependencies weekly and share them on the dev list.
* Create JIRAs for significantly outdated dependencies based on the above
reports.
* Copy some of the component-level dependency version declarations to the top
level.
* Try to identify owners for dependencies and specify owners in comments
close to dependency declarations.
* Vendor any dependencies that can cause issues if leaked to other
components.
* Add the policies discussed so far to the website along with the reasoning
(from the doc).
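To make the "Create JIRAs for significantly outdated dependencies" item concrete, here is a minimal sketch of the rule discussed further down this thread (flag a dependency when a new major version exists, or when a newer minor version has been out for more than six months). The class and method names are hypothetical, version parsing is deliberately simplified to "major.minor", and the real tooling is the Gradle/Jenkins job referenced in the linked doc:

```java
import java.time.LocalDate;
import java.time.Period;

// Illustrative only: OutdatedCheck and isSignificantlyOutdated are
// hypothetical names, not part of the actual Beam tooling.
public class OutdatedCheck {

    // A dependency is "significantly outdated" if a new major version exists,
    // or if a newer minor version was released more than six months ago.
    static boolean isSignificantlyOutdated(String current, String latest,
                                           LocalDate latestReleased,
                                           LocalDate today) {
        int[] cur = parse(current);
        int[] lat = parse(latest);
        if (lat[0] > cur[0]) {
            return true; // a new major version is available
        }
        boolean newerMinor = lat[0] == cur[0] && lat[1] > cur[1];
        return newerMinor
            && Period.between(latestReleased, today).toTotalMonths() > 6;
    }

    // Simplified "major.minor" parsing; real version schemes are messier.
    private static int[] parse(String version) {
        String[] parts = version.split("\\.");
        return new int[] {Integer.parseInt(parts[0]), Integer.parseInt(parts[1])};
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2018, 6, 11);
        // New major version: flagged regardless of age.
        System.out.println(isSignificantlyOutdated(
            "1.2", "2.0", LocalDate.of(2018, 6, 1), today)); // true
        // Newer minor, eight months old: flagged.
        System.out.println(isSignificantlyOutdated(
            "1.2", "1.3", LocalDate.of(2017, 10, 1), today)); // true
        // Newer minor, one month old: not flagged yet.
        System.out.println(isSignificantlyOutdated(
            "1.2", "1.3", LocalDate.of(2018, 5, 1), today)); // false
    }
}
```

The weekly report job would run a check along these lines over each top-level dependency declaration and open a JIRA for anything flagged.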

Of course, I'm happy to refine or add to these polices as needed.

Thanks,
Cham


On Thu, Jun 7, 2018 at 9:40 AM Lukasz Cwik  wrote:

> +1
>
> On Thu, Jun 7, 2018 at 5:18 AM Kenneth Knowles  wrote:
>
>> +1 to these. Thanks for clarifying!
>>
>> Kenn
>>
>> On Wed, Jun 6, 2018 at 10:40 PM Chamikara Jayalath 
>> wrote:
>>
>>> Hi Kenn,
>>>
>>> On Wed, Jun 6, 2018 at 8:14 PM Kenneth Knowles  wrote:
>>>
 +0.5

 I like the spirit of these policies. I think they need a little wording
 work. Comments inline.

 On Wed, Jun 6, 2018 at 4:53 PM, Chamikara Jayalath <
> chamik...@google.com> wrote:
>>
>>
>> (1) Human-readable reports on the status of Beam
>> generated weekly and shared with the Beam community through the dev list.
>>
>
 Who is responsible for generating these? The mechanism or
 responsibility should be made clear.

 I clicked through a doc -> thread -> doc to find even some details. It
 looks like manual run of a gradle command was adopted. So the
 responsibility needs an owner, even if it is "unspecified volunteer on dev@
 and feel free to complain or do it yourself if you don't see it"

>>>
>>> This is described in following doc (referenced by my doc).
>>>
>>> https://docs.google.com/document/d/1rqr_8a9NYZCgeiXpTIwWLCL7X8amPAVfRXsO72BpBwA/edit#
>>>
>>> The proposal is to run an automated Jenkins job weekly, so there is no
>>> need for someone to manually generate these reports.
>>>
>>>

 (2) Beam components should define dependencies and their versions at
>> the top level.
>>
>
 I think the big "should" works better with some guidance about when
 something might be an exception, or at least explicit mention that there
 can be rare exceptions. Unless you think that is never the case. If there
 are no exceptions, then say "must" and if we hit a roadblock we can revisit
 the policy.

>>>
>>> The idea was to allow exceptions. Added more details to the doc.
>>>
>>>


 (3) A significantly outdated dependency (identified manually or through
>> tooling) should result in a JIRA that is a blocker for the next release.
>> The release manager may choose to push the blocker to the subsequent release 
>> or
>> downgrade it from a blocker.
>>
>
 How is "significantly outdated" defined? By dev@ discussion? Seems
 like the right way. Anyhow that's what will happen in practice as people
 debate the blocker bug.

>>>
>>> This will be either through the automated Jenkins job (see the doc
>>> above, where the proposal is to flag new major versions and new minor
>>> versions that are more than six months old) or manually (for any critical
>>> updates that will not be captured by the Jenkins job) (more details in the
>>> doc). Manually identified critical dependency updates may involve a
>>> discussion in the dev list.
>>>
>>>


 (4) Dependency declarations may identify owners that are responsible
>> for upgrading the respective dependencies.
>>
>> (5) Dependencies of Java SDK components that may cause issues to
>> other components if leaked should be shaded.
>>
>
 We previously agreed upon our intent to migrate to "pre-shaded" aka
 "vendored" packages:
 https://lists.apache.org/thread.html/12383d2e5d70026427df43294e30d6524334e16f03d86c9a5860792f@%3Cdev.beam.apache.org%3E

 With Maven, this involved a lot of boilerplate so I never did it. With
 Gradle, we can easily build a re-usable rule to create such a package in a
 couple of lines. I just opened the first WIP PR here:
 https://github.com/apache/beam/pull/5570 it is blocked by deleting the
 poms anyhow so by then we should have a configuration that works to vendor
 our currently shaded artifacts.

 So I think this should be rephrased to "should be vendored" so we don't
 have to revise the policy.

>>>
>>> Thanks for the pointer. I agree that vendoring is a good approach.
>>>
>>> Here are the updated policies (and more details added to doc). I agree
>>> with Ahmet's point that votes should be converted to web sites where we 

Re: [DISCUSS] Use Confluence wiki for non-user-facing stuff

2018-06-11 Thread Mikhail Gryzykhin
+1 for having a wiki.

One comment:
Is there a specific reason for it to be Confluence?

I know that we use JIRA by Atlassian, but we also use GitHub, which has
its own wiki engine that is closer to the code and much more lightweight.

I would prefer a wiki hosted on GitHub.

--Mikhail



On Mon, Jun 11, 2018 at 2:40 PM Kenneth Knowles  wrote:

> OK, yea, that all makes sense to me. Like this?
>
>  - site/documentation: writing just for users
>  - site/contribute: basic stuff as-is, writing for users to entice them,
> links to the next...
>  - wiki/contributors: contributors writing just for each other
>
> And you also have
>
>  - wiki/users: users writing for users
>
> That's interesting.
>
> Kenn
>
>
>
> On Mon, Jun 11, 2018 at 2:30 PM Robert Bradshaw 
> wrote:
>
>> On Fri, Jun 8, 2018 at 2:18 PM Kenneth Knowles  wrote:
>>
>>
>>> I disagree strongly here - I don't think the wiki will have appropriate
>>> polish for users. Even if carefully polished I don't think the presentation
>>> style is right, and it is not flexible. Power users will find it, of course.
>>>
>>
>> I wasn't imagining a wiki as a platform for developers to author
>> documentation, rather a place for users to author content for other users
>> (tips and tricks, handy PTransforms, etc.) at a much lower bar than
>> expecting users to go in and update our documentation. I agree with the
>> goal of not (further) fragmenting our documentation.
>>
>> As for mixing contributor vs. user information on the same site, I think
>> it's valuable to have some integration and treat the two as a continuum
>> (after all, our (direct) users are already developers) and consider it an
>> asset to have a "contribute" heading right in the main site. (Perhaps, if
>> it's confusing, we could move it all the way to the right.) I don't think
>>> we'll be doing ourselves a favor by blindly copying all the existing docs
>> to a wiki. That being said I think it makes sense to start playing with
>> using a wiki, and see how much value that adds on top of what we already
>> have.
>>
>>
>>>
>>>
 On Fri, Jun 8, 2018 at 12:05 PM Thomas Weise  wrote:

> +1 most of the contributor material could live on Wiki and there it
> will be easier to maintain (perhaps the lower bar for updates will lead to
> more information and increased maintenance). Contribution policy related
> material should remain on the website and go through proper
> review/versioning.
>
>
> On Fri, Jun 8, 2018 at 11:44 AM, Udi Meiri  wrote:
>
>> (a) Yes.
>> (b) I'm interested in putting documentation for contributors there.
>> (test triage guide, precommit and postcommit guidelines, processes, etc.)
>> It'd be faster than having to go through the motions of a github pull
>> request and a review process.
>> (c) Anything that goes to a wide audience, such as SDK users. That
>> would need review.
>>
>> Also, have you looked at https://wiki.apache.org/general/ ? (not
>> sure if that's Confluence)
>>
>>
>> On Fri, Jun 8, 2018 at 10:07 AM Andrew Pilloud 
>> wrote:
>>
>>> +1 It would be really nice to have a lightweight place to share that
>>> is more searchable than random Google docs.
>>>
>>> Andrew
>>>
>>> On Fri, Jun 8, 2018 at 9:35 AM Anton Kedin  wrote:
>>>
 +1

 (a) we should;
 (b) I think it will be a good place for all of the things you list;
 (c) introductory things, like "getting started", or "programming
 guide" that people not deeply involved in the project would expect to 
 find
 on beam.apache.org should stay there, not in the wiki;

 On Fri, Jun 8, 2018 at 12:56 AM Etienne Chauchot <
 echauc...@apache.org> wrote:

> Hi Kenn,
> I'm +1 of course. I remember that you and I and others discussed
> in a similar thread about dev facing docs but it got lost at some 
> point in
> time.
> IMHO
>
> I would add
> - runner specifics (e.g. how runners implement state or timers,
> how they split data into bundles, etc...)
> - probably putting online the doc I did for nexmark that
> summarizes the architecture and pseudo code of the queries (because 
> some
> are several thousand lines of code). I often use it to recall what a 
> given
> query does.
>
> I would remove
>  - BIPs / summaries of collections of JIRA
> because it is hard to maintain and will end up being out of date I
> think.
>
> Etienne
>
> Le jeudi 07 juin 2018 à 13:23 -0700, Kenneth Knowles a écrit :
>
> Hi all,
>
> I've been in half a dozen conversations recently about whether to
> have a wiki and what to use it for. Some things I've heard:
>
>  - "why is all this 

Re: [DISCUSS] Use Confluence wiki for non-user-facing stuff

2018-06-11 Thread Kenneth Knowles
OK, yea, that all makes sense to me. Like this?

 - site/documentation: writing just for users
 - site/contribute: basic stuff as-is, writing for users to entice them,
links to the next...
 - wiki/contributors: contributors writing just for each other

And you also have

 - wiki/users: users writing for users

That's interesting.

Kenn



On Mon, Jun 11, 2018 at 2:30 PM Robert Bradshaw  wrote:

> On Fri, Jun 8, 2018 at 2:18 PM Kenneth Knowles  wrote:
>
>
>> I disagree strongly here - I don't think the wiki will have appropriate
>> polish for users. Even if carefully polished I don't think the presentation
>> style is right, and it is not flexible. Power users will find it, of course.
>>
>
> I wasn't imagining a wiki as a platform for developers to author
> documentation, rather a place for users to author content for other users
> (tips and tricks, handy PTransforms, etc.) at a much lower bar than
> expecting users to go in and update our documentation. I agree with the
> goal of not (further) fragmenting our documentation.
>
> As for mixing contributor vs. user information on the same site, I think
> it's valuable to have some integration and treat the two as a continuum
> (after all, our (direct) users are already developers) and consider it an
> asset to have a "contribute" heading right in the main site. (Perhaps, if
> it's confusing, we could move it all the way to the right.) I don't think
> we'll be doing ourselves a favor by blindly copying all the existing docs
> to a wiki. That being said I think it makes sense to start playing with
> using a wiki, and see how much value that adds on top of what we already
> have.
>
>
>>
>>
>>> On Fri, Jun 8, 2018 at 12:05 PM Thomas Weise  wrote:
>>>
 +1 most of the contributor material could live on Wiki and there it
 will be easier to maintain (perhaps the lower bar for updates will lead to
 more information and increased maintenance). Contribution policy related
 material should remain on the website and go through proper
 review/versioning.


 On Fri, Jun 8, 2018 at 11:44 AM, Udi Meiri  wrote:

> (a) Yes.
> (b) I'm interested in putting documentation for contributors there.
> (test triage guide, precommit and postcommit guidelines, processes, etc.)
> It'd be faster than having to go through the motions of a github pull
> request and a review process.
> (c) Anything that goes to a wide audience, such as SDK users. That
> would need review.
>
> Also, have you looked at https://wiki.apache.org/general/ ? (not sure
> if that's Confluence)
>
>
> On Fri, Jun 8, 2018 at 10:07 AM Andrew Pilloud 
> wrote:
>
>> +1 It would be really nice to have a lightweight place to share that
>> is more searchable than random Google docs.
>>
>> Andrew
>>
>> On Fri, Jun 8, 2018 at 9:35 AM Anton Kedin  wrote:
>>
>>> +1
>>>
>>> (a) we should;
>>> (b) I think it will be a good place for all of the things you list;
>>> (c) introductory things, like "getting started", or "programming
>>> guide" that people not deeply involved in the project would expect to 
>>> find
>>> on beam.apache.org should stay there, not in the wiki;
>>>
>>> On Fri, Jun 8, 2018 at 12:56 AM Etienne Chauchot <
>>> echauc...@apache.org> wrote:
>>>
 Hi Kenn,
 I'm +1 of course. I remember that you and I and others discussed in
 a similar thread about dev facing docs but it got lost at some point in
 time.
 IMHO

 I would add
 - runner specifics (e.g. how runners implement state or timers, how
 they split data into bundles, etc...)
 - probably putting online the doc I did for nexmark that summarizes
 the architecture and pseudo code of the queries (because some are 
 several
 thousand lines of code). I often use it to recall what a given query 
 does.

 I would remove
  - BIPs / summaries of collections of JIRA
 because it is hard to maintain and will end up being out of date I
 think.

 Etienne

 Le jeudi 07 juin 2018 à 13:23 -0700, Kenneth Knowles a écrit :

 Hi all,

 I've been in half a dozen conversations recently about whether to
 have a wiki and what to use it for. Some things I've heard:

  - "why is all this stuff that users don't care about here?"
  - "can we have a lighter weight place to put technical references
 for contributors"

 So I want to consider as a community starting up our wiki. Ideas
 for what could go there:

  - Collection of links to design docs like
 https://beam.apache.org/contribute/design-documents/
  - Specialized walkthroughs like
 https://beam.apache.org/contribute/docker-images/
  - 

Re: [DISCUSS] Use Confluence wiki for non-user-facing stuff

2018-06-11 Thread Robert Bradshaw
On Fri, Jun 8, 2018 at 2:18 PM Kenneth Knowles  wrote:


> I disagree strongly here - I don't think the wiki will have appropriate
> polish for users. Even if carefully polished I don't think the presentation
> style is right, and it is not flexible. Power users will find it, of course.
>

I wasn't imagining a wiki as a platform for developers to author
documentation, rather a place for users to author content for other users
(tips and tricks, handy PTransforms, etc.) at a much lower bar than
expecting users to go in and update our documentation. I agree with the
goal of not (further) fragmenting our documentation.

As for mixing contributor vs. user information on the same site, I think
it's valuable to have some integration and treat the two as a continuum
(after all, our (direct) users are already developers) and consider it an
asset to have a "contribute" heading right in the main site. (Perhaps, if
it's confusing, we could move it all the way to the right.) I don't think
we'll be doing ourselves a favor by blindly copying all the existing docs
to a wiki. That being said I think it makes sense to start playing with
using a wiki, and see how much value that adds on top of what we already
have.


>
>
>> On Fri, Jun 8, 2018 at 12:05 PM Thomas Weise  wrote:
>>
>>> +1 most of the contributor material could live on Wiki and there it will
>>> be easier to maintain (perhaps the lower bar for updates will lead to more
>>> information and increased maintenance). Contribution policy related
>>> material should remain on the website and go through proper
>>> review/versioning.
>>>
>>>
>>> On Fri, Jun 8, 2018 at 11:44 AM, Udi Meiri  wrote:
>>>
 (a) Yes.
 (b) I'm interested in putting documentation for contributors there.
 (test triage guide, precommit and postcommit guidelines, processes, etc.)
 It'd be faster than having to go through the motions of a github pull
 request and a review process.
 (c) Anything that goes to a wide audience, such as SDK users. That
 would need review.

 Also, have you looked at https://wiki.apache.org/general/ ? (not sure
 if that's Confluence)


 On Fri, Jun 8, 2018 at 10:07 AM Andrew Pilloud 
 wrote:

> +1 It would be really nice to have a lightweight place to share that
> is more searchable than random Google docs.
>
> Andrew
>
> On Fri, Jun 8, 2018 at 9:35 AM Anton Kedin  wrote:
>
>> +1
>>
>> (a) we should;
>> (b) I think it will be a good place for all of the things you list;
>> (c) introductory things, like "getting started", or "programming
>> guide" that people not deeply involved in the project would expect to 
>> find
>> on beam.apache.org should stay there, not in the wiki;
>>
>> On Fri, Jun 8, 2018 at 12:56 AM Etienne Chauchot <
>> echauc...@apache.org> wrote:
>>
>>> Hi Kenn,
>>> I'm +1 of course. I remember that you and I and others discussed in
>>> a similar thread about dev facing docs but it got lost at some point in
>>> time.
>>> IMHO
>>>
>>> I would add
>>> - runner specifics (e.g. how runners implement state or timers, how
>>> they split data into bundles, etc...)
>>> - probably putting online the doc I did for nexmark that summarizes
>>> the architecture and pseudo code of the queries (because some are 
>>> several
>>> thousand lines of code). I often use it to recall what a given query 
>>> does.
>>>
>>> I would remove
>>>  - BIPs / summaries of collections of JIRA
>>> because it is hard to maintain and will end up being out of date I
>>> think.
>>>
>>> Etienne
>>>
>>> Le jeudi 07 juin 2018 à 13:23 -0700, Kenneth Knowles a écrit :
>>>
>>> Hi all,
>>>
>>> I've been in half a dozen conversations recently about whether to
>>> have a wiki and what to use it for. Some things I've heard:
>>>
>>>  - "why is all this stuff that users don't care about here?"
>>>  - "can we have a lighter weight place to put technical references
>>> for contributors"
>>>
>>> So I want to consider as a community starting up our wiki. Ideas for
>>> what could go there:
>>>
>>>  - Collection of links to design docs like
>>> https://beam.apache.org/contribute/design-documents/
>>>  - Specialized walkthroughs like
>>> https://beam.apache.org/contribute/docker-images/
>>>  - Best-effort notes that just try to help out like
>>> https://beam.apache.org/contribute/intellij/
>>>  - Docs on in-progress stuff like
>>> https://beam.apache.org/documentation/runners/jstorm/
>>>  - Expanded instructions for committers, more than
>>> https://beam.apache.org/contribute/committer-guide/
>>>  - BIPs / summaries of collections of JIRA
>>>  - Docs sitting in markdown in the repo like
>>> https://github.com/apache/beam/blob/master/sdks/CONTAINERS.md and
>>> 

Re: Building and visualizing the Beam SQL graph

2018-06-11 Thread Mingmin Xu
EXPLAIN shows the execution plan from a SQL perspective only. After converting
to a Beam composite PTransform, there are more steps underneath, and each
runner re-organizes the Beam PTransforms again, which makes the final pipeline
hard to read. In the SQL module itself, I don't see any difference between
`toPTransform` and `toPCollection`. We could have an easy-to-understand step
name when converting RelNodes, but it is the runners that show the graph to
developers.

Mingmin

On Mon, Jun 11, 2018 at 2:06 PM, Andrew Pilloud  wrote:

> That sounds correct. And because each rel node might have a different
> input there isn't a standard interface (like PTransform<PInput,
> PCollection<Row>> toPTransform());
>
> Andrew
>
> On Mon, Jun 11, 2018 at 1:31 PM Kenneth Knowles  wrote:
>
>> Agree with that. It will be kind of tricky to generalize. I think there
>> are some criteria in this case that might apply in other cases:
>>
>> 1. Each rel node (or construct of a DSL) should have a PTransform for how
>> it computes its result from its inputs.
>> 2. The inputs to that PTransform should actually be the inputs to the rel
>> node!
>>
>> So I tried to improve #1 but I probably made #2 worse.
>>
>> Kenn
>>
>> On Mon, Jun 11, 2018 at 12:53 PM Anton Kedin  wrote:
>>
>>> Not answering the original question, but doesn't "explain" satisfy the
>>> SQL use case?
>>>
>>> Going forward we probably want to solve this in a more general way. We
>>> have at least 3 ways to represent the pipeline:
>>>  - how the runner executes it;
>>>  - what it looks like when constructed;
>>>  - what the user was describing in DSL;
>>> And there will probably be more, if extra layers are built on top of
>>> DSLs.
>>>
>>> If possible, we probably should be able to map any level of abstraction
>>> to any other to better understand and debug the pipelines.
>>>
>>>
>>> On Mon, Jun 11, 2018 at 12:17 PM Kenneth Knowles  wrote:
>>>
 In other words, revert https://github.com/apache/beam/pull/4705/files,
 at least in spirit? I agree :-)

 Kenn

 On Mon, Jun 11, 2018 at 11:39 AM Andrew Pilloud 
 wrote:

> We are currently converting the Calcite Rel tree to Beam by
> recursively building a tree of nested PTransforms. This results in a weird
> nested graph in the dataflow UI where each node contains its inputs nested
> inside of it. I'm going to change the internal data structure for
> converting the tree from a PTransform to a PCollection, which will result
> in a more accurate representation of the tree structure being built and
> should simplify the code as well. This will not change the public 
> interface
> to SQL, which will remain a PTransform. Any thoughts or objections?
>
> I was also wondering if there are tools for visualizing the Beam graph
> aside from the dataflow runner UI. What other tools exist?
>
> Andrew
>



-- 

Mingmin


Re: Building and visualizing the Beam SQL graph

2018-06-11 Thread Andrew Pilloud
That sounds correct. And because each rel node might have a different input
there isn't a standard interface (like PTransform<PInput,
PCollection<Row>> toPTransform());

Andrew

On Mon, Jun 11, 2018 at 1:31 PM Kenneth Knowles  wrote:

> Agree with that. It will be kind of tricky to generalize. I think there
> are some criteria in this case that might apply in other cases:
>
> 1. Each rel node (or construct of a DSL) should have a PTransform for how
> it computes its result from its inputs.
> 2. The inputs to that PTransform should actually be the inputs to the rel
> node!
>
> So I tried to improve #1 but I probably made #2 worse.
>
> Kenn
>
> On Mon, Jun 11, 2018 at 12:53 PM Anton Kedin  wrote:
>
>> Not answering the original question, but doesn't "explain" satisfy the
>> SQL use case?
>>
>> Going forward we probably want to solve this in a more general way. We
>> have at least 3 ways to represent the pipeline:
>>  - how the runner executes it;
>>  - what it looks like when constructed;
>>  - what the user was describing in DSL;
>> And there will probably be more, if extra layers are built on top of DSLs.
>>
>> If possible, we probably should be able to map any level of abstraction
>> to any other to better understand and debug the pipelines.
>>
>>
>> On Mon, Jun 11, 2018 at 12:17 PM Kenneth Knowles  wrote:
>>
>>> In other words, revert https://github.com/apache/beam/pull/4705/files,
>>> at least in spirit? I agree :-)
>>>
>>> Kenn
>>>
>>> On Mon, Jun 11, 2018 at 11:39 AM Andrew Pilloud 
>>> wrote:
>>>
 We are currently converting the Calcite Rel tree to Beam by recursively
 building a tree of nested PTransforms. This results in a weird nested graph
 in the dataflow UI where each node contains its inputs nested inside of it.
 I'm going to change the internal data structure for converting the tree
 from a PTransform to a PCollection, which will result in a more accurate
 representation of the tree structure being built and should simplify the
 code as well. This will not change the public interface to SQL, which will
 remain a PTransform. Any thoughts or objections?

 I was also wondering if there are tools for visualizing the Beam graph
 aside from the dataflow runner UI. What other tools exist?

 Andrew

>>>


Re: [DISCUSS] [BEAM-4126] Deleting Maven build files (pom.xml) grace period?

2018-06-11 Thread Lukasz Cwik
Thanks, all. It seems as though only Google needs the grace period. I'll
wait for the shorter of BEAM-4512 being resolved or two weeks before merging
https://github.com/apache/beam/pull/5571


On Wed, Jun 6, 2018 at 8:29 PM Kenneth Knowles  wrote:

> +1
>
> Definitely a good opportunity to decouple your build tools from your
> dependencies' build tools.
>
> On Wed, Jun 6, 2018 at 2:42 PM Ted Yu  wrote:
>
>> +1 on this effort
>>
>>  Original message 
>> From: Chamikara Jayalath 
>> Date: 6/6/18 2:09 PM (GMT-08:00)
>> To: dev@beam.apache.org, u...@beam.apache.org
>> Subject: Re: [DISCUSS] [BEAM-4126] Deleting Maven build files (pom.xml)
>> grace period?
>>
>> +1 for the overall effort. As Pablo mentioned, we need some time to
>> migrate internal Dataflow build off of Maven build files. I created
>> https://issues.apache.org/jira/browse/BEAM-4512 for this.
>>
>> Thanks,
>> Cham
>>
>> On Wed, Jun 6, 2018 at 1:30 PM Eugene Kirpichov 
>> wrote:
>>
>>> Is it possible for Dataflow to just keep a copy of the pom.xmls and
>>> delete them as soon as Dataflow is migrated?
>>>
>>> Overall +1, I've been using Gradle without issues for a while and almost
>>> forgot pom.xml's still existed.
>>>
>>> On Wed, Jun 6, 2018, 1:13 PM Pablo Estrada  wrote:
>>>
 I agree that we should delete the pom.xml files soon, as they create a
 burden for maintainers.

 I'd like to be able to extend the grace period by a bit, to allow the
 internal build systems at Google to move away from using the Beam poms.

 We use these pom files to build Dataflow workers, and thus it's
 critical for us that they are available for a few more weeks while we set
 up a gradle build. Perhaps 4 weeks?
 (Calling out+Chamikara Jayalath  who has
 recently worked on internal Dataflow tooling.)

 Best
 -P.

 On Wed, Jun 6, 2018 at 1:05 PM Lukasz Cwik  wrote:

> Note: Apache Beam will still provide pom.xml for each release it
> produces. This is only about people using Maven to build Apache Beam
> themselves and not relying on the released artifacts in Maven Central.
>
> With the first release using Gradle as the build system is underway, I
> wanted to start this thread to remind people that we are going to delete
> the Maven pom.xml files after the 2.5.0 release is finalized plus a two
> week grace period.
>
> Are there others who would like a shorter/longer grace period?
>
> The PR to delete the pom.xml is here:
> https://github.com/apache/beam/pull/5571
>
 --
 Got feedback? go/pabloem-feedback
 

>>>


Re: Building and visualizing the Beam SQL graph

2018-06-11 Thread Andrew Pilloud
Not quite a revert, we still want to keep the actual transformation inside
a PTransform but the input of that PTransform will be different for each
node type (joins have multiple inputs for example). We have this function
as our builder right now:

PTransform<PInput, PCollection<Row>> toPTransform();

When I'm done we will have this:

PCollection<Row> toPCollection(Pipeline pipeline);

Andrew
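
The signature change can be sketched in plain Java (a model of the two recursion shapes only, with stand-in classes; this is not the actual Beam SQL code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RelConversionSketch {
  // Minimal stand-in for a Calcite rel node: a name plus its input nodes.
  static class Rel {
    final String name;
    final List<Rel> inputs;
    Rel(String name, Rel... inputs) {
      this.name = name;
      this.inputs = Arrays.asList(inputs);
    }
  }

  // Old shape: each node becomes a composite that expands its inputs
  // *inside* itself -- which is why the UI shows nodes containing their inputs.
  static String asNestedComposite(Rel rel) {
    StringBuilder sb = new StringBuilder(rel.name);
    if (!rel.inputs.isEmpty()) {
      sb.append("[");
      for (int i = 0; i < rel.inputs.size(); i++) {
        if (i > 0) sb.append(", ");
        sb.append(asNestedComposite(rel.inputs.get(i)));
      }
      sb.append("]");
    }
    return sb.toString();
  }

  // New shape: convert each input to its output first, then record a flat
  // edge from every input's output to this node -- the resulting graph
  // mirrors the actual tree structure instead of nesting it.
  static String toOutput(Rel rel, List<String> edges) {
    for (Rel input : rel.inputs) {
      edges.add(toOutput(input, edges) + " -> " + rel.name);
    }
    return rel.name;
  }

  public static void main(String[] args) {
    Rel scan = new Rel("Scan");
    Rel filter = new Rel("Filter", scan);
    Rel project = new Rel("Project", filter);

    System.out.println(asNestedComposite(project)); // Project[Filter[Scan]]

    List<String> edges = new ArrayList<>();
    toOutput(project, edges);
    System.out.println(edges); // [Scan -> Filter, Filter -> Project]
  }
}
```

The nested-composite shape is what produces the "each node contains its inputs" rendering; returning the output per node lets each node wire edges from its real inputs.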

On Mon, Jun 11, 2018 at 12:53 PM Anton Kedin  wrote:

> Not answering the original question, but doesn't "explain" satisfy the SQL
> use case?
>
> Going forward we probably want to solve this in a more general way. We
> have at least 3 ways to represent the pipeline:
>  - how runner executes it;
>  - what it looks like when constructed;
>  - what the user was describing in DSL;
> And there will probably be more, if extra layers are built on top of DSLs.
>
> If possible, we probably should be able to map any level of abstraction to
> any other to better understand and debug the pipelines.
>
>
> On Mon, Jun 11, 2018 at 12:17 PM Kenneth Knowles  wrote:
>
>> In other words, revert https://github.com/apache/beam/pull/4705/files,
>> at least in spirit? I agree :-)
>>
>> Kenn
>>
>> On Mon, Jun 11, 2018 at 11:39 AM Andrew Pilloud 
>> wrote:
>>
>>> We are currently converting the Calcite Rel tree to Beam by recursively
>>> building a tree of nested PTransforms. This results in a weird nested graph
>>> in the dataflow UI where each node contains its inputs nested inside of it.
>>> I'm going to change the internal data structure for converting the tree
>>> from a PTransform to a PCollection, which will result in a more accurate
>>> representation of the tree structure being built and should simplify the
>>> code as well. This will not change the public interface to SQL, which will
>>> remain a PTransform. Any thoughts or objections?
>>>
>>> I was also wondering if there are tools for visualizing the Beam graph
>>> aside from the dataflow runner UI. What other tools exist?
>>>
>>> Andrew
>>>
>>


Re: Building and visualizing the Beam SQL graph

2018-06-11 Thread Kenneth Knowles
Agree with that. It will be kind of tricky to generalize. I think there are
some criteria in this case that might apply in other cases:

1. Each rel node (or construct of a DSL) should have a PTransform for how
it computes its result from its inputs.
2. The inputs to that PTransform should actually be the inputs to the rel
node!

So I tried to improve #1 but I probably made #2 worse.

Kenn

On Mon, Jun 11, 2018 at 12:53 PM Anton Kedin  wrote:

> Not answering the original question, but doesn't "explain" satisfy the SQL
> use case?
>
> Going forward we probably want to solve this in a more general way. We
> have at least 3 ways to represent the pipeline:
>  - how runner executes it;
>  - what it looks like when constructed;
>  - what the user was describing in DSL;
> And there will probably be more, if extra layers are built on top of DSLs.
>
> If possible, we probably should be able to map any level of abstraction to
> any other to better understand and debug the pipelines.
>
>
> On Mon, Jun 11, 2018 at 12:17 PM Kenneth Knowles  wrote:
>
>> In other words, revert https://github.com/apache/beam/pull/4705/files,
>> at least in spirit? I agree :-)
>>
>> Kenn
>>
>> On Mon, Jun 11, 2018 at 11:39 AM Andrew Pilloud 
>> wrote:
>>
>>> We are currently converting the Calcite Rel tree to Beam by recursively
>>> building a tree of nested PTransforms. This results in a weird nested graph
>>> in the dataflow UI where each node contains its inputs nested inside of it.
>>> I'm going to change the internal data structure for converting the tree
>>> from a PTransform to a PCollection, which will result in a more accurate
>>> representation of the tree structure being built and should simplify the
>>> code as well. This will not change the public interface to SQL, which will
>>> remain a PTransform. Any thoughts or objections?
>>>
>>> I was also wondering if there are tools for visualizing the Beam graph
>>> aside from the dataflow runner UI. What other tools exist?
>>>
>>> Andrew
>>>
>>


Re: Multimap PCollectionViews' values updated rather than appended

2018-06-11 Thread Lukasz Cwik
Thanks for the snippet, updated BEAM-4470 with the additional details.

On Mon, Jun 11, 2018 at 10:56 AM Carlos Alonso  wrote:

> Many thanks for your help. Actually, my use case emits the entire map
> every time, so I guess I'm good to go with discarding mode.
>
> This test reproduces the issue:
> https://github.com/calonso/beam_experiments/blob/master/refreshingsideinput/src/test/scala/com/mrcalonso/RefreshingSideInput2Test.scala#L19-L53
>
> Hope it helps
>
> On Mon, Jun 4, 2018 at 9:04 PM Lukasz Cwik  wrote:
>
>> Carlos, can you provide a test/code snippet for the bug that shows the
>> issue?
>>
>> On Mon, Jun 4, 2018 at 11:57 AM Lukasz Cwik  wrote:
>>
>>> +dev@beam.apache.org
>>> Note that this is likely a bug in the DirectRunner for accumulation
>>> mode, filed: https://issues.apache.org/jira/browse/BEAM-4470
>>>
>>> Discarding mode is meant to always be the latest firing, the issue
>>> though is that you need to emit the entire map every time. If you can do
>>> this, then it makes sense to use discarding mode. The issue with discarding
>>> mode is that if your first trigger firing produces (A, 1), (B, 1) and your
>>> second firing produces (B, 2), the multimap will only contain (B, 2) and
>>> (A, 1) will have been discarded.
>>>
>>> To my knowledge, there is no guarantee about the order in which the
>>> values are combined. You will need to use some piece of information about
>>> the element to figure out which is the latest (or encode some additional
>>> information along with each element to make this easy).
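>>>
>>> Luke's (A, 1)/(B, 2) example can be simulated in plain Java -- this models the concatenating-combiner semantics only, not the Beam API:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SideInputFiringSketch {
  // Build the multimap view by concatenating every (key, value) pair the
  // pane delivers -- mirroring the hardcoded concatenating combiner.
  static Map<String, List<Integer>> view(List<Map.Entry<String, Integer>> pane) {
    Map<String, List<Integer>> multimap = new TreeMap<>();
    for (Map.Entry<String, Integer> kv : pane) {
      multimap.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
    }
    return multimap;
  }

  public static void main(String[] args) {
    List<Map.Entry<String, Integer>> firstFiring =
        Arrays.asList(new SimpleEntry<>("A", 1), new SimpleEntry<>("B", 1));
    List<Map.Entry<String, Integer>> secondFiring =
        Arrays.asList(new SimpleEntry<>("B", 2));

    // Discarding mode: only the latest pane reaches the view, so (A, 1) is lost.
    System.out.println(view(secondFiring)); // {B=[2]}

    // Accumulating mode: the second pane carries everything seen so far, so
    // the concatenation keeps (A, 1) alongside both values of B.
    List<Map.Entry<String, Integer>> accumulated = new ArrayList<>(firstFiring);
    accumulated.addAll(secondFiring);
    System.out.println(view(accumulated)); // {A=[1], B=[1, 2]}
  }
}
```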
>>>
>>> On Thu, May 31, 2018 at 9:16 AM Carlos Alonso 
>>> wrote:
>>>
 I've improved the example a little and added some tests
 https://github.com/calonso/beam_experiments/blob/master/refreshingsideinput/src/test/scala/com/mrcalonso/RefreshingSideInput2Test.scala

 The behaviour is slightly different, which is possibly because of the
 different runners (Dataflow/Direct) implementations, but still not working.

 Now what happens is that although the internal PCollection gets
 updated, the view isn't. This is happening regardless of the accumulation
 mode.

 Regarding the accumulation mode on Dataflow... That was it!! Now the
 sets contain all the items, however, one more question, is the ordering
 within the set deterministic? (i.e: Can I assume that the latest will
 always be on the last position of the Iterable object?)

 Also... given that for my particular case I only want the latest
 version, would you advise me to go ahead with Discarding mode?

 Regards

 On Thu, May 31, 2018 at 4:44 PM Lukasz Cwik  wrote:

> The trigger definition in the sample code you have is using discarding
> firing mode. Try swapping to using accumulating mode.
>
>
> On Thu, May 31, 2018 at 1:42 AM Carlos Alonso 
> wrote:
>
>> But I think what I'm experiencing is quite different. Basically the
>> side input is updated, but only one element is found on the Iterable that
>> is the value of any key of the multimap.
>>
>> I mean, no concatenation seems to be happening. On the linked thread,
>> Kenn suggests that every firing will add the new value to the set of 
>> values
>> for the emitted key, but what I'm experiencing is that the new value is
>> there, but just itself (i.e: is the only element in the set).
>>
>> @Robert, I'm using
>> Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane())
>>
>> On Wed, May 30, 2018 at 7:46 PM Lukasz Cwik  wrote:
>>
>>> An alternative to the thread that Kenn linked (adding support for
>>> retractions) is to add explicit support for combiners into side inputs. 
>>> The
>>> system currently works by using a hardcoded concatenating combiner, so
>>> maps, lists, iterables, singletons, multimaps all work by concatenating 
>>> the
>>> set of values emitted and then turning it into a view which is why it 
>>> is an
>>> error for a singleton and map view if the trigger fires multiple times.
>>>
>>> On Wed, May 30, 2018 at 10:01 AM Kenneth Knowles 
>>> wrote:
>>>
 Yes, this is a known issue. Here's a prior discussion:
 https://lists.apache.org/thread.html/e9518f5d5f4bcf7bab02de2cb9fe1bd5293d87aa12d46de1eac4600b@%3Cuser.beam.apache.org%3E

 It is actually long-standing and the solution is known but hard.



 On Wed, May 30, 2018 at 9:48 AM Carlos Alonso 
 wrote:

> Hi everyone!!
>
> Working with multimap based side inputs on the global window I'm
> experiencing something unexpected (at least to me) that I'd like to 
> share
> with you to clarify.
>
> The way I understand multimaps is that when one emits two values
> for the same key for the same window (obvious thing here as I'm 
> working on
> 

Re: Building and visualizing the Beam SQL graph

2018-06-11 Thread Anton Kedin
Not answering the original question, but doesn't "explain" satisfy the SQL
use case?

Going forward we probably want to solve this in a more general way. We have
at least 3 ways to represent the pipeline:
 - how runner executes it;
 - what it looks like when constructed;
 - what the user was describing in DSL;
And there will probably be more, if extra layers are built on top of DSLs.

If possible, we probably should be able to map any level of abstraction to
any other to better understand and debug the pipelines.


On Mon, Jun 11, 2018 at 12:17 PM Kenneth Knowles  wrote:

> In other words, revert https://github.com/apache/beam/pull/4705/files, at
> least in spirit? I agree :-)
>
> Kenn
>
> On Mon, Jun 11, 2018 at 11:39 AM Andrew Pilloud 
> wrote:
>
>> We are currently converting the Calcite Rel tree to Beam by recursively
>> building a tree of nested PTransforms. This results in a weird nested graph
>> in the dataflow UI where each node contains its inputs nested inside of it.
>> I'm going to change the internal data structure for converting the tree
>> from a PTransform to a PCollection, which will result in a more accurate
>> representation of the tree structure being built and should simplify the
>> code as well. This will not change the public interface to SQL, which will
>> remain a PTransform. Any thoughts or objections?
>>
>> I was also wondering if there are tools for visualizing the Beam graph
>> aside from the dataflow runner UI. What other tools exist?
>>
>> Andrew
>>
>


Re: [VOTE] Apache Beam, version 2.5.0, release candidate #1

2018-06-11 Thread Alan Myrvold
+1 (non-binding)

tested some of the quickstarts

On Sun, Jun 10, 2018 at 1:39 AM Tim  wrote:

> Tested by our team:
> - mvn inclusion
> - Avro, ES, Hadoop IF IO
> - Pipelines run on Spark (Cloudera 5.12.0 YARN cluster)
> - Reviewed release notes
>
> +1
>
> Thanks also to everyone who helped get over the gradle hurdle and in
> particular to JB.
>
> Tim
>
> > On 9 Jun 2018, at 05:56, Jean-Baptiste Onofré  wrote:
> >
> > No problem Pablo.
> >
> > The vote period is a minimum, it can be extended as requested or if we
> > don't have the minimum of 3 binding votes.
> >
> > Regards
> > JB
> >
> >> On 09/06/2018 01:54, Pablo Estrada wrote:
> >> Hello all,
> >> I'd like to request an extension of the voting period until Monday
> >> evening (US time, so later in other geographical regions). This is
> >> because we were only now able to publish Dataflow Workers, and have not
> >> had the chance to run release validation tests on them. The extension
> >> will allow us to validate and vote by Monday.
> >> Is this acceptable to the community?
> >>
> >> Best
> >> -P.
> >>
> >> On Fri, Jun 8, 2018 at 6:20 AM Alexey Romanenko
> >> mailto:aromanenko@gmail.com>> wrote:
> >>
> >>Thank you JB for your work!
> >>
> >>I tested running simple streaming (/KafkaIO/) and batch (/TextIO /
> >>HDFS/) pipelines with SparkRunner on YARN cluster - it works fine.
> >>
> >>WBR,
> >>Alexey
> >>
> >>
> >>>On 8 Jun 2018, at 10:00, Etienne Chauchot wrote:
> >>>
> >>>I forgot to vote:
> >>>+1 (non binding).
> >>>What I tested:
> >>>- no functional or performance regression comparing to v2.4
> >>>- dependencies in the poms are ok
> >>>
> >>>Etienne
> Le vendredi 08 juin 2018 à 08:27 +0200, Romain Manni-Bucau a écrit
> :
> +1 (non-binding), mainstream usage is not broken by the pom
> changes and runtime has no known regression compared to the 2.4.0
> 
> (side note: kudo to JB for this build tool change release, I know
> how it can hurt ;))
> 
> Romain Manni-Bucau
> @rmannibucau  |  Blog
>  | Old Blog
>  | Github
>  | LinkedIn
>  | Book
> <
> https://www.packtpub.com/application-development/java-ee-8-high-performance
> >
> 
> 
> Le jeu. 7 juin 2018 à 16:17, Jean-Baptiste Onofré
> mailto:j...@nanthrax.net>> a écrit :
> >Thanks for the details Etienne !
> >
> >The good news is that the artifacts seem OK and the overall
> Nexmark
> >results are consistent with the 2.4.0 release ones.
> >
> >I'm starting a complete review using the beam-samples as well.
> >
> >Regards
> >JB
> >
> >>On 07/06/2018 16:14, Etienne Chauchot wrote:
> >> Hi,
> >> I've just run the nexmark queries on v2.5.0-RC1 tag
> >> What we can notice:
> >> - query 3 (exercises CoGroupByKey, state and timer) shows
> >different
> >> output with DR between batch and streaming and with the other
> >runners =>
> >> I compared with v2.4 there were still these differences but with
> >> different output size numbers
> >>
> >> - query 6 (exercises specialized combiner) shows different output
> >> between the runners => the correct output is 401. strange that
> >in batch
> >> mode some runners output fewer Sellers. I compared with v2.4
> >same output
> >>
> >> - response time of query 7 (exercices Max transform, fanout
> >and side
> >> input) is very slow on DR => I compared with v2.4 , comparable
> >execution
> >> times
> >>
> >> I'm not comparing q10 because it is a write to GCS so it is
> >very specific.
> >>
> >> => Basically no regression comparing to v2.4
> >>
> >> For the record here is the output (waiting for ongoing perfkit
> >integration):
> >>
> >>
> >> 1. DR batch
> >>
> >> Performance:
> >>
> >> Conf  Runtime(sec)(Baseline)  Events(/sec)(Baseline)  Results(Baseline)
> >> 0000             5,8                 17283,1                10
> >> 0001             3,2                 31104,2             92000
> >> 0002             1,2                 82918,7               351
> >> 0003             2,2                 46210,7               458
> >> 0004             1,2                  8503,4                40
> >> 0005             4,0                 25220,7                12
> >> 0006             0,9                 11148,3               401
> >> 0007            13,2 

Re: Building and visualizing the Beam SQL graph

2018-06-11 Thread Kenneth Knowles
In other words, revert https://github.com/apache/beam/pull/4705/files, at
least in spirit? I agree :-)

Kenn

On Mon, Jun 11, 2018 at 11:39 AM Andrew Pilloud  wrote:

> We are currently converting the Calcite Rel tree to Beam by recursively
> building a tree of nested PTransforms. This results in a weird nested graph
> in the dataflow UI where each node contains its inputs nested inside of it.
> I'm going to change the internal data structure for converting the tree
> from a PTransform to a PCollection, which will result in a more accurate
> representation of the tree structure being built and should simplify the
> code as well. This will not change the public interface to SQL, which will
> remain a PTransform. Any thoughts or objections?
>
> I was also wondering if there are tools for visualizing the Beam graph
> aside from the dataflow runner UI. What other tools exist?
>
> Andrew
>


Re: Building and visualizing the Beam SQL graph

2018-06-11 Thread Huygaa Batsaikhan
I was also wondering the same thing. I don't think there is any
visualization tool for Beam. :(

On Mon, Jun 11, 2018 at 11:39 AM Andrew Pilloud  wrote:

> We are currently converting the Calcite Rel tree to Beam by recursively
> building a tree of nested PTransforms. This results in a weird nested graph
> in the dataflow UI where each node contains its inputs nested inside of it.
> I'm going to change the internal data structure for converting the tree
> from a PTransform to a PCollection, which will result in a more accurate
> representation of the tree structure being built and should simplify the
> code as well. This will not change the public interface to SQL, which will
> remain a PTransform. Any thoughts or objections?
>
> I was also wondering if there are tools for visualizing the Beam graph
> aside from the dataflow runner UI. What other tools exist?
>
> Andrew
>


Re: Beam SQL: Integrating runners & IO

2018-06-11 Thread Andrew Pilloud
Thanks for the great writeup Kenn! I really like the part about pushing
the TableProvider abstraction into core and the IOs. This would make it
really easy to extend the IOs supported by your SQL shell just by adding
the appropriate IOs to the classpath. It would also make testing much
easier as we could create mock IOs that don't depend on SQL for some of the
JDBC integration tests.

Andrew

On Mon, Jun 11, 2018 at 10:44 AM Kenneth Knowles  wrote:

> Hi all,
>
> Andrew mentioned something super cool about what he and Anton have done
> with Beam SQL: it now implements a JDBC driver (via Calcite Avatica).
>
> And since the sqlline client is tiny, it is just baked in. So you can java
> -jar the JDBC driver and run a little shell. But the shell only has the
> direct runner, and to make a nice experience for SQL users who don't want
> to deal with Java, there are some challenges.
>
> I wrote up a really quick doc for brainstorming
>
> https://s.apache.org/beam-sql-packaging
>
> Please lend your comments and expertise!
>
> Kenn
>


Building and visualizing the Beam SQL graph

2018-06-11 Thread Andrew Pilloud
We are currently converting the Calcite Rel tree to Beam by recursively
building a tree of nested PTransforms. This results in a weird nested graph
in the dataflow UI where each node contains its inputs nested inside of it.
I'm going to change the internal data structure for converting the tree
from a PTransform to a PCollection, which will result in a more accurate
representation of the tree structure being built and should simplify the
code as well. This will not change the public interface to SQL, which will
remain a PTransform. Any thoughts or objections?

I was also wondering if there are tools for visualizing the Beam graph
aside from the dataflow runner UI. What other tools exist?

Andrew


Re: Multimap PCollectionViews' values updated rather than appended

2018-06-11 Thread Carlos Alonso
Many thanks for your help. Actually, my use case emits the entire map
every time, so I guess I'm good to go with discarding mode.

This test reproduces the issue:
https://github.com/calonso/beam_experiments/blob/master/refreshingsideinput/src/test/scala/com/mrcalonso/RefreshingSideInput2Test.scala#L19-L53

Hope it helps

On Mon, Jun 4, 2018 at 9:04 PM Lukasz Cwik  wrote:

> Carlos, can you provide a test/code snippet for the bug that shows the
> issue?
>
> On Mon, Jun 4, 2018 at 11:57 AM Lukasz Cwik  wrote:
>
>> +dev@beam.apache.org
>> Note that this is likely a bug in the DirectRunner for accumulation mode,
>> filed: https://issues.apache.org/jira/browse/BEAM-4470
>>
>> Discarding mode is meant to always be the latest firing, the issue though
>> is that you need to emit the entire map every time. If you can do this,
>> then it makes sense to use discarding mode. The issue with discarding mode
>> is that if your first trigger firing produces (A, 1), (B, 1) and your
>> second firing produces (B, 2), the multimap will only contain (B, 2) and
>> (A, 1) will have been discarded.
>>
>> To my knowledge, there is no guarantee about the order in which the
>> values are combined. You will need to use some piece of information about
>> the element to figure out which is the latest (or encode some additional
>> information along with each element to make this easy).
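>>
>> The "encode some additional information along with each element" idea might look like this in plain Java -- the Versioned wrapper is hypothetical, not a Beam class:

```java
import java.util.Arrays;
import java.util.List;

public class LatestValueSketch {
  // Hypothetical wrapper: pair each emitted value with a sequence number
  // (or event timestamp) so the latest one can be recovered regardless of
  // the iteration order of the side-input Iterable.
  static class Versioned {
    final long seq;
    final String value;
    Versioned(long seq, String value) { this.seq = seq; this.value = value; }
  }

  static String latest(Iterable<Versioned> values) {
    Versioned best = null;
    for (Versioned v : values) {
      if (best == null || v.seq > best.seq) best = v;
    }
    return best == null ? null : best.value;
  }

  public static void main(String[] args) {
    // Order deliberately scrambled: position in the iterable tells you nothing.
    List<Versioned> sideInputValues = Arrays.asList(
        new Versioned(2, "second"), new Versioned(3, "third"), new Versioned(1, "first"));
    System.out.println(latest(sideInputValues)); // third
  }
}
```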
>>
>> On Thu, May 31, 2018 at 9:16 AM Carlos Alonso 
>> wrote:
>>
>>> I've improved the example a little and added some tests
>>> https://github.com/calonso/beam_experiments/blob/master/refreshingsideinput/src/test/scala/com/mrcalonso/RefreshingSideInput2Test.scala
>>>
>>> The behaviour is slightly different, which is possibly because of the
>>> different runners (Dataflow/Direct) implementations, but still not working.
>>>
>>> Now what happens is that although the internal PCollection gets updated,
>>> the view isn't. This is happening regardless of the accumulation mode.
>>>
>>> Regarding the accumulation mode on Dataflow... That was it!! Now the
>>> sets contain all the items, however, one more question, is the ordering
>>> within the set deterministic? (i.e: Can I assume that the latest will
>>> always be on the last position of the Iterable object?)
>>>
>>> Also... given that for my particular case I only want the latest
>>> version, would you advise me to go ahead with Discarding mode?
>>>
>>> Regards
>>>
>>> On Thu, May 31, 2018 at 4:44 PM Lukasz Cwik  wrote:
>>>
 The trigger definition in the sample code you have is using discarding
 firing mode. Try swapping to using accumulating mode.


 On Thu, May 31, 2018 at 1:42 AM Carlos Alonso 
 wrote:

> But I think what I'm experiencing is quite different. Basically the
> side input is updated, but only one element is found on the Iterable that
> is the value of any key of the multimap.
>
> I mean, no concatenation seems to be happening. On the linked thread,
> Kenn suggests that every firing will add the new value to the set of 
> values
> for the emitted key, but what I'm experiencing is that the new value is
> there, but just itself (i.e: is the only element in the set).
>
> @Robert, I'm using
> Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane())
>
> On Wed, May 30, 2018 at 7:46 PM Lukasz Cwik  wrote:
>
>> An alternative to the thread that Kenn linked (adding support for
>> retractions) is to add explicit support for combiners into side inputs. 
>> The
>> system currently works by using a hardcoded concatenating combiner, so
>> maps, lists, iterables, singletons, multimaps all work by concatenating 
>> the
>> set of values emitted and then turning it into a view which is why it is 
>> an
>> error for a singleton and map view if the trigger fires multiple times.
>>
>> On Wed, May 30, 2018 at 10:01 AM Kenneth Knowles 
>> wrote:
>>
>>> Yes, this is a known issue. Here's a prior discussion:
>>> https://lists.apache.org/thread.html/e9518f5d5f4bcf7bab02de2cb9fe1bd5293d87aa12d46de1eac4600b@%3Cuser.beam.apache.org%3E
>>>
>>> It is actually long-standing and the solution is known but hard.
>>>
>>>
>>>
>>> On Wed, May 30, 2018 at 9:48 AM Carlos Alonso 
>>> wrote:
>>>
 Hi everyone!!

 Working with multimap based side inputs on the global window I'm
 experiencing something unexpected (at least to me) that I'd like to 
 share
 with you to clarify.

 The way I understand multimaps is that when one emits two values
 for the same key for the same window (obvious thing here as I'm 
 working on
 the Global one), the newly emitted values are appended to the Iterable
 collection that is the value for that particular key on the map.

 Testing it in this job (it is using scio, but side inputs are
 implemented with 

Re: Beam SQL Improvements

2018-06-11 Thread Reuven Lax
Does DirectRunner do this today?

On Mon, Jun 4, 2018 at 9:10 PM Lukasz Cwik  wrote:

> Shouldn't the runner isolate each instance of the pipeline behind an
> appropriate class loader?
>
> On Sun, Jun 3, 2018 at 12:45 PM Reuven Lax  wrote:
>
>> Just an update: Romain and I chatted on Slack, and I think I understand
>> his concern. The concern wasn't specifically about schemas, rather about
>> having a generic way to register per-ParDo state that has worker lifetime.
>> As evidence that such is needed, in many cases static variables are used to
>> simiulate that. static variables however have downsides - if two pipelines
>> are run on the same JVM (happens often with unit tests, and there's nothing
>> that prevents a runner from doing so in a production environment), these
>> static variables will interfere with each other.
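>>
>> The interference described above is easy to demonstrate without Beam at all -- MyDoFn here is just a plain-Java stand-in, not a real DoFn:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class StaticStateSketch {
  static class MyDoFn {
    // Static "cache" simulating worker-lifetime state: shared JVM-wide,
    // so two pipelines in one JVM step on each other.
    static final AtomicInteger sharedCache = new AtomicInteger();
    // Per-instance state stays isolated to each pipeline's DoFn instance.
    final AtomicInteger perInstance = new AtomicInteger();

    void processElement() {
      sharedCache.incrementAndGet();
      perInstance.incrementAndGet();
    }
  }

  public static void main(String[] args) {
    MyDoFn pipelineOne = new MyDoFn();
    MyDoFn pipelineTwo = new MyDoFn();
    pipelineOne.processElement();
    pipelineTwo.processElement();

    // The static "cache" saw both pipelines; each instance saw only its own work.
    System.out.println(MyDoFn.sharedCache.get());      // 2
    System.out.println(pipelineOne.perInstance.get()); // 1
  }
}
```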
>>
>> On Thu, May 24, 2018 at 12:30 AM Reuven Lax  wrote:
>>
>>> Romain, maybe it would be useful for us to find some time on slack. I'd
>>> like to understand your concerns. Also keep in mind that I'm tagging all
>>> these classes as Experimental for now, so we can definitely change these
>>> interfaces around if we decide they are not the best ones.
>>>
>>> Reuven
>>>
>>> On Tue, May 22, 2018 at 11:35 PM Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>
 Why not extend ProcessContext to add the new remapped output? But it
 looks good (the part I don't like is that creating a new context each time a
 new feature is added hurts users. What happens when Beam adds some
 reactive support? ReactiveOutputReceiver?)

 Pipeline sounds the wrong storage since once distributed you serialized
 the instances so kind of broke the lifecycle of the original instance and
 have no real release/close hook on them anymore right? Not sure we can do
 better than dofn/source embedded instances today.




 Le mer. 23 mai 2018 08:02, Romain Manni-Bucau 
 a écrit :

>
>
> Le mer. 23 mai 2018 07:55, Jean-Baptiste Onofré  a
> écrit :
>
>> Hi,
>>
>> IMHO, it would be better to have a explicit transform/IO as converter.
>>
>> It would be easier for users.
>>
>> Another option would be to use a "TypeConverter/SchemaConverter" map
>> as
>> we do in Camel: Beam could check the source/destination "type" and
>> check
>> in the map if there's a converter available. This map can be stored as
>> part of the pipeline (as we do for filesystem registration).
>>
>
>
> It works in Camel because it is not strongly typed, isn't it? So it can
> require a new Beam pipeline API.
>
> +1 for the explicit transform, if added to the pipeline api as coder
> it wouldn't break the fluent api:
>
> p.apply(io).setOutputType(Foo.class)
>
> Coders can be a workaround since they own the type but since the
> pcollection is the real owner it is surely saner this way, no?
>
> Also it needs to ensure all converters are present before running the
> pipeline probably, no implicit environment converter support is probably
> good to start to avoid late surprises.
>
>
>
>> My $0.01
>>
>> Regards
>> JB
>>
>> On 23/05/2018 07:51, Romain Manni-Bucau wrote:
>> > How does it work on the pipeline side?
>> > Do you generate these "virtual" IO at build time to enable the
>> fluent
>> > API to work not erasing generics?
>> >
>> > ex: SQL(row)->BigQuery(native) will not compile so we need a
>> > SQL(row)->BigQuery(row)
>> >
>> > Side note unrelated to Row: if you add another registry maybe a
>> pretask
>> > is to ensure beam has a kind of singleton/context to avoid
>> duplicating
>> > it or not tracking it properly. These kinds of converters will need a
>> global
>> > close and not only per record in general:
>> > converter.init();converter.convert(row);converter.destroy();,
>> > otherwise it easily leaks. This is why it can require some way to
>> not
>> > recreate it. A quick fix, if you are in bytebuddy already, can be
>> to add
>> > it to setup/teardown pby, being more global would be nicer but is
>> more
>> > challenging.
>> >
>> > Romain Manni-Bucau
>> > @rmannibucau  |  Blog
>> >  | Old Blog
>> >  | Github
>> >  | LinkedIn
>> >  | Book
>> > <
>> https://www.packtpub.com/application-development/java-ee-8-high-performance
>> >
>> >
>> >
>> > Le mer. 23 mai 2018 à 07:22, Reuven Lax > > > a écrit :
>> >
>> > No - the only modules we need to add to core are the ones we
>> choose
>> > to add. For example, I will probably add a registration for
>> > 

Beam SQL: Integrating runners & IO

2018-06-11 Thread Kenneth Knowles
Hi all,

Andrew mentioned something super cool about what he and Anton have done
with Beam SQL: it now implements a JDBC driver (via Calcite Avatica).

And since the sqlline client is tiny, it is just baked in. So you can java
-jar the JDBC driver and run a little shell. But the shell only has the
direct runner, and to make a nice experience for SQL users who don't want
to deal with Java, there are some challenges.

I wrote up a really quick doc for brainstorming

https://s.apache.org/beam-sql-packaging

Please lend your comments and expertise!

Kenn
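
For readers curious what the "java -jar the JDBC driver" flow looks like in practice, a rough sketch follows -- the jar name and connect string are assumptions for illustration, not taken from this thread:

```shell
# Hypothetical jar name: the real artifact name depends on the Beam
# version and how the shaded SQL JDBC driver is built.
java -jar beam-sdks-java-extensions-sql-jdbc-bundled.jar

# Then, inside the baked-in sqlline shell (URL prefix assumed):
#   !connect jdbc:beam: "" ""
#   SELECT 'hello beam';
```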


Re: [FYI] New Apache Beam Swag Store!

2018-06-11 Thread Mikhail Gryzykhin
That's nice!

More colors are appreciated :)

--Mikhail


On Sun, Jun 10, 2018 at 8:20 PM Kenneth Knowles  wrote:

> Sweet! Agree with Raghu :-)
>
> Kenn
>
> On Sun, Jun 10, 2018 at 6:06 AM Matthias Baetens <
> baetensmatth...@gmail.com> wrote:
>
>> Great news, big thanks for all the work, Gris! Looking forward to people
>> wearing this around the globe ;)
>>
>> On Sat, 9 Jun 2018 at 01:28 Ankur Goenka  wrote:
>>
>>> Awesome!
>>>
>>>
>>> On Fri, Jun 8, 2018 at 4:24 PM Pablo Estrada  wrote:
>>>
 Nice : D

 On Fri, Jun 8, 2018, 3:43 PM Raghu Angadi  wrote:

> Woo-hoo! This is terrific.
>
> If we are increasing color choices I would like black or charcoal...
> Beam logo would really pop on a dark background.
>
> On Fri, Jun 8, 2018 at 3:32 PM Griselda Cuevas 
> wrote:
>
>> Hi Beam Community,
>>
>> I just want to share with you the exciting news about our brand new
>> Apache Beam Swag Store!
>>
>> You can find it here: https://store-beam.myshopify.com/
>>
>> *How does it work?*
>>
>>- You can just select the items you want and check-out. Our
>>Vendor ships to anywhere in the world and normally can have swag to be
>>delivered within 1 week. Each company or user will need to pay for 
>> their
>>own swag.
>>- If you are hosting an event or representing Beam at one, reach
>>out to me or the beam-events-meetups slack channel, I'll be happy to 
>> review
>>your event and see if we can sponsor the swag. We'll have codes for 
>> this
>>occasions thanks to Google, who has sponsored an initial inventory.
>>
>> If you have feedback, ideas on new swag, questions or suggestions,
>> reach out to me and/or Matthias Baetens.
>>
>> Happy Friday!
>> G
>>
>>
>> --
 Got feedback? go/pabloem-feedback
 

>>> --
>>
>>
>


Re: [DISCUSS] Use Confluence wiki for non-user-facing stuff

2018-06-11 Thread Kenneth Knowles
Yup, I'm "kenn". Thanks!

To the thread - if you want write access, please ask. Committers should
certainly have it (but I'll do it lazily); beyond that I think it is up for
discussion.

Kenn

On Mon, Jun 11, 2018 at 5:52 AM Daniel Kulp  wrote:

> Neither you nor JB was set up to be able to do anything with the Wiki. I
> just enabled JB and yourself (assume “kenn” is you) to have permission to
> administer the space so you should be able to add other people.
>
> Dan
>
>
>
> On Jun 9, 2018, at 9:57 AM, Kenneth Knowles  wrote:
>
> Yea, we have one but it seems to not be set up quite right. JB you are the
> admin from the original incubation. Can you check the permissions? Also
> make some more admins (maybe all the PMC to start with). I would like to
> help out.
>
> https://cwiki.apache.org/confluence/display/BEAM/
> https://issues.apache.org/jira/browse/INFRA-11181
>
>
> Kenn
>
> On Fri, Jun 8, 2018 at 10:12 PM Jean-Baptiste Onofré 
> wrote:
>
>> +1
>>
>> That's funny, because AFAIR, it's what we discussed in the early stage
>> of the project.
>> We can also link wiki pages on the website if we want to provide details.
>> AFAIR, we already have a confluence wiki space for Beam.
>>
>> Regards
>> JB
>>
>> On 07/06/2018 22:23, Kenneth Knowles wrote:
>> > Hi all,
>> >
>> > I've been in half a dozen conversations recently about whether to have a
>> > wiki and what to use it for. Some things I've heard:
>> >
>> >  - "why is all this stuff that users don't care about here?"
>> >  - "can we have a lighter weight place to put technical references for
>> > contributors"
>> >
>> > So I want to consider as a community starting up our wiki. Ideas for
>> > what could go there:
>> >
>> >  - Collection of links to design docs like
>> > https://beam.apache.org/contribute/design-documents/
>> >  - Specialized walkthroughs
>> > like https://beam.apache.org/contribute/docker-images/
>> >  - Best-effort notes that just try to help out
>> > like https://beam.apache.org/contribute/intellij/
>> >  - Docs on in-progress stuff
>> > like https://beam.apache.org/documentation/runners/jstorm/
>> >  - Expanded instructions for committers, more
>> > than https://beam.apache.org/contribute/committer-guide/
>> >  - BIPs / summaries of collections of JIRA
>> >  - Docs sitting in markdown in the repo
>> > like https://github.com/apache/beam/blob/master/sdks/CONTAINERS.md
>> > and https://github.com/apache/beam-site/blob/asf-site/README.md (which
>> > will soon not be a toplevel README)
>> >
>> > What do you think?
>> >
>> > (a) should we do it?
>> > (b) what should go there?
>> > (c) what should not go there?
>> >
>> > Kenn
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
> --
> Daniel Kulp
> dk...@apache.org - http://dankulp.com/blog
> Talend Community Coder - http://coders.talend.com
>
>


Re: [DISCUSS] Use Confluence wiki for non-user-facing stuff

2018-06-11 Thread Daniel Kulp
Neither you nor JB was set up to be able to do anything with the Wiki. I just
enabled JB and yourself (I assume “kenn” is you) to have permission to
administer the space, so you should be able to add other people.

Dan






-- 
Daniel Kulp
dk...@apache.org - http://dankulp.com/blog
Talend Community Coder - http://coders.talend.com


Re: Announcement & Proposal: HDFS tests on large cluster.

2018-06-11 Thread Kamil Szewczyk
Hi all,

as a positive outcome of extending the Kubernetes cluster (see the bottom of
https://builds.apache.org/view/A-D/view/Beam/job/beam_PerformanceTests_Analysis/37/consoleText
and the dedicated Slack channel
https://apachebeam.slack.com/messages/CAB3W69SS/), we can observe better
stability of the tests after the cluster resize. Most of the execution times
decreased slightly and, finally, all tests were executed and analysed.

Thanks,
Kamil Szewczyk



2018-06-08 13:13 GMT+02:00 Łukasz Gajowy :

> @Pablo this is exactly as Chamikara says. In fact, there is a dedicated
> Google Cloud project for the whole testing infrastructure (called
> "apache-beam-testing"). It provides the Kubernetes cluster for the data
> stores as well as BigQuery storage for the test results presented in the
> testing dashboard.
>
> @Alan thanks a lot!
>
> Best regards,
> Łukasz
>
>
>
> On Thu, Jun 7, 2018 at 22:37 Chamikara Jayalath 
> wrote:
>
>> We still use Jenkins machines to execute the test but data stores are
>> hosted in Kubernetes.
>>
>> On Thu, Jun 7, 2018 at 1:35 PM Pablo Estrada  wrote:
>>
>>> Just out of curiosity: This does not use the Jenkins machines then?
>>> -P.
>>>
>>> On Thu, Jun 7, 2018 at 1:33 PM Alan Myrvold  wrote:
>>>
 Done. Changed the size of the io-datastores kubernetes cluster in
 apache-beam-testing to 3 nodes.
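
For future reference, the resize Alan describes is a single gcloud
invocation. Below is a minimal sketch; the zone is an assumption (it is not
stated in this thread), so check it with `gcloud container clusters list`
first. The script only echoes the resize command so it can be reviewed
before actually running it:

```shell
#!/bin/sh
# Cluster and project names are taken from this thread; the zone is a
# placeholder (assumption) -- verify the real location before running.
CLUSTER=io-datastores
PROJECT=apache-beam-testing
ZONE=us-central1-a   # placeholder, not from the thread
NUM_NODES=3

# Print the resize command for review; drop the leading `echo` to run it.
echo gcloud container clusters resize "$CLUSTER" \
    --project "$PROJECT" --zone "$ZONE" --num-nodes "$NUM_NODES"

# After resizing, `kubectl get nodes` should list the new node count.
```

Note that older gcloud releases spelled the node-count flag `--size`
rather than `--num-nodes`.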

 On Thu, Jun 7, 2018 at 1:45 AM Kamil Szewczyk 
 wrote:

> Hi,
>
> the node pool size of the io-datastores Kubernetes cluster in the
> apache-beam-testing project needs to be changed from 1 to 3 (or another
> value). @Alan Myrvold has already been helpful with the Kubernetes cluster
> settings so far, but I am not aware of who makes decisions here, as
> this will increase the monthly billing.
>
> Kamil Szewczyk
>
> 2018-06-07 6:27 GMT+02:00 Kenneth Knowles :
>
>> This is rad. Another +1 from me for a bigger cluster. What do you
>> need to make that happen?
>>
>> Kenn
>>
>> On Wed, Jun 6, 2018 at 10:16 AM Pablo Estrada 
>> wrote:
>>
>>> This is really cool!
>>>
>>> +1 for having a cluster with more than one machine run the test.
>>>
>>> -P.
>>>
>>> On Wed, Jun 6, 2018 at 9:57 AM Chamikara Jayalath <
>>> chamik...@google.com> wrote:
>>>
 On Wed, Jun 6, 2018 at 5:19 AM Łukasz Gajowy <
 lukasz.gaj...@gmail.com> wrote:

> Hi all,
>
> I'd like to announce that, thanks to Kamil Szewczyk, since this PR
> we have 4 file-based HDFS tests run on a "Large HDFS Cluster"! More
> specifically, I mean:
>
> - beam_PerformanceTests_TextIOIT_HDFS
> - beam_PerformanceTests_Compressed_TextIOIT_HDFS
> - beam_PerformanceTests_AvroIOIT_HDFS
> - beam_PerformanceTests_XmlIOIT_HDFS
>
> The "Large HDFS Cluster" (in contrast to the small one, which is also
> available) consists of a master node and three data nodes, all in
> separate pods. Thanks to that we can mimic more real-life scenarios on
> HDFS (3 distributed nodes) and possibly run bigger tests, so there's
> progress! :)
>
>
 This is great. Also, it looks like results are available in the test
 dashboard:
 https://apache-beam-testing.appspot.com/explore?dashboard=5755685136498688
 (BTW we should add information about the dashboard to the testing doc:
 https://beam.apache.org/contribute/testing/)

> I'm currently working on proper documentation for this so that
> everyone can use it in IOITs (stay tuned).
>
> Regarding the above, I'd like to propose scaling up the Kubernetes
> cluster. AFAIK it currently consists of 1 node. If we scale it up to
> e.g. 3 nodes, the HDFS Kubernetes pods will distribute themselves
> across different machines rather than one, making it an even more
> "real-life" scenario (possibly more efficient?). Moreover, other
> Performance Tests (such as JDBC or mongo) could use more space for
> their infrastructure as well. Scaling up the cluster could also turn
> out useful for some future efforts, like BEAM-4508[1] (adapting and
> running some old IOITs on Jenkins).
> WDYT? Are there any objections?
>
 +1 for increasing the size of Kubernetes cluster.

>
> [1] https://issues.apache.org/jira/browse/BEAM-4508
>
> --
>>> Got feedback? go/pabloem-feedback