Re: [ANNOUNCE] Beam 2.17.0 Released!

2020-01-10 Thread jincheng sun
Thank you Mikhail!

Yichi Zhang wrote on Sat, Jan 11, 2020 at 09:09:

> Thank you Mikhail!
>
> On Fri, Jan 10, 2020 at 12:52 PM Ahmet Altay  wrote:
>
>> Thank you Mikhail!
>>
>> On Fri, Jan 10, 2020 at 12:40 PM Kyle Weaver  wrote:
>>
>>> Hooray! Thanks to Mikhail and everyone else who contributed.
>>>
>>> On Fri, Jan 10, 2020 at 10:23 AM Maximilian Michels 
>>> wrote:
>>>
>>>> At last :) Thank you for making it happen Mikhail! Also thanks to
>>>> everyone else who tested the release candidate.
>>>>
>>>> Cheers,
>>>> Max
>>>>
>>>> On 10.01.20 19:01, Mikhail Gryzykhin wrote:
>>>> > The Apache Beam team is pleased to announce the release of version
>>>> > 2.17.0.
>>>> >
>>>> > Apache Beam is an open source unified programming model to define and
>>>> > execute data processing pipelines, including ETL, batch and stream
>>>> > (continuous) processing. See https://beam.apache.org
>>>> >
>>>> > You can download the release here:
>>>> >
>>>> > https://beam.apache.org/get-started/downloads/
>>>> >
>>>> > This release includes bug fixes, features, and improvements detailed on
>>>> > the Beam blog:
>>>> > https://beam.apache.org/blog/2020/01/06/beam-2.17.0.html
>>>> >
>>>> > Thanks to everyone who contributed to this release, and we hope you
>>>> > enjoy using Beam 2.17.0.

>>> --

Best,
Jincheng
-
Twitter: https://twitter.com/sunjincheng121
-


Re: [VOTE] Vendored Dependencies Release

2020-01-10 Thread Kai Jiang
+1 (non-binding)

On Thu, Jan 9, 2020 at 8:48 PM jincheng sun 
wrote:

> +1, checked as follows:
>  - verified the hash and signature
>  - verified that there are no linkage errors
>  - verified that the content of the pom is as expected: the shaded
> dependencies are not exposed, the scope of the logging dependencies is
> runtime, etc.
>
> Best,
> Jincheng
>
> Kenneth Knowles wrote on Fri, Jan 10, 2020 at 12:29:
>
>> +1
>>
>> On Thu, Jan 9, 2020 at 4:03 PM Ahmet Altay  wrote:
>>
>>> +1
>>>
>>> On Thu, Jan 9, 2020 at 2:04 PM Pablo Estrada  wrote:
>>>
 +1

 verified sha1 and md5 hashes.

 On Thu, Jan 9, 2020 at 10:28 AM Luke Cwik  wrote:

> +1
>
> I validated that no classes appeared outside of the
> org.apache.beam.vendor.grpc.v1p26p0 namespace and I also validated that 
> the
> linkage checker listed no potential linkage errors.
>
> On Thu, Jan 9, 2020 at 10:25 AM Luke Cwik  wrote:
>
>> Please review the release of the following artifacts that we vendor:
>>  * beam-vendor-grpc-1_26_0
>>
>> Hi everyone,
>> Please review and vote on the release candidate #1 for the version
>> 0.1, as follows:
>> [ ] +1, Approve the release
>> [ ] -1, Do not approve the release (please provide specific comments)
>>
>>
>> The complete staging area is available for your review, which
>> includes:
>> * the official Apache source release to be deployed to
>> dist.apache.org [1], which is signed with the key with
>> fingerprint EAD5DE293F4A03DD2E77565589E68A56E371CCA2 [2],
>> * all artifacts to be deployed to the Maven Central Repository [3],
>> * commit hash "e60d49bdf1ed85e8f3efa1da784227f381a9e085" [4],
>>
>> The vote will be open for at least 72 hours. It is adopted by
>> majority approval, with at least 3 PMC affirmative votes.
>>
>> Thanks,
>> Release Manager
>>
>> [1] https://dist.apache.org/repos/dist/dev/beam/vendor/
>> [2] https://dist.apache.org/repos/dist/release/beam/KEYS
>> [3]
>> https://repository.apache.org/content/repositories/orgapachebeam-1089/
>> [4]
>> https://github.com/apache/beam/commit/e60d49bdf1ed85e8f3efa1da784227f381a9e085
>>
> --
>
> Best,
> Jincheng
> -
> Twitter: https://twitter.com/sunjincheng121
> -
>
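A minimal sketch of the namespace check described in this thread: list the .class entries of a jar and report any that fall outside the vendored package. The thread does not say which tool was actually used for this validation, so the helper names and the demo jar contents here are hypothetical; only the org.apache.beam.vendor.grpc.v1p26p0 namespace comes from the messages above.

```python
import io
import zipfile

# Namespace taken from the thread, written as a jar path prefix.
ALLOWED_PREFIX = "org/apache/beam/vendor/grpc/v1p26p0/"


def classes_outside_namespace(jar_bytes, allowed_prefix=ALLOWED_PREFIX):
    """Return the .class entries in a jar that are not under allowed_prefix."""
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        return sorted(
            name
            for name in jar.namelist()
            if name.endswith(".class") and not name.startswith(allowed_prefix)
        )


def build_demo_jar(entries):
    """Build an in-memory jar (a zip file) with empty entries, for the demo only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as jar:
        for name in entries:
            jar.writestr(name, b"")
    return buffer.getvalue()


# Made-up jar contents: one properly relocated jar and one that leaks a class.
clean_jar = build_demo_jar([
    "META-INF/MANIFEST.MF",
    "org/apache/beam/vendor/grpc/v1p26p0/io/grpc/Channel.class",
])
leaky_jar = build_demo_jar([
    "org/apache/beam/vendor/grpc/v1p26p0/io/grpc/Channel.class",
    "io/grpc/Channel.class",
])

print(classes_outside_namespace(clean_jar))
print(classes_outside_namespace(leaky_jar))
```

For a real artifact, the bytes would come from reading the staged jar file instead of build_demo_jar.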


Re: Go SplittableDoFn prototype and proposed changes

2020-01-10 Thread Robert Burke
Thank you for sharing Daniel!

Resolving SplittableDoFns for the Go SDK even just as far as initial
splitting will take the SDK that much closer to exiting its experimental
status.

It's especially exciting seeing this work on Flink and on the Python direct
runner!

On Fri, Jan 10, 2020, 5:36 PM Daniel Oliveira 
wrote:

> Hey Beam devs,
>
> So several months ago I posted my Go SDF proposal and got a lot of good
> feedback (thread, doc). Since then I've been working on
> implementing it and I've got an initial prototype ready to show off! It
> works with initial splitting on Flink, and has a decently documented API.
> In the second part of the email I'll also be proposing changes to the
> original doc, based on my experience working on this prototype.
>
> To be clear, this is *not* ready to officially go into Beam yet; the API
> is still likely to go through changes. Rather, I'm showing this off to show
> that progress is being made on SDFs, and to provide some context to the
> changes I'll be proposing below.
>
> Here's a link to the repo and branch so you can download it, and a link to
> the changes specifically:
> Repo: https://github.com/youngoli/beam/tree/gosdf
> Changes:
> https://github.com/apache/beam/commit/28140ee3471d6cb80e74a16e6fd108cc380d4831
>
> If you give it a try and have any thoughts, please let me know! I'm open
> to any and all feedback.
>
> ==
>
> Proposed Changes
> Doc: https://s.apache.org/beam-go-sdf (Select "Version 1" from version
> history.)
>
> For anyone reading this who hasn't already read the doc above, I suggest
> reading it first, since I'll be referring to concepts from it.
>
> After working on the prototype I've changed my mind on the original
> decisions to go with an interface approach and a combined restriction +
> tracker. But I don't want to go all in and create another doc with a
> detailed proposal, so I've laid out a brief summary of the changes to get
> some initial feedback before I go ahead and start working on these changes
> in detail. Please let me know what you think!
>
> *1. Change from native Go interfaces to dynamic reflection-based API.*
>
> Instead of the native Go interfaces (SplittableDoFn, RProvider, and
> RTracker) described in the doc and implemented in the prototype, use the
> same dynamic approach that the Go SDK already uses for DoFns: Use the
> reflection system to examine the names and signatures of methods in the
> user's DoFn, RProvider, and RTracker.
>
> Original approach reasoning:
>
>- Simpler, so faster to implement and less bug-prone.
>- The extra burden on the user to keep types consistent is ok since
>most users of SDFs are more advanced
>
> Change reasoning:
>
>- In the prototype, I found interfaces to require too much extra
>boilerplate which added more complexity than expected. (Examples: Constant
>casting,
>- More consistent API: Inconsistency between regular DoFns (dynamic)
>and SDF API (interfaces) was jarring and unintuitive when implementing SDFs
>as a user.
>
> Implementation: Full details are up for discussion, but the goal is to
> make the RProvider and  RTracker interfaces dynamic, so we can replace all
> instances of interface{} in the methods with the actual element types
> (i.e. fake generics). Also uses of the RProvider and RTracker interfaces in
> signatures can be replaced with the implementations of those
> providers/trackers. This will require a good amount of additional work in
> the DoFn validation codebase and the code generator. Plus a fair amount of
> additional user code validation will be needed and more testing since the
> new code is more complex.
>
> *2. Separate the restriction tracker and restriction.*
>
> Currently the API has the restriction combined with the tracker. In most
> other SDKs and within the SDF model, the two are usually separate concepts,
> and this change is to follow that approach and split the two.
>
> Original approach reasoning:
>
>- It was considered simpler to avoid another level of type casting in
>the API with the interface approach.
>
> Change reasoning:
>
>- We are no longer going with the interface approach. With "fake
>generics", it is simpler to keep the two concepts separate.
>- Requiring users to specify custom coders in order to only encode the
>restriction and not the tracker ended up adding additional complexity
>anyway.
>
> Implementation: In the API have the restriction tracker initialized with a
> restriction object accessible via a getter. The restriction itself will be
> the only thing serialized, so it will be wrapped and unwrapped with the
> tracker before the user code is invoked. This would add very little work
> as it would mostly be bundled with the interface->dynamic approach change.
>
>
> 

Go SplittableDoFn prototype and proposed changes

2020-01-10 Thread Daniel Oliveira
Hey Beam devs,

So several months ago I posted my Go SDF proposal and got a lot of good
feedback (thread, doc). Since then I've been working on
implementing it and I've got an initial prototype ready to show off! It
works with initial splitting on Flink, and has a decently documented API.
In the second part of the email I'll also be proposing changes to the
original doc, based on my experience working on this prototype.

To be clear, this is *not* ready to officially go into Beam yet; the API is
still likely to go through changes. Rather, I'm showing this off to show
that progress is being made on SDFs, and to provide some context to the
changes I'll be proposing below.

Here's a link to the repo and branch so you can download it, and a link to
the changes specifically:
Repo: https://github.com/youngoli/beam/tree/gosdf
Changes:
https://github.com/apache/beam/commit/28140ee3471d6cb80e74a16e6fd108cc380d4831

If you give it a try and have any thoughts, please let me know! I'm open to
any and all feedback.

==

Proposed Changes
Doc: https://s.apache.org/beam-go-sdf (Select "Version 1" from version
history.)

For anyone reading this who hasn't already read the doc above, I suggest
reading it first, since I'll be referring to concepts from it.

After working on the prototype I've changed my mind on the original
decisions to go with an interface approach and a combined restriction +
tracker. But I don't want to go all in and create another doc with a
detailed proposal, so I've laid out a brief summary of the changes to get
some initial feedback before I go ahead and start working on these changes
in detail. Please let me know what you think!

*1. Change from native Go interfaces to dynamic reflection-based API.*

Instead of the native Go interfaces (SplittableDoFn, RProvider, and
RTracker) described in the doc and implemented in the prototype, use the
same dynamic approach that the Go SDK already uses for DoFns: Use the
reflection system to examine the names and signatures of methods in the
user's DoFn, RProvider, and RTracker.

Original approach reasoning:

   - Simpler, so faster to implement and less bug-prone.
   - The extra burden on the user to keep types consistent is ok since most
   users of SDFs are more advanced

Change reasoning:

   - In the prototype, I found interfaces to require too much extra
   boilerplate which added more complexity than expected. (Examples: Constant
   casting,
   - More consistent API: Inconsistency between regular DoFns (dynamic) and
   SDF API (interfaces) was jarring and unintuitive when implementing SDFs as
   a user.

Implementation: Full details are up for discussion, but the goal is to make
the RProvider and  RTracker interfaces dynamic, so we can replace all
instances of interface{} in the methods with the actual element types (i.e.
fake generics). Also uses of the RProvider and RTracker interfaces in
signatures can be replaced with the implementations of those
providers/trackers. This will require a good amount of additional work in
the DoFn validation codebase and the code generator. Plus a fair amount of
additional user code validation will be needed and more testing since the
new code is more complex.
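To make the idea of a reflection-based API concrete, here is a rough analogy in Python rather than Go: instead of requiring a type to implement a fixed interface, a validator inspects the methods a user type happens to define, by name and parameter count. The method names and arities are hypothetical and are not taken from the proposal or the prototype; the Go SDK would do this with Go's reflection facilities and its own validation rules.

```python
import inspect

# Hypothetical method names and parameter counts (excluding self).
REQUIRED_METHODS = {
    "create_initial_restriction": 1,  # (element)
    "split_restriction": 2,           # (element, restriction)
    "create_tracker": 1,              # (restriction)
}


def validate_sdf(fn):
    """Return a list of problems found by inspecting fn's methods dynamically."""
    problems = []
    for name, arity in REQUIRED_METHODS.items():
        method = getattr(fn, name, None)
        if not callable(method):
            problems.append(f"missing method {name}")
            continue
        # For a bound method, the signature does not include self.
        params = inspect.signature(method).parameters
        if len(params) != arity:
            problems.append(
                f"{name} takes {len(params)} parameters, expected {arity}"
            )
    return problems


class GoodFn:
    def create_initial_restriction(self, element):
        return (0, len(element))

    def split_restriction(self, element, restriction):
        return [restriction]

    def create_tracker(self, restriction):
        return restriction


class BadFn:
    def create_initial_restriction(self, element):
        return (0, len(element))

    def split_restriction(self, restriction):  # wrong parameter count
        return [restriction]


print(validate_sdf(GoodFn()))
print(validate_sdf(BadFn()))
```

The trade-off matches the one described above: user code needs no casts or interface boilerplate, but errors that a compiler would catch with interfaces must now be caught by extra validation code.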

*2. Separate the restriction tracker and restriction.*

Currently the API has the restriction combined with the tracker. In most
other SDKs and within the SDF model, the two are usually separate concepts,
and this change is to follow that approach and split the two.

Original approach reasoning:

   - It was considered simpler to avoid another level of type casting in
   the API with the interface approach.

Change reasoning:

   - We are no longer going with the interface approach. With "fake
   generics", it is simpler to keep the two concepts separate.
   - Requiring users to specify custom coders in order to only encode the
   restriction and not the tracker ended up adding additional complexity
   anyway.

Implementation: In the API have the restriction tracker initialized with a
restriction object accessible via a getter. The restriction itself will be
the only thing serialized, so it will be wrapped and unwrapped with the
tracker before the user code is invoked. This would add very little work
as it would mostly be bundled with the interface->dynamic approach change.
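A small sketch of the separation, written in Python for brevity and not reflecting the Go SDK's actual types: the restriction is plain data and is the only thing encoded, while the tracker is rebuilt around the decoded restriction just before user code runs and exposes it through a getter. OffsetRange, OffsetTracker, the JSON encoding, and process are all hypothetical names chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class OffsetRange:
    """Hypothetical restriction: plain data, the only thing serialized."""
    start: int
    end: int


class OffsetTracker:
    """Hypothetical tracker: runtime state wrapped around a restriction."""

    def __init__(self, restriction):
        self._restriction = restriction
        self._last_claimed = None

    def restriction(self):
        """Getter for the wrapped restriction."""
        return self._restriction

    def try_claim(self, position):
        """Claim position if it lies inside the half-open range [start, end)."""
        if not (self._restriction.start <= position < self._restriction.end):
            return False
        self._last_claimed = position
        return True


def encode(restriction):
    """Encode only the restriction; tracker state never goes on the wire."""
    return json.dumps(asdict(restriction)).encode("utf-8")


def decode(data):
    return OffsetRange(**json.loads(data.decode("utf-8")))


def process(tracker):
    """Stand-in for user code: claims every position in the restriction."""
    claimed = []
    position = tracker.restriction().start
    while tracker.try_claim(position):
        claimed.append(position)
        position += 1
    return claimed


wire = encode(OffsetRange(3, 6))
tracker = OffsetTracker(decode(wire))  # wrap just before invoking user code
print(process(tracker))
```

Because only OffsetRange is encoded, no custom coder is needed to strip tracker state, which is the complexity the second bullet above refers to.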


Thanks,
Daniel Oliveira


Re: [RELEASE] Tracking 2.18

2020-01-10 Thread Udi Meiri
RC1 is almost ready, but Nexus login is down due to LDAP issues with Apache.

On Mon, Dec 16, 2019 at 9:53 AM Udi Meiri  wrote:

> The remaining 4 open blockers all have recently merged cherrypicks (at
> least 1 blocker is waiting on verification since it's a release process
> issue).
>
> Will attempt an RC today.
>
> On Thu, Dec 12, 2019 at 5:33 PM Udi Meiri  wrote:
>
>> Also marked 3 Jiras from these cherrypicks as blockers.
>> Current open blocker count: 7.
>>
>> On Thu, Dec 12, 2019 at 5:21 PM Udi Meiri  wrote:
>>
>>> Just merged 6 PRs. :)
>>>
>>> On Thu, Dec 12, 2019 at 4:52 PM Udi Meiri  wrote:
>>>
 Update: I'm accepting cherrypicks with failing tests if the
 corresponding PRs have passed them on master.

 I recall (without proof) that in the past, even with released worker
 containers for the in-process release, ITs against the release branch
 still failed.

 On Tue, Dec 10, 2019 at 10:58 AM Udi Meiri  wrote:

> Re: cherrypicks on top of the release-2.18.0 branch
> The precommit tests are failing most likely due to some integration
> tests (wordcount, etc.) that are expecting the new 2.18 worker on 
> Dataflow.
> I'm working on building an initial version of that worker so that the
> tests may pass.
>
> On Thu, Dec 5, 2019 at 4:39 PM Robert Bradshaw 
> wrote:
>
>> Yeah, so I saw...
>>
>> On Thu, Dec 5, 2019 at 4:31 PM Udi Meiri  wrote:
>> >
>> > Sorry Robert the release was already cut yesterday.
>> >
>> >
>> >
>> > On Thu, Dec 5, 2019 at 8:37 AM Ismaël Mejía 
>> wrote:
>> >>
>> >> Colm, I just merged your PR and cherry picked it into 2.18.0
>> >> https://github.com/apache/beam/pull/10296
>> >>
>> >> On Thu, Dec 5, 2019 at 10:54 AM jincheng sun <
>> sunjincheng...@gmail.com> wrote:
>> >>>
>> >>> Thanks for the Tracking Udi!
>> >>>
>> >>> I have updated the status of some release blockers issues as
>> follows:
>> >>>
>> >>> - BEAM-8733 closed
>> >>> - BEAM-8620 reset the fix version to 2.19
>> >>> - BEAM-8618 reset the fix version to 2.19
>> >>>
>> >>> Best,
>> >>> Jincheng
>> >>>
>> >>> Colm O hEigeartaigh wrote on Thu, Dec 5, 2019 at 5:38 PM:
>> 
>>  Could we get this one in 2.18 as well?
>> https://issues.apache.org/jira/browse/BEAM-8861
>> 
>>  Colm.
>> 
>>  On Wed, Dec 4, 2019 at 8:02 PM Udi Meiri 
>> wrote:
>> >
>> > Following the release calendar, I plan on cutting the 2.18
>> release branch today.
>> >
>> > There are currently 8 release blockers.
>> >
>>
>



Re: [PROPOSAL] Transition released containers to the official ASF dockerhub organization

2020-01-10 Thread Ahmet Altay
On Fri, Jan 10, 2020 at 3:33 PM Ahmet Altay  wrote:

>
>
> On Fri, Jan 10, 2020 at 3:32 PM Ankur Goenka  wrote:
>
>> Also curious to know if Apache provides any infra support for projects
>> under the Apache umbrella and any quota limits they might have.
>>
>
Maybe Hannah can ask with an infra ticket?


>
>> On Fri, Jan 10, 2020, 2:26 PM Robert Bradshaw 
>> wrote:
>>
>>> One downside is that, unlike many of these projects, we release a
>>> dozen or so containers. Is there exactly (and only) one level of
>>> namespacing/nesting we can leverage here? (This isn't a blocker, but
>>> something to consider.)
>>>
>>
> After a quick search, I could not find a way to use more than one level of
> repositories. We can adapt the naming scheme we currently use to help with this.
> Our repositories are named as apachebeam/X, we could start using
> apache/beam/X.
>
>
>>
>>> On Fri, Jan 10, 2020 at 2:06 PM Hannah Jiang 
>>> wrote:
>>> >
>>> > Thanks Ahmet for proposing it.
>>> > I will take it and work towards v2.19.
>>>
>>
Missed this part. Thank you Hannah!


> >
>>> > Hannah
>>> >
>>> > On Fri, Jan 10, 2020 at 1:50 PM Kyle Weaver 
>>> wrote:
>>> >>
>>> >> It'd be nice to have the clout/official sheen of apache attached to
>>> our containers. Although getting the required permissions might add some
>>> small overhead to the release process. For example, yesterday, when we
>>> needed to create new repositories (not just update existing ones), since we
>>> have top-level ownership of the apachebeam organization, it was quick and
>>> easy to add them. I imagine we'd have had to get approval from someone
>>> outside the project to do that under the apache org. But this won't need to
>>> happen very often, so it's probably not that big a deal.
>>> >>
>>> >> On Fri, Jan 10, 2020 at 1:40 PM Ahmet Altay  wrote:
>>> >>>
>>> >>> Hi all,
>>> >>>
>>> >>> I saw recent progress on the containers and wanted to bring this
>>> question to the attention of the dev list.
>>> >>>
>>> >>> Would it be possible to use the official ASF dockerhub organization
>>> for new Beam container releases? Concretely, starting from 2.19 could we
>>> release Beam containers to https://hub.docker.com/u/apache instead of
>>> https://hub.docker.com/u/apachebeam ?
>>> >>>
>>> >>> Ahmet
>>>
>>


Re: [ANNOUNCE] Beam 2.17.0 Released!

2020-01-10 Thread Yichi Zhang
Thank you Mikhail!

On Fri, Jan 10, 2020 at 12:52 PM Ahmet Altay  wrote:

> Thank you Mikhail!
>
> On Fri, Jan 10, 2020 at 12:40 PM Kyle Weaver  wrote:
>
>> Hooray! Thanks to Mikhail and everyone else who contributed.
>>
>> On Fri, Jan 10, 2020 at 10:23 AM Maximilian Michels 
>> wrote:
>>
>>> At last :) Thank you for making it happen Mikhail! Also thanks to
>>> everyone else who tested the release candidate.
>>>
>>> Cheers,
>>> Max
>>>
>>> On 10.01.20 19:01, Mikhail Gryzykhin wrote:
>>> > The Apache Beam team is pleased to announce the release of version
>>> 2.17.0.
>>> >
>>> > Apache Beam is an open source unified programming model to define and
>>> > execute data processing pipelines, including ETL, batch and stream
>>> > (continuous) processing. See https://beam.apache.org
>>> > 
>>> >
>>> > You can download the release here:
>>> >
>>> > https://beam.apache.org/get-started/downloads/
>>> >
>>> > This release includes bug fixes, features, and improvements detailed on
>>> > the Beam blog:
>>> https://beam.apache.org/blog/2020/01/06/beam-2.17.0.html
>>> > 
>>> >
>>> > Thanks to everyone who contributed to this release, and we hope you
>>> > enjoy using Beam 2.17.0.
>>>
>>


Re: Cleaning up SDK docker image tagging

2020-01-10 Thread Robert Bradshaw
On Fri, Jan 10, 2020 at 3:30 PM Kyle Weaver  wrote:
>
> > Does cloning a release, modifying the docker file, and building the
> > containers create a "new" container with a default release tag? If so,
> > we should discourage that
>
> Yes, and agreed. The doc you linked already mentions how to customize tags, 
> maybe we could also recommend the user always makes their own tag whenever 
> changing a released image.

I think we should discourage checking out the code and modifying the
docker file in place, but that's another discussion.
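A toy model of the two lookup strategies compared in this thread, "pull & run" versus "run only", using plain dictionaries instead of a real Docker daemon or registry. The function name, the image IDs, and the fallback to the local copy when the remote tag is missing are assumptions made for illustration; they are not Beam or Docker APIs.

```python
def resolve_image(tag, local, remote, pull_first):
    """Return the image ID that would run for tag.

    local and remote map tags to image IDs. With pull_first=True the remote
    image replaces any local image with the same tag ("pull & run"). With
    pull_first=False the remote is consulted only when the tag is missing
    locally ("run only").
    """
    if (pull_first or tag not in local) and tag in remote:
        local[tag] = remote[tag]  # pulling overwrites the local copy of the tag
    if tag not in local:
        raise LookupError(f"image {tag} not found locally or remotely")
    return local[tag]


# Made-up image IDs: a published release and a locally customized rebuild
# that reuses the release tag.
remote = {"2.17.0": "sha256:released"}
customized = {"2.17.0": "sha256:customized"}

print(resolve_image("2.17.0", dict(customized), remote, pull_first=True))
print(resolve_image("2.17.0", dict(customized), remote, pull_first=False))
print(resolve_image("2.17.0", {}, remote, pull_first=False))
```

The second call shows why a customized image that keeps the default tag is only safe under "run only", and why a distinct tag avoids the ambiguity altogether.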

> On Fri, Jan 10, 2020 at 2:33 PM Robert Bradshaw  wrote:
>>
>> On Fri, Jan 10, 2020 at 12:48 PM Kyle Weaver  wrote:
>> >
>> > > Shall we ALSO tag the image with git commit version for local build to 
>> > > keep track of obsolete images.
>> >
>> > This would mean we would have to be able to access the git commit from the 
>> > source, which might not be trivial (right now the Beam version e.g. 
>> > "2.18.0.dev" is hard-coded in some properties files). And the way it is 
>> > now keeps things simple and easy to read.
>>
>> It also means that as you're developing, you don't generate a long
>> trail of named containers that you'll never access again but are
>> harder to automatically prune.
>>
>> > > we can assume the images with the same tag are always identical
>>
>> This is only true if a developer never builds a container without
>> committing any local changes first.
>>
>> Image tags are like git tags. They also have hashes (like commit ids)
>> if one wants to ensure one is pointing to the exact same thing.
>>
>> > So far that's always been the case, but in case there are problems with 
>> > the published container images and we have to update them, we want to make 
>> > sure everyone gets the most up-to-date image [1].
>> >
>> > > 1. pull only when needed, so reduce unnecessary traffic for users.
>> >
>> > `docker pull` starts by checking if the local image is up-to-date with the 
>> > remote, and most of the time it will be, so no more network usage beyond 
>> > that is needed.
>> >
>> > > In case a user customize the image and rebuild it with the default tag
>> >
>> > The user should never need to build an image with the default release tag 
>> > (e.g. 2.17.0). They will use either the .dev tag (the default) or even 
>> > better, their own custom tag. (I suppose we can't stop users from manually 
>> > tagging their own container with the release tag, but most people should 
>> > know better.)
>>
>> Does cloning a release, modifying the docker file, and building the
>> containers create a "new" container with a default release tag? If so,
>> we should discourage that:
>> https://beam.apache.org/documentation/runtime/environments/#modifying-dockerfiles
>>
>> > > make it consistent for all languages
>> >
>> > Forgot to reply to this point -- I agree, +1.
>>
>> Also +1
>>
>> > [1] 
>> > https://lists.apache.org/thread.html/7b5599f142785e616a1e943ff1c3da5213de370ed193373e01991bb6%40%3Cdev.beam.apache.org%3E
>> >
>> > On Fri, Jan 10, 2020 at 9:52 AM Hannah Jiang  
>> > wrote:
>> >>
>> >> >> This has a minor downside for the users who are using unreleased 
>> >> >> versions. They need to build a local image first before using docker 
>> >> >> to run.
>> >> > Isn't that the current behavior?
>> >>
>> >> Our current behavior is pull & run. So in case both local and remote 
>> >> images are available, the local image is getting overwritten by the 
>> >> remote image.
>> >> The new approach will do run only, which will pull remote images only when 
>> >> local images are not available. Since we don't deploy different images 
>> >> with the same tag, we can assume the images with the same tag are always 
>> >> identical, unless a user customized it with the same tag.
>> >>
>> >> This has the following advantages.
>> >> 1. pull only when needed, so reduce unnecessary traffic for users.
>> >> 2. In case a user customizes the image and rebuilds it with the default 
>> >> tag, the local customized image is used as expected. With pull & run, 
>> >> the remote image, instead of the customized image, is used.
>> >>
>> >> On Thu, Jan 9, 2020 at 4:54 PM Kyle Weaver  wrote:
>> >>>
>> >>> > This has a minor downside for the users who are using unreleased 
>> >>> > versions. They need to build a local image first before using docker 
>> >>> > to run.
>> >>>
>> >>> Isn't that the current behavior?
>> >>>
>> >>> On Thu, Jan 9, 2020 at 4:48 PM Hannah Jiang  
>> >>> wrote:
>> 
>>  Hi Community
>> 
>>  Now we are using different default tags for Python(version or 
>>  version.dev), Java(version-SNAPSHOT) and Go(latest). I would like to 
>>  clean it up and make it consistent for all languages and here is my 
>>  proposal.
>> 
>>  For the released version of SDKs, the default tag will be version 
>>  number. (ex: 2.17.0)
>>  For the unreleased version of SDKs, the default tag will be version 
>>  number + '.dev'. (ex: 2.18.0.dev)
>> 
>>  The default 

Re: [PROPOSAL] Transition released containers to the official ASF dockerhub organization

2020-01-10 Thread Ahmet Altay
On Fri, Jan 10, 2020 at 3:32 PM Ankur Goenka  wrote:

> Also curious to know if Apache provides any infra support for projects
> under the Apache umbrella and any quota limits they might have.
>
> On Fri, Jan 10, 2020, 2:26 PM Robert Bradshaw  wrote:
>
>> One downside is that, unlike many of these projects, we release a
>> dozen or so containers. Is there exactly (and only) one level of
>> namespacing/nesting we can leverage here? (This isn't a blocker, but
>> something to consider.)
>>
>
After a quick search, I could not find a way to use more than one level of
repositories. We can adapt the naming scheme we currently use to help with this.
Our repositories are named as apachebeam/X, we could start using
apache/beam/X.


>
>> On Fri, Jan 10, 2020 at 2:06 PM Hannah Jiang 
>> wrote:
>> >
>> > Thanks Ahmet for proposing it.
>> > I will take it and work towards v2.19.
>> >
>> > Hannah
>> >
>> > On Fri, Jan 10, 2020 at 1:50 PM Kyle Weaver 
>> wrote:
>> >>
>> >> It'd be nice to have the clout/official sheen of apache attached to
>> our containers. Although getting the required permissions might add some
>> small overhead to the release process. For example, yesterday, when we
>> needed to create new repositories (not just update existing ones), since we
>> have top-level ownership of the apachebeam organization, it was quick and
>> easy to add them. I imagine we'd have had to get approval from someone
>> outside the project to do that under the apache org. But this won't need to
>> happen very often, so it's probably not that big a deal.
>> >>
>> >> On Fri, Jan 10, 2020 at 1:40 PM Ahmet Altay  wrote:
>> >>>
>> >>> Hi all,
>> >>>
>> >>> I saw recent progress on the containers and wanted to bring this
>> question to the attention of the dev list.
>> >>>
>> >>> Would it be possible to use the official ASF dockerhub organization
>> for new Beam container releases? Concretely, starting from 2.19 could we
>> release Beam containers to https://hub.docker.com/u/apache instead of
>> https://hub.docker.com/u/apachebeam ?
>> >>>
>> >>> Ahmet
>>
>


Re: [PROPOSAL] Transition released containers to the official ASF dockerhub organization

2020-01-10 Thread Ankur Goenka
Also curious to know if Apache provides any infra support for projects under
the Apache umbrella and any quota limits they might have.

On Fri, Jan 10, 2020, 2:26 PM Robert Bradshaw  wrote:

> One downside is that, unlike many of these projects, we release a
> dozen or so containers. Is there exactly (and only) one level of
> namespacing/nesting we can leverage here? (This isn't a blocker, but
> something to consider.)
>
> On Fri, Jan 10, 2020 at 2:06 PM Hannah Jiang 
> wrote:
> >
> > Thanks Ahmet for proposing it.
> > I will take it and work towards v2.19.
> >
> > Hannah
> >
> > On Fri, Jan 10, 2020 at 1:50 PM Kyle Weaver  wrote:
> >>
> >> It'd be nice to have the clout/official sheen of apache attached to our
> containers. Although getting the required permissions might add some small
> overhead to the release process. For example, yesterday, when we needed to
> create new repositories (not just update existing ones), since we have
> top-level ownership of the apachebeam organization, it was quick and easy
> to add them. I imagine we'd have had to get approval from someone outside
> the project to do that under the apache org. But this won't need to happen
> very often, so it's probably not that big a deal.
> >>
> >> On Fri, Jan 10, 2020 at 1:40 PM Ahmet Altay  wrote:
> >>>
> >>> Hi all,
> >>>
> >>> I saw recent progress on the containers and wanted to bring this
> question to the attention of the dev list.
> >>>
> >>> Would it be possible to use the official ASF dockerhub organization
> for new Beam container releases? Concretely, starting from 2.19 could we
> release Beam containers to https://hub.docker.com/u/apache instead of
> https://hub.docker.com/u/apachebeam ?
> >>>
> >>> Ahmet
>


Re: Cleaning up SDK docker image tagging

2020-01-10 Thread Kyle Weaver
> Does cloning a release, modifying the docker file, and building the
> containers create a "new" container with a default release tag? If so,
> we should discourage that

Yes, and agreed. The doc you linked already mentions how to customize tags,
maybe we could also recommend the user always makes their own tag whenever
changing a released image.
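The default-tag rule from the proposal quoted below (version number for released SDKs, version number plus '.dev' for unreleased ones), combined with the recommendation above, could be sketched like this. The function name and the error raised for a modified release image without a custom tag are hypothetical; Beam's build does not necessarily enforce anything like it.

```python
def choose_tag(version, released, modified=False, custom_tag=None):
    """Pick an image tag following the rule discussed in this thread.

    released: True for a released SDK version, False for a development build.
    modified: True if the Dockerfile was changed relative to the release.
    custom_tag: a user-supplied tag, which always wins when given.
    """
    if custom_tag:
        return custom_tag
    if released and modified:
        # Hypothetical guard: refuse to reuse the release tag for changed images.
        raise ValueError("modified release images should get a custom tag")
    return version if released else f"{version}.dev"


print(choose_tag("2.17.0", released=True))
print(choose_tag("2.18.0", released=False))
print(choose_tag("2.17.0", released=True, modified=True, custom_tag="2.17.0-mycompany"))
```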

On Fri, Jan 10, 2020 at 2:33 PM Robert Bradshaw  wrote:

> On Fri, Jan 10, 2020 at 12:48 PM Kyle Weaver  wrote:
> >
> > > Shall we ALSO tag the image with git commit version for local build to
> keep track of obsolete images.
> >
> > This would mean we would have to be able to access the git commit from
> the source, which might not be trivial (right now the Beam version e.g. "
> 2.18.0.dev" is hard-coded in some properties files). And the way it is
> now keeps things simple and easy to read.
>
> It also means that as you're developing, you don't generate a long
> trail of named containers that you'll never access again but are
> harder to automatically prune.
>
> > > we can assume the images with the same tag are always identical
>
> This is only true if a developer never builds a container without
> committing any local changes first.
>
> Image tags are like git tags. They also have hashes (like commit ids)
> if one wants to ensure one is pointing to the exact same thing.
>
> > So far that's always been the case, but in case there are problems with
> the published container images and we have to update them, we want to make
> sure everyone gets the most up-to-date image [1].
> >
> > > 1. pull only when needed, so reduce unnecessary traffic for users.
> >
> > `docker pull` starts by checking if the local image is up-to-date with
> the remote, and most of the time it will be, so no more network usage
> beyond that is needed.
> >
> > > In case a user customize the image and rebuild it with the default tag
> >
> > The user should never need to build an image with the default release
> tag (e.g. 2.17.0). They will use either the .dev tag (the default) or even
> better, their own custom tag. (I suppose we can't stop users from manually
> tagging their own container with the release tag, but most people should
> know better.)
>
> Does cloning a release, modifying the docker file, and building the
> containers create a "new" container with a default release tag? If so,
> we should discourage that:
>
> https://beam.apache.org/documentation/runtime/environments/#modifying-dockerfiles
>
> > > make it consistent for all languages
> >
> > Forgot to reply to this point -- I agree, +1.
>
> Also +1
>
> > [1]
> https://lists.apache.org/thread.html/7b5599f142785e616a1e943ff1c3da5213de370ed193373e01991bb6%40%3Cdev.beam.apache.org%3E
> >
> > On Fri, Jan 10, 2020 at 9:52 AM Hannah Jiang 
> wrote:
> >>
> >> >> This has a minor downside for the users who are using unreleased
> versions. They need to build a local image first before using docker to run.
> >> > Isn't that the current behavior?
> >>
> >> Our current behavior is pull & run. So in case both local and remote
> images are available, the local image is getting overwritten by the remote
> image.
> >> The new approach will do run only, which will pull remote images only
> when local images are not available. Since we don't deploy different images
> with the same tag, we can assume the images with the same tag are always
> identical, unless a user customized it with the same tag.
> >>
> >> This has the following advantages.
> >> 1. pull only when needed, so reduce unnecessary traffic for users.
> >> 2. In case a user customizes the image and rebuilds it with the default
> tag, the local customized image is used as expected. With pull & run,
> the remote image, instead of the customized image, is used.
> >>
> >> On Thu, Jan 9, 2020 at 4:54 PM Kyle Weaver  wrote:
> >>>
> >>> > This has a minor downside for the users who are using unreleased
> versions. They need to build a local image first before using docker to run.
> >>>
> >>> Isn't that the current behavior?
> >>>
> >>> On Thu, Jan 9, 2020 at 4:48 PM Hannah Jiang 
> wrote:
> 
> >>>> Hi Community
> >>>>
> >>>> Now we are using different default tags for Python(version or
> version.dev), Java(version-SNAPSHOT) and Go(latest). I would like to
> clean it up and make it consistent for all languages and here is my
> proposal.
> >>>>
> >>>> For the released version of SDKs, the default tag will be version
> number. (ex: 2.17.0)
> >>>> For the unreleased version of SDKs, the default tag will be version
> number + '.dev'. (ex: 2.18.0.dev)
> >>>>
> >>>> The default tag is used 1). when we build docker images without
> specifying a tag. 2) when we run a job with runners running on dockers with
> default docker images.
> >>>>
> >>>> Additionally, Beam will always lookup images locally before pulling
> one from remote, so the images built locally will not be overwritten by
> remote ones.
> >>>>
> >>>> This has a minor downside for the users who are using unreleased
> versions. They need to 

Re: Custom window invariants and

2020-01-10 Thread Aaron Dixon
Once again this is a great help, thank you Kenneth
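
For anyone who finds this thread later, here is a minimal sketch of the
bookkeeping described in the explanation quoted below. It is a plain-Python
model for illustration only, not Beam's actual implementation: the dict
layout and the names merge and check_disjoint are assumptions made up for
this example.

```python
def merge(state_address_windows, to_merge, result):
    """Merge the windows in to_merge into result.

    The merged window inherits the state address windows of every window
    it replaces, so reads from `result` cover all pre-merge state.
    """
    merged = []
    for window in to_merge:
        # A window that was never merged stores its state under itself.
        for address in state_address_windows.pop(window, [window]):
            if address not in merged:
                merged.append(address)
    state_address_windows[result] = merged


def check_disjoint(state_address_windows):
    """Raise if a pre-merge window feeds more than one merged window."""
    owner = {}
    for active, addresses in state_address_windows.items():
        for address in addresses:
            if address in owner:
                raise ValueError(
                    f"{address} is in more than one state address window "
                    f"set: {owner[address]} and {active}"
                )
            owner[address] = active


# A and B merge into AB; C is untouched.
windows = {"A": ["A"], "B": ["B"], "C": ["C"]}
merge(windows, ["A", "B"], "AB")
check_disjoint(windows)  # passes: AB -> [A, B], C -> [C]

# The situation behind the error: B contributes to both AB and BC.
broken = {"AB": ["A", "B"], "BC": ["B", "C"]}
```

Calling check_disjoint(broken) raises, which mirrors the internal error in
the stack trace: a merge result that contradicts an earlier merge leaves B
owned by two windows.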

On Wed, Jan 8, 2020 at 3:03 PM Kenneth Knowles  wrote:

> Hmm. I've seen this manifest in some other tweaked versions of Sessions.
> Your invariants are right. In fact, the Nexmark queries have auctions that
> truncate in a similar way. This prompted
> https://issues.apache.org/jira/browse/BEAM-654.  I think we have not
> really nailed down the right spec for merging, and we certainly aren't
> enforcing it. To be robust, your merging should be associative and
> commutative, which means that you can't have an "end of session" event that
> contradicts a merge that occurred. OTOH I also know that Tyler has hacked
> window functions that split... it is mostly unexplored, semantically.
>
> About the error, this may help debug: The "state address windows" for a
> given merged window are all the windows that contribute to it. This means
> that when windows A and B merge to become a window AB, we can leave the
> accumulated state stored with A and B and just note that when we read from
> AB we actually have to read from both A and B*. So suppose windows A and B
> are about to merge. Before merge, the state address window map is:
>
> A -> [A]
> B -> [B]
>
> After the merge, there is a new window AB and the "window to state address
> window" mapping is:
>
> AB -> [A, B]
>
> The error means that there is more than one merged window that will read
> data from a pre-merged window. So there is a situation like
>
> AB -> [A, B]
> BC -> [B, C]
>
> This is not intended to happen. It would be the consequence of B merging
> into two different new windows. Hence it is an internal error, most likely
> a bug or a mismatch in assumptions. Note that this code/logic is
> shared by all runners. I do think you can write a WindowFn that induces it.
>
> Kenn
>
> *this was intended to be a performance optimization, but eagerly copying
> the data turned out to be faster, so now it is a legacy compatibility thing
> that I think we could remove, but changing this code is tricky
>
> On Tue, Jan 7, 2020 at 3:27 PM Aaron Dixon  wrote:
>
>> What I'm attempting is a variation on Session windows in which there may
>> exist a "terminal" element in the stream that immediately stops the session
>> (or perhaps after some configured delay.)
>>
>> My implementation behaves just like Sessions until any such "terminal"
>> element is encountered in which case I mark the window as "terminal" and
>> all windows "merge down" such that any terminal windows get to dictate the
>> Interval.end()/Window.maxTimestamp().
>>
>> So, trivial example, if I have windows W1 [0, 100) and W2 [50, 75,
>> terminal = true] then the merged result will be W3 [0, 75).
>>
>> I've been successful doing this so far but I've been inferring some
>> invariants about windows that I'm not sure are official or documented
>> anywhere.
>>
>> The invariants that I've inferred go like this:
>>
>> (I) Definition. An element is "in" window W if it originated in W or in a
>> window that was merged into W (recursively).
>>
>> (II) Invariant. Any element, e, in window W MUST have e.timestamp <=
>> W.maxTimestamp().
>>
>> So far, I think this is obvious and true stuff (I hope). (It would
>> actually be better or great if there was a way for II to not have to hold,
>> but that is a whole other separate discussion I think.)
>>
>> The main invariant I'm trying to formalize is one that allows me to
>> "merge down" -- i.e., to merge in such a way that the merged window's
>> (mergedResult's) maxTimestamp *is less than* one of the source's
>> (toBeMerged's) windows' maxTimestamp.
>>
>> The (undocumented?) invariant I've been working from goes something like
>> this:
>>
>> (III) Corollary. Windows W1 and W2 can merge such that either
>> maxTimestamp() is regressed (moved backward in time aka "merge down") in
>> the merged window -- however they cannot merge such that (II) is ever
>> violated.
>>
>> Is this correct?
>>
>> (If this can be confirmed, I'll go back and ensure I'm not
>> violating the merge() precondition and these invariants, and post some code
>> if needed.) Thank you for your assistance here!
>>
>>
>> On Tue, Jan 7, 2020 at 4:21 PM Reuven Lax  wrote:
>>
>>> Have you used Dataflow's update feature on this pipeline? Also, do
>>> you have the code for your WindowFn?
>>>
>>> On Tue, Jan 7, 2020 at 12:05 PM Aaron Dixon  wrote:
>>>
>>>> Dataflow. (See stacktrace)
>>>>
>>>> On Tue, Jan 7, 2020 at 1:50 PM Reuven Lax  wrote:
>>>>
>>>>> Which runner are you using?
>>>>>
>>>>> On Tue, Jan 7, 2020, 11:17 AM Aaron Dixon  wrote:
>>>>>
>>>>>> I get an IllegalStateException " is in more than one state
>>>>>> address window set" (stacktrace below).
>>>>>>
>>>>>> What does this mean? What invariant of custom window implementation
>>>>>> & merging am I violating?
>>>>>>
>>>>>> Thank you for any advice.
>>>>>>
>>>>>> ```
>>>>>> java.lang.IllegalStateException:
>>>>>> {[2019-12-05T01:36:48.870Z..2019-12-05T01:36:48.871Z),terminal} is in 
>>>>>> more
>>>>>> than 

Re: Cleaning up SDK docker image tagging

2020-01-10 Thread Robert Bradshaw
On Fri, Jan 10, 2020 at 12:48 PM Kyle Weaver  wrote:
>
> > Shall we ALSO tag the image with git commit version for local build to keep 
> > track of obsolete images.
>
> This would mean we would have to be able to access the git commit from the 
> source, which might not be trivial (right now the Beam version e.g. 
> "2.18.0.dev" is hard-coded in some properties files). And the way it is now 
> keeps things simple and easy to read.

It also means that as you're developing, you don't generate a long
trail of named containers that you'll never access again but are
harder to automatically prune.

> > we can assume the images with the same tag are always identical

This is only true if a developer never builds a container without
committing any local changes first.

Image tags are like git tags. They also have hashes (like commit ids)
if one wants to ensure one is pointing to the exact same thing.
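
A rough sketch of that distinction, assuming the usual repository:tag and
repository@sha256:... reference forms. The function name is made up for
illustration, and the parsing is deliberately simplistic.

```python
def split_image_ref(ref):
    """Split an image reference into (repository, tag, digest).

    A tag is a movable name, like a git tag; a digest identifies exact
    content, like a commit id. Either part may be absent (None).
    """
    name, _, digest = ref.partition("@")
    last_slash = name.rfind("/")
    colon = name.rfind(":")
    # A colon before the last slash belongs to a registry host:port.
    if colon > last_slash:
        repository, tag = name[:colon], name[colon + 1:]
    else:
        repository, tag = name, None
    return repository, tag, (digest or None)


def is_pinned(ref):
    """True only when the reference names exact content by digest."""
    return split_image_ref(ref)[2] is not None
```

With a hypothetical image name, is_pinned("apachebeam/some_sdk:2.17.0") is
False because the tag could later point at a rebuilt image, while the same
repository referenced with @sha256:<digest> is pinned.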

> So far that's always been the case, but in case there are problems with the 
> published container images and we have to update them, we want to make sure 
> everyone gets the most up-to-date image [1].
>
> > 1. pull only when needed, so reduce unnecessary traffic for users.
>
> `docker pull` starts by checking if the local image is up-to-date with the 
> remote, and most of the time it will be, so no more network usage beyond that 
> is needed.
>
> > In case a user customize the image and rebuild it with the default tag
>
> The user should never need to build an image with the default release tag 
> (e.g. 2.17.0). They will use either the .dev tag (the default) or even 
> better, their own custom tag. (I suppose we can't stop users from manually 
> tagging their own container with the release tag, but most people should know 
> better.)

Does cloning a release, modifying the docker file, and building the
containers create a "new" container with a default release tag? If so,
we should discourage that:
https://beam.apache.org/documentation/runtime/environments/#modifying-dockerfiles

> > make it consistent for all languages
>
> Forgot to reply to this point -- I agree, +1.

Also +1

> [1] 
> https://lists.apache.org/thread.html/7b5599f142785e616a1e943ff1c3da5213de370ed193373e01991bb6%40%3Cdev.beam.apache.org%3E
>
> On Fri, Jan 10, 2020 at 9:52 AM Hannah Jiang  wrote:
>>
>> >> This has a minor downside for the users who are using unreleased 
>> >> versions. They need to build a local image first before using docker to 
>> >> run.
>> > Isn't that the current behavior?
>>
>> Our current behavior is pull & run. So in case both local and remote images 
>> are available, the local image is getting overwritten by the remote image.
>> A New approach will do run only, which will pull remote images only when 
>> local images are not available. Since we don't deploy different images with 
>> the same tag, we can assume the images with the same tag are always 
>> identical, unless a user customized it with the same tag.
>>
>> This has the following advantages.
>> 1. pull only when needed, so reduce unnecessary traffic for users.
>> 2. In case a user customize the image and rebuild it with the default tag, 
>> the local customized image is used as expected. With pull & run, remote 
>> image, instead of the customized image, is used.
>>
>> On Thu, Jan 9, 2020 at 4:54 PM Kyle Weaver  wrote:
>>>
>>> > This has a minor downside for the users who are using unreleased 
>>> > versions. They need to build a local image first before using docker to 
>>> > run.
>>>
>>> Isn't that the current behavior?
>>>
>>> On Thu, Jan 9, 2020 at 4:48 PM Hannah Jiang  wrote:

>>>> Hi Community
>>>>
>>>> Now we are using different default tags for Python(version or
>>>> version.dev), Java(version-SNAPSHOT) and Go(latest). I would like to clean
>>>> it up and make it consistent for all languages and here is my proposal.
>>>>
>>>> For the released version of SDKs, the default tag will be version number.
>>>> (ex: 2.17.0)
>>>> For the unreleased version of SDKs, the default tag will be version number
>>>> + '.dev'. (ex: 2.18.0.dev)
>>>>
>>>> The default tag is used 1). when we build docker images without specifying
>>>> a tag. 2) when we run a job with runners running on dockers with default
>>>> docker images.
>>>>
>>>> Additionally, Beam will always lookup images locally before pulling one
>>>> from remote, so the images built locally will not be overwritten by remote
>>>> ones.
>>>>
>>>> This has a minor downside for the users who are using unreleased versions.
>>>> They need to build a local image first before using docker to run. I will
>>>> add a clear error message to show the problem and add a link to a
>>>> documentation of how to create images.
>>>>
>>>> I would like to collect feedback from whoever uses dockers. Does this
>>>> sound good? Is there anything I am missing?
>>>>
>>>> Thanks,
>>>> Hannah









Re: [PROPOSAL] Transition released containers to the official ASF dockerhub organization

2020-01-10 Thread Robert Bradshaw
One downside is that, unlike many of these projects, we release a
dozen or so containers. Is there exactly (and only) one level of
namespacing/nesting we can leverage here? (This isn't a blocker, but
something to consider.)

On Fri, Jan 10, 2020 at 2:06 PM Hannah Jiang  wrote:
>
> Thanks Ahmet for proposing it.
> I will take it and work towards v2.19.
>
> Hannah
>
> On Fri, Jan 10, 2020 at 1:50 PM Kyle Weaver  wrote:
>>
>> It'd be nice to have the clout/official sheen of apache attached to our 
>> containers. Although getting the required permissions might add some small 
>> overhead to the release process. For example, yesterday, when we needed to 
>> create new repositories (not just update existing ones), since we have 
>> top-level ownership of the apachebeam organization, it was quick and easy to 
>> add them. I imagine we'd have had to get approval from someone outside the 
>> project to do that under the apache org. But this won't need to happen very 
>> often, so it's probably not that big a deal.
>>
>> On Fri, Jan 10, 2020 at 1:40 PM Ahmet Altay  wrote:
>>>
>>> Hi all,
>>>
>>> I saw recent progress on the containers and wanted to bring this question 
>>> to the attention of the dev list.
>>>
>>> Would it be possible to use the official ASF dockerhub organization for new 
>>> Beam container releases? Concretely, starting from 2.19 could we release 
>>> Beam containers to https://hub.docker.com/u/apache instead of 
>>> https://hub.docker.com/u/apachebeam ?
>>>
>>> Ahmet


Re: [PROPOSAL] Transition released containers to the official ASF dockerhub organization

2020-01-10 Thread Hannah Jiang
Thanks Ahmet for proposing it.
I will take it and work towards v2.19.

Hannah

On Fri, Jan 10, 2020 at 1:50 PM Kyle Weaver  wrote:

> It'd be nice to have the clout/official sheen of apache attached to our
> containers. Although getting the required permissions might add some small
> overhead to the release process. For example, yesterday, when we needed to
> create new repositories (not just update existing ones), since we have
> top-level ownership of the apachebeam organization, it was quick and easy
> to add them. I imagine we'd have had to get approval from someone outside
> the project to do that under the apache org. But this won't need to happen
> very often, so it's probably not that big a deal.
>
> On Fri, Jan 10, 2020 at 1:40 PM Ahmet Altay  wrote:
>
>> Hi all,
>>
>> I saw recent progress on the containers and wanted to bring this question
>> to the attention of the dev list.
>>
>> Would it be possible to use the official ASF dockerhub organization for
>> new Beam container releases? Concretely, starting from 2.19 could we
>> release Beam containers to https://hub.docker.com/u/apache instead of
>> https://hub.docker.com/u/apachebeam ?
>>
>> Ahmet
>>
>


Re: Cleaning up SDK docker image tagging

2020-01-10 Thread Hannah Jiang
Thanks for pointing me to the thread. I agree with what was discussed there;
let's keep it as it is.
I will proceed with cleaning up tags only.
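
To make the tag rule from the proposal quoted below concrete, here is a small
sketch. It is only an illustration under stated assumptions: the function
name default_tag and the idea of normalizing the old -SNAPSHOT suffix are
mine, not existing Beam code.

```python
def default_tag(version, released):
    """Return the default docker image tag for an SDK version.

    Released SDKs use the bare version number (e.g. 2.17.0); unreleased
    SDKs use the version number plus '.dev' (e.g. 2.18.0.dev), for every
    language.
    """
    base = version
    # Strip suffixes that the per-language conventions used before.
    for suffix in ("-SNAPSHOT", ".dev"):
        if base.endswith(suffix):
            base = base[: -len(suffix)]
    return base if released else base + ".dev"
```

For example, default_tag("2.17.0", released=True) gives "2.17.0", and
default_tag("2.18.0-SNAPSHOT", released=False) gives "2.18.0.dev".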

On Fri, Jan 10, 2020 at 12:48 PM Kyle Weaver  wrote:

> > Shall we ALSO tag the image with git commit version for local build to
> > keep track of obsolete images.
>
> This would mean we would have to be able to access the git commit from the
> source, which might not be trivial (right now the Beam version, e.g.
> "2.18.0.dev", is hard-coded in some properties files). And the way it is
> now keeps things simple and easy to read.
>
> > we can assume the images with the same tag are always identical
>
> So far that's always been the case, but in case there are problems with
> the published container images and we have to update them, we want to make
> sure everyone gets the most up-to-date image [1].
>
> > 1. pull only when needed, so reduce unnecessary traffic for users.
>
> `docker pull` starts by checking if the local image is up-to-date with the
> remote, and most of the time it will be, so no more network usage beyond
> that is needed.
>
> > In case a user customize the image and rebuild it with the default tag
>
> The user should never need to build an image with the default release tag
> (e.g. 2.17.0). They will use either the .dev tag (the default) or even
> better, their own custom tag. (I suppose we can't stop users from manually
> tagging their own container with the release tag, but most people should
> know better.)
>
> > make it consistent for all languages
>
> Forgot to reply to this point -- I agree, +1.
>
> [1]
> https://lists.apache.org/thread.html/7b5599f142785e616a1e943ff1c3da5213de370ed193373e01991bb6%40%3Cdev.beam.apache.org%3E
>
> On Fri, Jan 10, 2020 at 9:52 AM Hannah Jiang 
> wrote:
>
>> >> This has a minor downside for the users who are using unreleased
>> versions. They need to build a local image first before using docker to run.
>> > Isn't that the current behavior?
>>
>> Our current behavior is pull & run. So in case both local and remote
>> images are available, the local image is getting overwritten by the remote
>> image.
>> A New approach will do run only, which will pull remote images only when
>> local images are not available. Since we don't deploy different images with
>> the same tag, we can assume the images with the same tag are always
>> identical, unless a user customized it with the same tag.
>>
>> This has the following advantages.
>> 1. pull only when needed, so reduce unnecessary traffic for users.
>> 2. In case a user customize the image and rebuild it with the default
>> tag, the local customized image is used as expected. With pull & run,
>> remote image, instead of the customized image, is used.
>>
>> On Thu, Jan 9, 2020 at 4:54 PM Kyle Weaver  wrote:
>>
>>> > This has a minor downside for the users who are using unreleased
>>> versions. They need to build a local image first before using docker to run.
>>>
>>> Isn't that the current behavior?
>>>
>>> On Thu, Jan 9, 2020 at 4:48 PM Hannah Jiang 
>>> wrote:
>>>
>>>> Hi Community
>>>>
>>>> Now we are using different default tags for Python(version or
>>>> version.dev), Java(version-SNAPSHOT) and Go(latest). I would like to
>>>> clean it up and make it consistent for all languages and here is my
>>>> proposal.
>>>>
>>>> For the released version of SDKs, the default tag will be version
>>>> number. (ex: 2.17.0)
>>>> For the unreleased version of SDKs, the default tag will be version
>>>> number + '.dev'. (ex: 2.18.0.dev)
>>>>
>>>> The default tag is used 1). when we build docker images without
>>>> specifying a tag. 2) when we run a job with runners running on dockers with
>>>> default docker images.
>>>>
>>>> Additionally, Beam will always lookup images locally before pulling one
>>>> from remote, so the images built locally will not be overwritten by remote
>>>> ones.
>>>>
>>>> This has a minor downside for the users who are using unreleased
>>>> versions. They need to build a local image first before using docker to
>>>> run. I will add a clear error message to show the problem and add a link to
>>>> a documentation of how to create images.
>>>>
>>>> I would like to collect feedback from whoever uses dockers. Does this
>>>> sound good? Is there anything I am missing?
>>>>
>>>> Thanks,
>>>> Hannah










Re: Jenkins jobs not running for my PR 10438

2020-01-10 Thread Andrew Pilloud
Done.

On Fri, Jan 10, 2020 at 12:59 PM Tomo Suzuki  wrote:

> Hi Beam developers,
>
> I would appreciate it if a committer could trigger the precommit build for
> https://github.com/apache/beam/pull/10554.
>
> In addition to the normal precommit checks, I would like the following to run:
> Run Java PostCommit
> Run Java HadoopFormatIO Performance Test
> Run BigQueryIO Streaming Performance Test Java
> Run Dataflow ValidatesRunner
> Run Spark ValidatesRunner
> Run SQL Postcommit
>
> Regards,
> Tomo
>


Re: [PROPOSAL] Transition released containers to the official ASF dockerhub organization

2020-01-10 Thread Kyle Weaver
It'd be nice to have the clout/official sheen of apache attached to our
containers. Although getting the required permissions might add some small
overhead to the release process. For example, yesterday, when we needed to
create new repositories (not just update existing ones), since we have
top-level ownership of the apachebeam organization, it was quick and easy
to add them. I imagine we'd have had to get approval from someone outside
the project to do that under the apache org. But this won't need to happen
very often, so it's probably not that big a deal.

On Fri, Jan 10, 2020 at 1:40 PM Ahmet Altay  wrote:

> Hi all,
>
> I saw recent progress on the containers and wanted to bring this question
> to the attention of the dev list.
>
> Would it be possible to use the official ASF dockerhub organization for
> new Beam container releases? Concretely, starting from 2.19 could we
> release Beam containers to https://hub.docker.com/u/apache instead of
> https://hub.docker.com/u/apachebeam ?
>
> Ahmet
>


[PROPOSAL] Transition released containers to the official ASF dockerhub organization

2020-01-10 Thread Ahmet Altay
Hi all,

I saw recent progress on the containers and wanted to bring this question
to the attention of the dev list.

Would it be possible to use the official ASF dockerhub organization for new
Beam container releases? Concretely, starting from 2.19 could we release
Beam containers to https://hub.docker.com/u/apache instead of
https://hub.docker.com/u/apachebeam ?

Ahmet


Re: Jenkins jobs not running for my PR 10438

2020-01-10 Thread Tomo Suzuki
Thank you, Andrew!

On Fri, Jan 10, 2020 at 15:59 Tomo Suzuki  wrote:

> Hi Beam developers,
>
> I would appreciate it if a committer could trigger the precommit build for
> https://github.com/apache/beam/pull/10554.
>
> In addition to the normal precommit checks, I would like the following to run:
> Run Java PostCommit
> Run Java HadoopFormatIO Performance Test
> Run BigQueryIO Streaming Performance Test Java
> Run Dataflow ValidatesRunner
> Run Spark ValidatesRunner
> Run SQL Postcommit
>
> Regards,
> Tomo
>
-- 
Regards,
Tomo


Re: Jenkins jobs not running for my PR 10438

2020-01-10 Thread Tomo Suzuki
Hi Beam developers,

I would appreciate it if a committer could trigger the precommit build for
https://github.com/apache/beam/pull/10554.

In addition to the normal precommit checks, I would like the following to run:
Run Java PostCommit
Run Java HadoopFormatIO Performance Test
Run BigQueryIO Streaming Performance Test Java
Run Dataflow ValidatesRunner
Run Spark ValidatesRunner
Run SQL Postcommit

Regards,
Tomo


Re: [ANNOUNCE] Beam 2.17.0 Released!

2020-01-10 Thread Ahmet Altay
Thank you Mikhail!

On Fri, Jan 10, 2020 at 12:40 PM Kyle Weaver  wrote:

> Hooray! Thanks to Mikhail and everyone else who contributed.
>
> On Fri, Jan 10, 2020 at 10:23 AM Maximilian Michels 
> wrote:
>
>> At last :) Thank you for making it happen Mikhail! Also thanks to
>> everyone else who tested the release candidate.
>>
>> Cheers,
>> Max
>>
>> On 10.01.20 19:01, Mikhail Gryzykhin wrote:
>> > The Apache Beam team is pleased to announce the release of version
>> 2.17.0.
>> >
>> > Apache Beam is an open source unified programming model to define and
>> > execute data processing pipelines, including ETL, batch and stream
>> > (continuous) processing. See https://beam.apache.org
>> > 
>> >
>> > You can download the release here:
>> >
>> > https://beam.apache.org/get-started/downloads/
>> >
>> > This release includes bug fixes, features, and improvements detailed on
>> > the Beam blog: https://beam.apache.org/blog/2020/01/06/beam-2.17.0.html
>> > 
>> >
>> > Thanks to everyone who contributed to this release, and we hope you
>> > enjoy using Beam 2.17.0.
>>
>


Re: Cleaning up SDK docker image tagging

2020-01-10 Thread Kyle Weaver
> Shall we ALSO tag the image with git commit version for local build to
> keep track of obsolete images.

This would mean we would have to be able to access the git commit from the
source, which might not be trivial (right now the Beam version, e.g.
"2.18.0.dev", is hard-coded in some properties files). And the way it is now
keeps things simple and easy to read.

> we can assume the images with the same tag are always identical

So far that's always been the case, but in case there are problems with the
published container images and we have to update them, we want to make sure
everyone gets the most up-to-date image [1].

> 1. pull only when needed, so reduce unnecessary traffic for users.

`docker pull` starts by checking if the local image is up-to-date with the
remote, and most of the time it will be, so no more network usage beyond
that is needed.

> In case a user customize the image and rebuild it with the default tag

The user should never need to build an image with the default release tag
(e.g. 2.17.0). They will use either the .dev tag (the default) or even
better, their own custom tag. (I suppose we can't stop users from manually
tagging their own container with the release tag, but most people should
know better.)

> make it consistent for all languages

Forgot to reply to this point -- I agree, +1.

[1]
https://lists.apache.org/thread.html/7b5599f142785e616a1e943ff1c3da5213de370ed193373e01991bb6%40%3Cdev.beam.apache.org%3E

On Fri, Jan 10, 2020 at 9:52 AM Hannah Jiang  wrote:

> >> This has a minor downside for the users who are using unreleased
> versions. They need to build a local image first before using docker to run.
> > Isn't that the current behavior?
>
> Our current behavior is pull & run. So in case both local and remote
> images are available, the local image is getting overwritten by the remote
> image.
> A New approach will do run only, which will pull remote images only when
> local images are not available. Since we don't deploy different images with
> the same tag, we can assume the images with the same tag are always
> identical, unless a user customized it with the same tag.
>
> This has the following advantages.
> 1. pull only when needed, so reduce unnecessary traffic for users.
> 2. In case a user customize the image and rebuild it with the default tag,
> the local customized image is used as expected. With pull & run, remote
> image, instead of the customized image, is used.
>
> On Thu, Jan 9, 2020 at 4:54 PM Kyle Weaver  wrote:
>
>> > This has a minor downside for the users who are using unreleased
>> versions. They need to build a local image first before using docker to run.
>>
>> Isn't that the current behavior?
>>
>> On Thu, Jan 9, 2020 at 4:48 PM Hannah Jiang 
>> wrote:
>>
>>> Hi Community
>>>
>>> Now we are using different default tags for Python(version or
>>> version.dev), Java(version-SNAPSHOT) and Go(latest). I would like to
>>> clean it up and make it consistent for all languages and here is my
>>> proposal.
>>>
>>> For the released version of SDKs, the default tag will be version
>>> number. (ex: 2.17.0)
>>> For the unreleased version of SDKs, the default tag will be version
>>> number + '.dev'. (ex: 2.18.0.dev)
>>>
>>> The default tag is used 1). when we build docker images without
>>> specifying a tag. 2) when we run a job with runners running on dockers with
>>> default docker images.
>>>
>>> Additionally, Beam will always lookup images locally before pulling one
>>> from remote, so the images built locally will not be overwritten by remote
>>> ones.
>>>
>>> This has a minor downside for the users who are using unreleased
>>> versions. They need to build a local image first before using docker to
>>> run. I will add a clear error message to show the problem and add a link to
>>> a documentation of how to create images.
>>>
>>> I would like to collect feedback from whoever uses dockers. Does this
>>> sound good? Is there anything I am missing?
>>>
>>> Thanks,
>>> Hannah
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>


Re: Cleaning up SDK docker image tagging

2020-01-10 Thread Valentyn Tymofieiev
Hi Hannah,

+1 to standardize .dev suffixes across all SDKs.

Whether to pull or not to pull was recently discussed in [1]. My personal
preference would be to pull images before starting the containers, and to
instruct users who want to customize containers to tag them with a new
tag, such as :customized-2.17.0. If we need to revisit this conversation,
consider continuing it in [1] to keep the conversation in one place.
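
To make the two policies under discussion easier to compare, here is a small
sketch. It assumes the docker CLI is on PATH and uses only `docker image
inspect` and `docker pull`; the function names and the always_pull flag are
made up for illustration and are not how any Beam runner is implemented.

```python
import subprocess


def image_exists_locally(image):
    """Return True if `docker image inspect` finds the image locally."""
    result = subprocess.run(
        ["docker", "image", "inspect", image],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def pull(image):
    """Pull the image; docker itself skips layers that are up to date."""
    subprocess.run(["docker", "pull", image], check=True)


def prepare_image(image, always_pull,
                  exists=image_exists_locally, pull_image=pull):
    """Apply one of the two policies before starting a container.

    always_pull=True  -> "pull & run": the remote image always wins.
    always_pull=False -> "run only": pull only when nothing is local.
    """
    if always_pull or not exists(image):
        pull_image(image)
        return "pulled"
    return "used local"
```

With always_pull=True a locally customized image that reuses a published tag
is replaced by the remote one, which is why a separate tag such as
:customized-2.17.0 is the safer choice for customized containers.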

[1]
https://lists.apache.org/thread.html/07131e314e229ec60100eaa2c0cf6dfc206bf2b0f78c3cee9ebb0bda@%3Cdev.beam.apache.org%3E

Thanks,
Valentyn

On Fri, Jan 10, 2020 at 9:52 AM Hannah Jiang  wrote:

> >> This has a minor downside for the users who are using unreleased
> versions. They need to build a local image first before using docker to run.
> > Isn't that the current behavior?
>
> Our current behavior is pull & run. So in case both local and remote
> images are available, the local image is getting overwritten by the remote
> image.
> A New approach will do run only, which will pull remote images only when
> local images are not available. Since we don't deploy different images with
> the same tag, we can assume the images with the same tag are always
> identical, unless a user customized it with the same tag.
>
> This has the following advantages.
> 1. pull only when needed, so reduce unnecessary traffic for users.
>
> 2. In case a user customize the image and rebuild it with the default tag,
> the local customized image is used as expected. With pull & run, remote
> image, instead of the customized image, is used.
>
> On Thu, Jan 9, 2020 at 4:54 PM Kyle Weaver  wrote:
>
>> > This has a minor downside for the users who are using unreleased
>> versions. They need to build a local image first before using docker to run.
>>
>> Isn't that the current behavior?
>>
>> On Thu, Jan 9, 2020 at 4:48 PM Hannah Jiang 
>> wrote:
>>
>>> Hi Community
>>>
>>> Now we are using different default tags for Python(version or
>>> version.dev), Java(version-SNAPSHOT) and Go(latest). I would like to
>>> clean it up and make it consistent for all languages and here is my
>>> proposal.
>>>
>>> For the released version of SDKs, the default tag will be version
>>> number. (ex: 2.17.0)
>>> For the unreleased version of SDKs, the default tag will be version
>>> number + '.dev'. (ex: 2.18.0.dev)
>>>
>>> The default tag is used 1). when we build docker images without
>>> specifying a tag. 2) when we run a job with runners running on dockers with
>>> default docker images.
>>>
>>> Additionally, Beam will always lookup images locally before pulling one
>>> from remote, so the images built locally will not be overwritten by remote
>>> ones.
>>>
>>> This has a minor downside for the users who are using unreleased
>>> versions. They need to build a local image first before using docker to
>>> run. I will add a clear error message to show the problem and add a link to
>>> a documentation of how to create images.
>>>
>>> I would like to collect feedback from whoever uses dockers. Does this
>>> sound good? Is there anything I am missing?
>>>
>>> Thanks,
>>> Hannah
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>


Re: [ANNOUNCE] Beam 2.17.0 Released!

2020-01-10 Thread Kyle Weaver
Hooray! Thanks to Mikhail and everyone else who contributed.

On Fri, Jan 10, 2020 at 10:23 AM Maximilian Michels  wrote:

> At last :) Thank you for making it happen Mikhail! Also thanks to
> everyone else who tested the release candidate.
>
> Cheers,
> Max
>
> On 10.01.20 19:01, Mikhail Gryzykhin wrote:
> > The Apache Beam team is pleased to announce the release of version
> 2.17.0.
> >
> > Apache Beam is an open source unified programming model to define and
> > execute data processing pipelines, including ETL, batch and stream
> > (continuous) processing. See https://beam.apache.org
> > 
> >
> > You can download the release here:
> >
> > https://beam.apache.org/get-started/downloads/
> >
> > This release includes bug fixes, features, and improvements detailed on
> > the Beam blog: https://beam.apache.org/blog/2020/01/06/beam-2.17.0.html
> > 
> >
> > Thanks to everyone who contributed to this release, and we hope you
> > enjoy using Beam 2.17.0.
>


Re: [ANNOUNCE] Beam 2.17.0 Released!

2020-01-10 Thread Ankur Goenka
Thanks for being persistent and powering through all the issues.

On Fri, Jan 10, 2020 at 10:23 AM Maximilian Michels  wrote:

> At last :) Thank you for making it happen Mikhail! Also thanks to
> everyone else who tested the release candidate.
>
> Cheers,
> Max
>
> On 10.01.20 19:01, Mikhail Gryzykhin wrote:
> > The Apache Beam team is pleased to announce the release of version
> 2.17.0.
> >
> > Apache Beam is an open source unified programming model to define and
> > execute data processing pipelines, including ETL, batch and stream
> > (continuous) processing. See https://beam.apache.org
> > 
> >
> > You can download the release here:
> >
> > https://beam.apache.org/get-started/downloads/
> >
> > This release includes bug fixes, features, and improvements detailed on
> > the Beam blog: https://beam.apache.org/blog/2020/01/06/beam-2.17.0.html
> > 
> >
> > Thanks to everyone who contributed to this release, and we hope you
> > enjoy using Beam 2.17.0.
>


Re: release scripts as interactive notebooks?

2020-01-10 Thread Kenneth Knowles
I think we need to balance where we are with where we want to be. There are
a couple layers of abstraction that are all independently useful. We have
to acknowledge that currently the steps in the guide and scripts don't
quite match and also are not quite right, and that things go wrong and
change. So I would suggest layering the functionality and keeping it
working at all layers, never treating the topmost layer as a black box.

1. simple readable commands to perform the release, like Robert said
2. scripts that can be targeted at a git commit in the current checkout,
like usual context-sensitive git commands
3. scripts that do the checking out for you, like the current scripts,
operating on tags or commit hashes specified, context-insensitive

I think useful properties of the layers are:

 - use git in normal ways (don't do the maven/gradle release plugin thing
of committing then reverting)
 - each step should be non-interactive (take parameters, don't ask
interactive questions)
 - have a dry run mode
 - be idempotent (implied: clean up completely; use temp dirs; etc)
 - make changes to the release branch and master through PRs whenever
possible; avoid blind pushes
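
To make a few of these properties concrete, here is a minimal sketch of
what a single step could look like. It is only an illustration, not the
existing release scripts: the name run_step and its parameters are
hypothetical, and it covers just the non-interactive, dry-run and
clean-up properties.

```python
import shutil
import subprocess
import tempfile


def run_step(commands, dry_run=True):
    """Run a list of commands in a throwaway work directory.

    Hypothetical helper: everything is passed in as parameters, so the
    step never has to ask an interactive question.
    Returns the commands that were run (or would be run in a dry run).
    """
    work_dir = tempfile.mkdtemp(prefix="release-step-")
    planned = []
    try:
        for cmd in commands:
            planned.append(cmd)
            if dry_run:
                # Dry run: only show what would be executed.
                print("[dry-run] " + " ".join(cmd))
            else:
                subprocess.run(cmd, cwd=work_dir, check=True)
    finally:
        # Clean up completely so a rerun starts from the same state.
        shutil.rmtree(work_dir, ignore_errors=True)
    return planned
```

Calling run_step([["git", "status"]]) only prints the command, because
dry_run defaults to True; a real run has to be requested explicitly with
dry_run=False.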

Kenn

On Fri, Jan 10, 2020 at 10:09 AM Robert Bradshaw 
wrote:

> +1 to automating more, at least the creation and validation of release
> artifacts should all be completely automated. However signing should
> still be done by an individual--that's not something that
> (semantically) should be automated away.
>
> As much as I am a fan of jupyter notebooks, I think the simplicity of
> these being flat text files has value. I would often be running these
> commands over an ssh terminal and having to start up a remote server
> and forward ports would be a pain (and a hassle even locally that
> might not be worth it). That being said, IMHO to continue to be
> readable such scripts should contain as little logic as possible, and
> be just a listing of commands.
>
> On Fri, Jan 10, 2020 at 9:44 AM Luke Cwik  wrote:
> >
> > I was always under the impression that we wanted artifact creation,
> > signing and staging for voting to be "automated" in some way. I
> > believe we could have a Jenkins job do this if we had a good way to
> > transfer the release manager's signing keys to a Jenkins worker (via cloud
> > key management system?). So should we focus on better and more
> > reliable release automation instead of making the release scripts more
> > interactive?
> >
> > On Fri, Jan 10, 2020 at 9:37 AM Udi Meiri  wrote:
> >>
> >> What does the community think about converting our release scripts to
> >> be Jupyter notebooks using bash_kernel?
> >>
> >> Since these scripts frequently fail (especially for first time
> >> releasers), we often need to rerun parts manually. The notebook format
> >> lets you do that.
> >>
> >> Certain steps require verification/inspection, such as before pushing
> >> commits. This is naturally done by splitting into multiple notebook cells.
> >>
> >> The notebook format also lends itself well to inline documentation and
> >> on-the-fly modification.
> >>
> >>
>


Re: [ANNOUNCE] Beam 2.17.0 Released!

2020-01-10 Thread Maximilian Michels
At last :) Thank you for making it happen Mikhail! Also thanks to 
everyone else who tested the release candidate.


Cheers,
Max

On 10.01.20 19:01, Mikhail Gryzykhin wrote:

The Apache Beam team is pleased to announce the release of version 2.17.0.

Apache Beam is an open source unified programming model to define and
execute data processing pipelines, including ETL, batch and stream
(continuous) processing. See https://beam.apache.org 



You can download the release here:

https://beam.apache.org/get-started/downloads/

This release includes bug fixes, features, and improvements detailed on
the Beam blog: https://beam.apache.org/blog/2020/01/06/beam-2.17.0.html 



Thanks to everyone who contributed to this release, and we hope you 
enjoy using Beam 2.17.0.


Re: release scripts as interactive notebooks?

2020-01-10 Thread Robert Bradshaw
+1 to automating more, at least the creation and validation of release
artifacts should all be completely automated. However signing should
still be done by an individual--that's not something that
(semantically) should be automated away.

As much as I am a fan of jupyter notebooks, I think the simplicity of
these being flat text files has value. I would often be running these
commands over an ssh terminal and having to start up a remote server
and forward ports would be a pain (and a hassle even locally that
might not be worth it). That being said, IMHO to continue to be
readable such scripts should contain as little logic as possible, and
be just a listing of commands.

On Fri, Jan 10, 2020 at 9:44 AM Luke Cwik  wrote:
>
> I was always under the impression that we wanted artifact creation, signing
> and staging for voting to be "automated" in some way. I believe we could
> have a Jenkins job do this if we had a good way to transfer the release
> manager's signing keys to a Jenkins worker (via cloud key management system?).
> So should we focus on better and more reliable release automation
> instead of making the release scripts more interactive?
>
> On Fri, Jan 10, 2020 at 9:37 AM Udi Meiri  wrote:
>>
>> What does the community think about converting our release scripts to
>> be Jupyter notebooks using bash_kernel?
>>
>> Since these scripts frequently fail (especially for first time releasers), 
>> we often need to rerun parts manually. The notebook format lets you do that.
>>
>> Certain steps require verification/inspection, such as before pushing 
>> commits. This is naturally done by splitting into multiple notebook cells.
>>
>> The notebook format also lends itself well to inline documentation and 
>> on-the-fly modification.
>>
>>


[ANNOUNCE] Beam 2.17.0 Released!

2020-01-10 Thread Mikhail Gryzykhin
The Apache Beam team is pleased to announce the release of version 2.17.0.

Apache Beam is an open source unified programming model to define and
execute data processing pipelines, including ETL, batch and stream
(continuous) processing. See https://beam.apache.org

You can download the release here:

https://beam.apache.org/get-started/downloads/

This release includes bug fixes, features, and improvements detailed on
the Beam blog: https://beam.apache.org/blog/2020/01/06/beam-2.17.0.html

Thanks to everyone who contributed to this release, and we hope you enjoy
using Beam 2.17.0.


Re: Cleaning up SDK docker image tagging

2020-01-10 Thread Hannah Jiang
>> This has a minor downside for the users who are using unreleased
>> versions. They need to build a local image first before using docker to run.
> Isn't that the current behavior?

Our current behavior is pull & run. So when both local and remote images
are available, the local image gets overwritten by the remote image.
The new approach will do run only, which pulls remote images only when
local images are not available. Since we don't deploy different images with
the same tag, we can assume that images with the same tag are always
identical, unless a user customized one and kept the same tag.

This has the following advantages:
1. Images are pulled only when needed, which reduces unnecessary traffic for
users.
2. If a user customizes the image and rebuilds it with the default tag,
the local customized image is used as expected. With pull & run, the remote
image is used instead of the customized one.
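
As a rough sketch of the lookup order described above (local first, pull
only on a miss): the function name, the set of local images and the pull
callback are all hypothetical stand-ins, not Beam or docker APIs.

```python
def resolve_image(image, local_images, pull):
    """Decide where an image comes from under the "run only" behavior.

    image:        full image name including tag (illustrative values only)
    local_images: set of image names already present locally
    pull:         callback that fetches an image from the remote registry
    Returns "local" if the local copy is used, "pulled" otherwise.
    """
    if image in local_images:
        # A local (possibly customized) image is never overwritten.
        return "local"
    # Only reach out to the remote registry when nothing is available locally.
    pull(image)
    local_images.add(image)
    return "pulled"
```

With pull & run, the pull callback would be invoked unconditionally,
which is what replaces a customized local image with the remote one.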

On Thu, Jan 9, 2020 at 4:54 PM Kyle Weaver  wrote:

> > This has a minor downside for the users who are using unreleased
> > versions. They need to build a local image first before using docker to run.
>
> Isn't that the current behavior?
>
> On Thu, Jan 9, 2020 at 4:48 PM Hannah Jiang 
> wrote:
>
>> Hi Community
>>
>> Now we are using different default tags for Python(version or version.dev),
>> Java(version-SNAPSHOT) and Go(latest). I would like to clean it up and make
>> it consistent for all languages and here is my proposal.
>>
>> For the released version of SDKs, the default tag will be version number.
>> (ex: 2.17.0)
>> For the unreleased version of SDKs, the default tag will be version
>> number + '.dev'. (ex: 2.18.0.dev)
>>
>> The default tag is used 1). when we build docker images without
>> specifying a tag. 2) when we run a job with runners running on dockers with
>> default docker images.
>>
>> Additionally, Beam will always lookup images locally before pulling one
>> from remote, so the images built locally will not be overwritten by remote
>> ones.
>>
>> This has a minor downside for the users who are using unreleased
>> versions. They need to build a local image first before using docker to
>> run. I will add a clear error message to show the problem and add a link to
>> a documentation of how to create images.
>>
>> I would like to collect feedback from whoever uses dockers. Does this
>> sound good? Is there anything I am missing?
>>
>> Thanks,
>> Hannah
>>


Re: Request for new dockerhub repos

2020-01-10 Thread Udi Meiri
Thank you, the pushes were successful.

On Fri, Jan 10, 2020 at 8:47 AM Hannah Jiang  wrote:

> Hi Udi
>
> The repositories are created. Were you added as a maintainer? If not, we
> need your docker hub user ID.
>
> Thanks,
> Hannah
>
> On Thu, Jan 9, 2020 at 5:48 PM Udi Meiri  wrote:
>
>> Hi,
>> As part of the 2.18 release, we're adding 3 additional containers for
>> Flink.
>> I have write access but since I am not an owner I cannot create new repos.
>>
>> Could someone with access add these?
>>
>> flink1.7_job_server
>> flink1.8_job_server
>> flink1.9_job_server
>>
>> (go to https://hub.docker.com/repositories and select apachebeam from
>> the dropdown)
>>
>> Thank you
>>
>


Re: release scripts as interactive notebooks?

2020-01-10 Thread Luke Cwik
I was always under the impression that we wanted artifact creation, signing
and staging for voting to be "automated" in some way. I believe we could
have a Jenkins job do this if we had a good way to transfer the release
manager's signing keys to a Jenkins worker (via cloud key management
system?). So should we focus on better and more reliable release automation
instead of making the release scripts more interactive?

On Fri, Jan 10, 2020 at 9:37 AM Udi Meiri  wrote:

> What does the community think about converting our release scripts to
> be Jupyter notebooks using bash_kernel?
>
> Since these scripts frequently fail (especially for first time releasers),
> we often need to rerun parts manually. The notebook format lets you do that.
>
> Certain steps require verification/inspection, such as before pushing
> commits. This is naturally done by splitting into multiple notebook cells.
>
> The notebook format also lends itself well to inline documentation and
> on-the-fly modification.
>
>
>


release scripts as interactive notebooks?

2020-01-10 Thread Udi Meiri
What does the community think about converting our release scripts to
be Jupyter notebooks using bash_kernel?

Since these scripts frequently fail (especially for first time releasers),
we often need to rerun parts manually. The notebook format lets you do that.

Certain steps require verification/inspection, such as before pushing
commits. This is naturally done by splitting into multiple notebook cells.

The notebook format also lends itself well to inline documentation and
on-the-fly modification.


Re: Cleaning up SDK docker image tagging

2020-01-10 Thread Hannah Jiang
> For the unreleased version of SDKs, the default tag will be version
> number + '.dev'. (ex: 2.18.0.dev)
>> Shall we ALSO tag the image with git commit version for local build to
>> keep track of obsolete images.

I should have been clearer.
This is about release images. The dev images are only available locally
when a developer builds them from a dev version.
Release images are deployed to docker hub each time we release a new
version of Beam.

Deploying snapshot images is on our to-do list; it will be tackled
separately later.
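
For reference, the default tag rule quoted at the top of this message
boils down to something like the sketch below. The function name and the
released flag are made up for illustration; how a build actually knows
whether it is a released version is not shown here.

```python
def default_tag(version, released):
    """Return the default docker tag for an SDK version.

    Released versions use the bare version number; unreleased (dev)
    versions get a '.dev' suffix, for every language.
    """
    return version if released else version + ".dev"
```

So default_tag("2.17.0", released=True) gives "2.17.0" and
default_tag("2.18.0", released=False) gives "2.18.0.dev".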

On Thu, Jan 9, 2020 at 5:09 PM Ankur Goenka  wrote:

> >> For the released version of SDKs, the default tag will be version
> >> number. (ex: 2.17.0)
> +1
>
> >> For the unreleased version of SDKs, the default tag will be version
> >> number + '.dev'. (ex: 2.18.0.dev)
> Shall we ALSO tag the image with git commit version for local build to
> keep track of obsolete images.
>
> On Thu, Jan 9, 2020 at 4:54 PM Kyle Weaver  wrote:
>
>> > This has a minor downside for the users who are using unreleased
>> > versions. They need to build a local image first before using docker to run.
>>
>> Isn't that the current behavior?
>>
>> On Thu, Jan 9, 2020 at 4:48 PM Hannah Jiang 
>> wrote:
>>
>>> Hi Community
>>>
>>> Now we are using different default tags for Python(version or
>>> version.dev), Java(version-SNAPSHOT) and Go(latest). I would like to
>>> clean it up and make it consistent for all languages and here is my
>>> proposal.
>>>
>>> For the released version of SDKs, the default tag will be version
>>> number. (ex: 2.17.0)
>>> For the unreleased version of SDKs, the default tag will be version
>>> number + '.dev'. (ex: 2.18.0.dev)
>>>
>>> The default tag is used 1). when we build docker images without
>>> specifying a tag. 2) when we run a job with runners running on dockers with
>>> default docker images.
>>>
>>> Additionally, Beam will always lookup images locally before pulling one
>>> from remote, so the images built locally will not be overwritten by remote
>>> ones.
>>>
>>> This has a minor downside for the users who are using unreleased
>>> versions. They need to build a local image first before using docker to
>>> run. I will add a clear error message to show the problem and add a link to
>>> a documentation of how to create images.
>>>
>>> I would like to collect feedback from whoever uses dockers. Does this
>>> sound good? Is there anything I am missing?
>>>
>>> Thanks,
>>> Hannah
>>>


Re: Poor Python 3.x performance on Dataflow?

2020-01-10 Thread Valentyn Tymofieiev
Thanks, Kamil. I self-assigned the issue, but if anyone else is interested,
feel free to take a look in parallel and post your findings on the Jira.

On Fri, Jan 10, 2020 at 4:29 AM Kamil Wasilewski <
kamil.wasilew...@polidea.com> wrote:

> Our first Python3 performance test has just been implemented and we have
> just started gathering results. Here[1] you can find dashboards with a
> side-by-side comparison.
> I also opened a Jira ticket to investigate the difference [2]. Anyone,
> please feel free to assign it to yourself.
>
> Thanks,
> Kamil
>
> [1]
> https://apache-beam-testing.appspot.com/explore?dashboard=5678187241537536
> [2] https://issues.apache.org/jira/browse/BEAM-9085
>
> On Mon, Dec 9, 2019 at 8:38 PM Valentyn Tymofieiev 
> wrote:
>
>> For now we should run Py3 and Py2 tests alongside each other to get a
>> side-by-side comparison. I suggest we open a Jira ticket to investigate the
>> difference in performance. We have limited performance test coverage on
>> Python 3 in Beam, so more Py3 tests would help a lot here, thanks for
>> adding them.
>>
>> On Fri, Dec 6, 2019 at 9:43 AM Robert Bradshaw 
>> wrote:
>>
>>> This is very surprising--I would expect the times to be quite similar. Do
>>> you have profiles for where the (difference in) time is spent? With
>>> differences like these, I wonder if there are issues with container
>>> setup (e.g. some things not being installed or cached) for Python 3.
>>>
>>> On Fri, Dec 6, 2019 at 9:06 AM Kamil Wasilewski
>>>  wrote:
>>> >
>>> > Hi all,
>>> >
>>> > Python 2.7 won't be maintained past 2020 and that's why we want to
>>> > migrate all Python performance tests in Beam from Python 2.7 to Python 3.7.
>>> > However, I was surprised by seeing that after switching Dataflow tests to
>>> > Python 3.x they are a few times slower. For example, the same ParDo test
>>> > that takes approx. 8 minutes to run on Python 2.7 needs approx. 21 minutes
>>> > on Python 3.x. You can find all the results I gathered and the setup here.
>>> >
>>> > Do you know any possible reason for this? This issue makes it
>>> > impossible to do the migration, because of the limited resources on Jenkins
>>> > (almost every job would be aborted).
>>> >
>>> > Thanks,
>>> > Kamil
>>>
>>


Re: Request for new dockerhub repos

2020-01-10 Thread Hannah Jiang
Hi Udi

The repositories are created. Were you added as a maintainer? If not, we
need your docker hub user ID.

Thanks,
Hannah

On Thu, Jan 9, 2020 at 5:48 PM Udi Meiri  wrote:

> Hi,
> As part of the 2.18 release, we're adding 3 additional containers for
> Flink.
> I have write access but since I am not an owner I cannot create new repos.
>
> Could someone with access add these?
>
> flink1.7_job_server
> flink1.8_job_server
> flink1.9_job_server
>
> (go to https://hub.docker.com/repositories and select apachebeam from the
> dropdown)
>
> Thank you
>


Re: Poor Python 3.x performance on Dataflow?

2020-01-10 Thread Kamil Wasilewski
Our first Python3 performance test has just been implemented and we have
just started gathering results. Here[1] you can find dashboards with a
side-by-side comparison.
I also opened a Jira ticket to investigate the difference [2]. Anyone,
please feel free to assign it to yourself.
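
For a sense of scale, the figures from the start of this thread (approx.
8 minutes on Python 2.7 vs. approx. 21 minutes on Python 3.x for the same
ParDo test) work out to a slowdown of roughly 2.6x. A trivial sketch of
that calculation, with a hypothetical helper name:

```python
def slowdown(baseline_minutes, candidate_minutes):
    """Return how many times slower the candidate run is than the baseline."""
    return candidate_minutes / baseline_minutes


# Approximate runtimes quoted earlier in the thread, not new measurements.
print(slowdown(8, 21))
```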

Thanks,
Kamil

[1]
https://apache-beam-testing.appspot.com/explore?dashboard=5678187241537536
[2] https://issues.apache.org/jira/browse/BEAM-9085

On Mon, Dec 9, 2019 at 8:38 PM Valentyn Tymofieiev 
wrote:

> For now we should run Py3 and Py2 tests alongside each other to get a
> side-by-side comparison. I suggest we open a Jira ticket to investigate the
> difference in performance. We have limited performance test coverage on
> Python 3 in Beam, so more Py3 tests would help a lot here, thanks for
> adding them.
>
> On Fri, Dec 6, 2019 at 9:43 AM Robert Bradshaw 
> wrote:
>
>> This is very surprising--I would expect the times to be quite similar. Do
>> you have profiles for where the (difference in) time is spent? With
>> differences like these, I wonder if there are issues with container
>> setup (e.g. some things not being installed or cached) for Python 3.
>>
>> On Fri, Dec 6, 2019 at 9:06 AM Kamil Wasilewski
>>  wrote:
>> >
>> > Hi all,
>> >
>> > Python 2.7 won't be maintained past 2020 and that's why we want to
>> > migrate all Python performance tests in Beam from Python 2.7 to Python 3.7.
>> > However, I was surprised by seeing that after switching Dataflow tests to
>> > Python 3.x they are a few times slower. For example, the same ParDo test
>> > that takes approx. 8 minutes to run on Python 2.7 needs approx. 21 minutes
>> > on Python 3.x. You can find all the results I gathered and the setup here.
>> >
>> > Do you know any possible reason for this? This issue makes it
>> > impossible to do the migration, because of the limited resources on Jenkins
>> > (almost every job would be aborted).
>> >
>> > Thanks,
>> > Kamil
>>
>