Re: Jenkins jobs not running for my PR 10438

2020-01-15 Thread Rehman Murad Ali
Hi,

I would appreciate it if someone could run the following job for this PR.
https://github.com/apache/beam/pull/10316

Run Java Flink PortableValidatesRunner Streaming


*Thanks & Regards*



*Rehman Murad Ali*
Software Engineer
Mobile: +92 3452076766
Skype: rehman.muradali


On Thu, Jan 16, 2020 at 3:44 AM Andrew Pilloud  wrote:

> Done.
>
> Infra disabled our .asf.yaml file for being too large. Updates are here:
> https://issues.apache.org/jira/browse/INFRA-19670
>
> On Wed, Jan 15, 2020 at 2:40 PM Tomo Suzuki  wrote:
>
>> Hi Beam committers,
>>
>> Can somebody trigger the precommit checks for my new PR
>> https://github.com/apache/beam/pull/10603 ?
>>
>> This PR still does not trigger the checks. I confirmed that my account
>> is in the .asf.yaml.
>>
>> On Tue, Jan 14, 2020 at 9:48 PM Ahmet Altay  wrote:
>> >
>> > Done.
>> >
>> > +Kenneth Knowles, any updates from INFRA on this?
>> >
>> > On Tue, Jan 14, 2020 at 6:43 PM Tomo Suzuki  wrote:
>> >>
>> >> It hit Dataflow quota error again. Can somebody run
>> >> Run Dataflow ValidatesRunner
>> >> for https://github.com/apache/beam/pull/10554 ?
>> >>
>> >> On Tue, Jan 14, 2020 at 12:14 PM Tomo Suzuki 
>> wrote:
>> >> >
>> >> > Valentyn, thank you.
>> >> >
>> >> > On Tue, Jan 14, 2020 at 12:05 PM Valentyn Tymofieiev
>> >> >  wrote:
>> >> > >
>> >> > > Done. If tests still don't trigger, you could try to make a push
>> to the branch to reset the test status.
>> >> > >
>> >> > > On Tue, Jan 14, 2020 at 8:38 AM Tomo Suzuki 
>> wrote:
>> >> > >>
>> >> > >> Hi Beam developers,
>> >> > >>
>> >> > >> Can somebody run the following to
>> https://github.com/apache/beam/pull/10554 ?
>> >> > >> Run Dataflow ValidatesRunner
>> >> > >> Run Java PreCommit
>> >> > >>
>> >> > >> On Mon, Jan 13, 2020 at 2:35 PM Tomo Suzuki 
>> wrote:
>> >> > >> >
>> >> > >> > Thank you, Mark and Ismaël.
>> >> > >> >
>> >> > >> > On Mon, Jan 13, 2020 at 2:34 PM Mark Liu 
>> wrote:
>> >> > >> > >
>> >> > >> > > done
>> >> > >> > >
>> >> > >> > > On Mon, Jan 13, 2020 at 8:03 AM Tomo Suzuki <
>> suzt...@google.com> wrote:
>> >> > >> > >>
>> >> > >> > >> Thanks Yifan (but Java Precommit is still missing).
>> >> > >> > >> Can somebody run "Run Java PreCommit" on
>> >> > >> > >> https://github.com/apache/beam/pull/10554?
>> >> > >> > >>
>> >> > >> > >>
>> >> > >> > >> On Mon, Jan 13, 2020 at 2:59 AM Yifan Zou <
>> yifan...@google.com> wrote:
>> >> > >> > >> >
>> >> > >> > >> > done.
>> >> > >> > >> >
>> >> > >> > >> > On Sun, Jan 12, 2020 at 6:27 PM Tomo Suzuki <
>> suzt...@google.com> wrote:
>> >> > >> > >> >>
>> >> > >> > >> >> Hi Beam committers,
>> >> > >> > >> >>
>> >> > >> > >> >> Four Jenkins jobs did not report back for this PR
>> >> > >> > >> >> https://github.com/apache/beam/pull/10554 .
>> >> > >> > >> >> Can somebody trigger them?
>> >> > >> > >> >>
>> >> > >> > >> >> On Fri, Jan 10, 2020 at 4:51 PM Andrew Pilloud <
>> apill...@google.com> wrote:
>> >> > >> > >> >> >
>> >> > >> > >> >> > Done.
>> >> > >> > >> >> >
>> >> > >> > >> >> > On Fri, Jan 10, 2020 at 12:59 PM Tomo Suzuki <
>> suzt...@google.com> wrote:
>> >> > >> > >> >> >>
>> >> > >> > >> >> >> Hi Beam developers,
>> >> > >> > >> >> >>
>> >> > >> > >> >> >> I would appreciate it if a committer could trigger the
>> precommit build for
>> >> > >> > >> >> >> https://github.com/apache/beam/pull/10554.
>> >> > >> > >> >> >>
>> >> > >> > >> >> >> In addition to the normal precommit checks, I would like
>> the following:
>> >> > >> > >> >> >> Run Java PostCommit
>> >> > >> > >> >> >> Run Java HadoopFormatIO Performance Test
>> >> > >> > >> >> >> Run BigQueryIO Streaming Performance Test Java
>> >> > >> > >> >> >> Run Dataflow ValidatesRunner
>> >> > >> > >> >> >> Run Spark ValidatesRunner
>> >> > >> > >> >> >> Run SQL Postcommit
>> >> > >> > >> >> >>
>> >> > >> > >> >> >> Regards,
>> >> > >> > >> >> >> Tomo
>> >> > >> > >> >>
>> >> > >> > >> >>
>> >> > >> > >> >>
>> >> > >> > >> >> --
>> >> > >> > >> >> Regards,
>> >> > >> > >> >> Tomo
>> >> > >> > >>
>> >> > >> > >>
>> >> > >> > >>
>> >> > >> > >> --
>> >> > >> > >> Regards,
>> >> > >> > >> Tomo
>> >> > >> >
>> >> > >> >
>> >> > >> >
>> >> > >> > --
>> >> > >> > Regards,
>> >> > >> > Tomo
>> >> > >>
>> >> > >>
>> >> > >>
>> >> > >> --
>> >> > >> Regards,
>> >> > >> Tomo
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Regards,
>> >> > Tomo
>> >>
>> >>
>> >>
>> >> --
>> >> Regards,
>> >> Tomo
>>
>>
>>
>> --
>> Regards,
>> Tomo
>>
>


Re: Beam's Avro 1.8.x dependency

2020-01-15 Thread jincheng sun
I found that several dependencies are already shaded, with plans to turn them
into vendored artifacts in [1]. I'm not sure why Avro was not shaded before.
From my point of view, it's a good idea to shade Avro and make it a vendored
artifact if there are no special reasons blocking us from doing that.
Regarding how to create a vendored artifact, you can refer to [2] for more
details.

Best,
Jincheng

[1] https://issues.apache.org/jira/browse/BEAM-5819
[2] https://github.com/apache/beam/blob/master/vendor/README.md


Tomo Suzuki  于2020年1月16日周四 下午1:18写道:

> I've been upgrading dependencies around gRPC. This Avro-problem is
> interesting to me.
> I'll study BEAM-8388 more tomorrow.
>
> On Wed, Jan 15, 2020 at 10:51 PM Luke Cwik  wrote:
> >
> > +Tomo Suzuki +jincheng sun
> > There have been a few contributors upgrading the dependencies and
> validating that things do not break by running the majority of the
> post-commit integration tests and also using the linkage checker to show
> that we aren't worse off with respect to our dependency tree. Reaching out
> to them for help is your best bet for getting these upgrades through.
> >
> > On Wed, Jan 15, 2020 at 6:52 PM Aaron Dixon  wrote:
> >>
> >> I meant to mention that we must use Avro 1.9.x as we rely on some
> schema resolution fixes not present in 1.8.x - so am indeed blocked.
> >>
> >> On Wed, Jan 15, 2020 at 8:50 PM Aaron Dixon  wrote:
> >>>
> >>> It looks like Avro version dependency from Beam has come up in the
> past [1, 2].
> >>>
> >>> I'm currently on Beam 2.16.0, which has been compatible with my usage
> of Avro 1.9.x.
> >>>
> >>> But upgrading to Beam 2.17.0 is not possible for us now that 2.17.0
> has some dependencies on Avro classes only available in 1.8.x.
> >>>
> >>> Wondering if anyone else is similarly blocked and what it would take to
> prioritize Beam upgrading to 1.9.x, or better yet using a shaded version so
> that clients can use their own Avro version for their own coding purposes.
> (Eg, I parse Avro messages from a KafkaIO source and need 1.9.x for this
> but am perfectly happy if Beam's Avro coding facilities used a different,
> shaded version.)
> >>>
> >>> I've made a comment on BEAM-8388 [1] to this effect. But polling
> community for discussion.
> >>>
> >>> [1] https://issues.apache.org/jira/browse/BEAM-8388
> >>> [2] https://github.com/apache/beam/pull/9779
> >>>
>
>
> --
> Regards,
> Tomo
>


Re: Beam's Avro 1.8.x dependency

2020-01-15 Thread Tomo Suzuki
I've been upgrading dependencies around gRPC. This Avro-problem is
interesting to me.
I'll study BEAM-8388 more tomorrow.

On Wed, Jan 15, 2020 at 10:51 PM Luke Cwik  wrote:
>
> +Tomo Suzuki +jincheng sun
> There have been a few contributors upgrading the dependencies and validating
> that things do not break by running the majority of the post-commit
> integration tests and also using the linkage checker to show that we aren't
> worse off with respect to our dependency tree. Reaching out to them for help
> is your best bet for getting these upgrades through.
>
> On Wed, Jan 15, 2020 at 6:52 PM Aaron Dixon  wrote:
>>
>> I meant to mention that we must use Avro 1.9.x as we rely on some schema 
>> resolution fixes not present in 1.8.x - so am indeed blocked.
>>
>> On Wed, Jan 15, 2020 at 8:50 PM Aaron Dixon  wrote:
>>>
>>> It looks like Avro version dependency from Beam has come up in the past [1, 
>>> 2].
>>>
>>> I'm currently on Beam 2.16.0, which has been compatible with my usage of 
>>> Avro 1.9.x.
>>>
>>> But upgrading to Beam 2.17.0 is not possible for us now that 2.17.0 has 
>>> some dependencies on Avro classes only available in 1.8.x.
>>>
>>> Wondering if anyone else is similarly blocked and what it would take to
>>> prioritize Beam upgrading to 1.9.x, or better yet using a shaded version so
>>> that clients can use their own Avro version for their own coding purposes.
>>> (Eg, I parse Avro messages from a KafkaIO source and need 1.9.x for this
>>> but am perfectly happy if Beam's Avro coding facilities used a different,
>>> shaded version.)
>>>
>>> I've made a comment on BEAM-8388 [1] to this effect. But polling community 
>>> for discussion.
>>>
>>> [1] https://issues.apache.org/jira/browse/BEAM-8388
>>> [2] https://github.com/apache/beam/pull/9779
>>>


-- 
Regards,
Tomo


Re: Beam's Avro 1.8.x dependency

2020-01-15 Thread Luke Cwik
+Tomo Suzuki  +jincheng sun 
There have been a few contributors upgrading the dependencies and
validating that things do not break by running the majority of the post-commit
integration tests and also using the linkage checker to show that we aren't
worse off with respect to our dependency tree. Reaching out to them for help
is your best bet for getting these upgrades through.

On Wed, Jan 15, 2020 at 6:52 PM Aaron Dixon  wrote:

> I meant to mention that we must use Avro 1.9.x as we rely on some schema
> resolution fixes not present in 1.8.x - so am indeed blocked.
>
> On Wed, Jan 15, 2020 at 8:50 PM Aaron Dixon  wrote:
>
>> It looks like Avro version dependency from Beam has come up in the past
>> [1, 2].
>>
>> I'm currently on Beam 2.16.0, which has been compatible with my usage of
>> Avro 1.9.x.
>>
>> But upgrading to Beam 2.17.0 is not possible for us now that 2.17.0 has
>> some dependencies on Avro classes only available in 1.8.x.
>>
>> Wondering if anyone else is similarly blocked and what it would take to
>> prioritize Beam upgrading to 1.9.x, or better yet using a shaded version so
>> that clients can use their own Avro version for their own coding purposes.
>> (Eg, I parse Avro messages from a KafkaIO source and need 1.9.x for this
>> but am perfectly happy if Beam's Avro coding facilities used a different,
>> shaded version.)
>>
>> I've made a comment on BEAM-8388 [1] to this effect. But polling
>> community for discussion.
>>
>> [1] https://issues.apache.org/jira/browse/BEAM-8388
>> [2] https://github.com/apache/beam/pull/9779
>>
>>


DC Meetup

2020-01-15 Thread Austin Bennett
Hi All,

We have this meetup group:  https://www.meetup.com/DC-Apache-Beam/

Those who can give a talk, have space, etc., please do write, so we can help
get this community kicked off!

Cheers,
Austin


Re: [Proposal] Slowly Changing Dimensions and Distributed Map Side Inputs (in Dataflow)

2020-01-15 Thread Reza Rokni
+1 to this proposal; this is a very common pattern requirement from users.
The following workaround has already seen a lot of traction:

https://beam.apache.org/documentation/patterns/side-inputs/#slowly-updating-global-window-side-inputs

Making this process simpler for users, and available out of the box, would be
a great win!
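For readers unfamiliar with the linked workaround: the idea is to periodically re-read the lookup table into a global-window side input, so each main-input element is enriched against the most recent snapshot instead of doing a per-element read. A rough pure-Python sketch of the behavior follows; the class and names below are illustrative assumptions, not Beam APIs:

```python
import time

# Illustrative sketch (not Beam API code) of the "slowly updating global
# window side input" workaround: a lookup table is reloaded on a fixed
# interval, and events are enriched against the latest snapshot.

class SlowlyUpdatingSideInput:
    def __init__(self, load_table, refresh_interval_s):
        self._load_table = load_table            # stands in for a bounded read
        self._refresh_interval_s = refresh_interval_s
        self._snapshot = load_table()            # initial full snapshot
        self._last_refresh = time.monotonic()

    def lookup(self, key, now=None):
        now = time.monotonic() if now is None else now
        # Re-read the whole table only when the refresh interval elapses;
        # in between, every event sees the same (possibly stale) snapshot.
        if now - self._last_refresh >= self._refresh_interval_s:
            self._snapshot = self._load_table()
            self._last_refresh = now
        return self._snapshot.get(key)

def enrich(events, side_input):
    """Pair each event with its corresponding value from the side input."""
    return [(event, side_input.lookup(event)) for event in events]
```

The Beam version of this, as described on the linked page, uses a periodic impulse plus a global-window side input with a repeated trigger; the sketch above only captures the refresh-on-interval semantics, including the cost noted later in the thread of re-reading the full snapshot on every refresh.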

I would also mention that ideally we would cover large distributed side
inputs as well, but many of the core cases come down to side inputs that do
fit in memory. It is perhaps worth prioritizing the work so that the smaller
side input tables take precedence, unless of course the work will cover both
cases in the same way.

Cheers

Reza

On Thu, 19 Dec 2019 at 07:14, Kenneth Knowles  wrote:

> I do think that the implementation concerns around larger side inputs are
> relevant to most runners. Ideally there would be no model change necessary.
> Triggers are harder and bring in consistency concerns, which are even more
> likely to be relevant to all runners.
>
> Kenn
>
> On Wed, Dec 18, 2019 at 11:23 AM Luke Cwik  wrote:
>
>> Most of the doc is about how to support distributed side inputs in
>> Dataflow; it doesn't really cover how Beam model triggers (accumulating,
>> discarding, retraction) impact the "contents" of a PCollection over time,
>> nor how this proposal for a limited set of side input shapes can work to
>> support larger side inputs in Dataflow.
>>
>> On Tue, Dec 17, 2019 at 2:28 AM Jan Lukavský  wrote:
>>
>>> Hi Mikhail,
>>> On 12/17/19 10:43 AM, Mikhail Gryzykhin wrote:
>>>
>>> inline
>>>
>>> On Tue, Dec 17, 2019 at 12:59 AM Jan Lukavský  wrote:
>>>
 Hi,

 I actually thought that the proposal refers to Dataflow only. If this
 is supposed to be general, can we remove the Dataflow/Windmill specific
 parts and replace them with generic ones?

>>>  I'll look into rephrasing doc to keep Dataflow/Windmill as example.
>>>
>>> Cool, thanks!
>>>
>>> I'd have two more questions:

  a) the proposal is named "Slowly changing", why is the rate of change
 essential to the proposal? Once running on event time, that should not
 matter, or what am I missing?

>>> Within this proposal, it is suggested to make a full snapshot of the data
>>> on every re-read. This is generally expensive, and setting the refresh to a
>>> short interval might cause issues. Otherwise it is not essential.
>>>
>>> Understood. This relates to table-stream duality, where the requirements
>>> might relax once you don't have to convert table to stream by re-reading
>>> it, but by being able to retrieve updates as you go (example would be
>>> reading directly from kafka or any other "commit log" abstraction).
>>>
>>>  b) The description says: 'User wants to solve a stream enrichment
 problem. In brief request sounds like: ”I want to enrich each event in this
 stream by corresponding data from given table.”'. That is understandable,
 but would it be better to enable the user to express this intent directly
 (via Join operation)? The actual implementation might be runner (and
 input!) specific. The analogy is that when doing group-by-key operation,
 runner can choose hash grouping or sort-merge grouping, but that is not
 (directly) expressed in user code. I'm not saying that we should not have
 low-level transforms, just asking if it would be better to leave this
 decision to the runner (at least in some cases). It might be the case that
 we want to make core SDK as low level as possible (and as reasonable), I
 just want to make sure that that is really the intent.

>>> The idea is to add a basic operation with as small a change as possible to
>>> the current API.
>>> The ultimate goal is to have a Join/GBK operator that will choose the proper
>>> strategy. However, I don't think we yet have the proper tools and a view of
>>> how to choose the best strategy.
>>>
>>> OK, cool. That is where I would find it very much useful to have some
>>> sort of "goals", that we are targeting. I agree that there are some pieces
>>> missing in the puzzle as of now. But it would be good to know what these
>>> pieces are and what needs to be done to fulfill our goals. But this is
>>> probably not related to discussion of this proposal, but more related to
>>> the concept of BIP or similar.
>>>
>>> Thanks for the explanation.
>>>
>>> Thanks for the proposal!

 Jan
 On 12/17/19 12:01 AM, Kenneth Knowles wrote:

 I want to highlight that this design definitely works for more runners
 than just Dataflow. I see two pieces of it that I want to bring onto the
 thread:

 1. A new kind of "unbounded source" which is a periodic refresh of a
 bounded source, and use that as a side input. Each main input element has a
 window that maps to a specific refresh of the side input.
 2. Distributed map side inputs: supporting very large lookup tables,
 but with consistency challenges. Even the part 

Re: Beam's Avro 1.8.x dependency

2020-01-15 Thread Aaron Dixon
I meant to mention that we must use Avro 1.9.x as we rely on some schema
resolution fixes not present in 1.8.x - so am indeed blocked.

On Wed, Jan 15, 2020 at 8:50 PM Aaron Dixon  wrote:

> It looks like Avro version dependency from Beam has come up in the past
> [1, 2].
>
> I'm currently on Beam 2.16.0, which has been compatible with my usage of
> Avro 1.9.x.
>
> But upgrading to Beam 2.17.0 is not possible for us now that 2.17.0 has
> some dependencies on Avro classes only available in 1.8.x.
>
> Wondering if anyone else is similarly blocked and what it would take to
> prioritize Beam upgrading to 1.9.x, or better yet using a shaded version so
> that clients can use their own Avro version for their own coding purposes.
> (Eg, I parse Avro messages from a KafkaIO source and need 1.9.x for this
> but am perfectly happy if Beam's Avro coding facilities used a different,
> shaded version.)
>
> I've made a comment on BEAM-8388 [1] to this effect. But polling community
> for discussion.
>
> [1] https://issues.apache.org/jira/browse/BEAM-8388
> [2] https://github.com/apache/beam/pull/9779
>
>


Beam's Avro 1.8.x dependency

2020-01-15 Thread Aaron Dixon
It looks like Avro version dependency from Beam has come up in the past [1,
2].

I'm currently on Beam 2.16.0, which has been compatible with my usage of
Avro 1.9.x.

But upgrading to Beam 2.17.0 is not possible for us now that 2.17.0 has
some dependencies on Avro classes only available in 1.8.x.

Wondering if anyone else is similarly blocked, and what it would take to
prioritize Beam upgrading to 1.9.x, or better yet using a shaded version so
that clients can use their own Avro version for their own coding purposes.
(E.g., I parse Avro messages from a KafkaIO source and need 1.9.x for this,
but am perfectly happy if Beam's Avro coding facilities used a different,
shaded version.)

I've made a comment on BEAM-8388 [1] to this effect, but am also polling the
community for discussion.

[1] https://issues.apache.org/jira/browse/BEAM-8388
[2] https://github.com/apache/beam/pull/9779


Re: [BEAM-9015] Adding pyXX-cloud instead of pyXX-gcp and pyXX-aws

2020-01-15 Thread Udi Meiri
We would like to run unit tests using these dependencies (gcp+aws) in
presubmits.
Having separate tox environments for each would require running tox for
each, increasing presubmit time and duplicating work (since most tests
don't depend on aws or gcp).


On Wed, Jan 15, 2020 at 1:54 PM Kyle Weaver  wrote:

> Just now seeing this -- are we sure we want to mix the signal from what
> are logically two totally separate test suites?
> Or from the opposite perspective, what's the motivation for wanting one
> test suite instead of two?
>
> On Tue, Jan 14, 2020 at 3:25 PM Pablo Estrada  wrote:
>
>> Now back from the holidays, I intend to do this one of these days.
>>
>> On Thu, Dec 26, 2019 at 12:51 PM Udi Meiri  wrote:
>>
>>> +1
>>>
>>> On Mon, Dec 23, 2019, 17:28 Robert Bradshaw  wrote:
>>>
 Makes sense to me.

 On Mon, Dec 23, 2019 at 3:33 PM Pablo Estrada 
 wrote:
 >
 > Hi all,
 > a couple of contributors [1][2] have been kind enough to add support
 for s3 filesystem[3] for the Python SDK. Part of this involved adding a tox
 task called py37-aws, to install the relevant dependencies and run unit
 tests for it (in a mocked-out environment).
 >
 > To avoid running a full extra test suite, I thought we could add the
 new aws-related dependencies to the current pyXX-gcp suites, and perhaps
 rename to pyXX-cloud, to include all unit tests that require cloud-specific
 dependencies. What do others think?
 >
 > This is tracked here: https://jira.apache.org/jira/browse/BEAM-9015
 >
 > [1] https://github.com/tamera-lanham
 > [2] https://github.com/MattMorgis
 > [3] https://github.com/apache/beam/pull/9955

>>>




Re: [VOTE] Release 2.18.0, release candidate #1

2020-01-15 Thread Ahmet Altay
+1, validated the same things, they still work. Thank you.

On Wed, Jan 15, 2020 at 5:01 PM Udi Meiri  wrote:

> Dataflow containers have been updated. Test away.
>
> On Tue, Jan 14, 2020 at 6:37 PM Udi Meiri  wrote:
>
>> Here's my second take:
>>
>> Hi everyone,
>> Please review and vote on the release candidate #1 for the version
>> 2.18.0, as follows:
>> [ ] +1, Approve the release
>> [ ] -1, Do not approve the release (please provide specific comments)
>>
>> The complete staging area is available for your review, which includes:
>> * JIRA release notes [1],
>> * the official Apache source release to be deployed to dist.apache.org
>> [2], which is signed with the key with fingerprint 8961 F3EF 8E79 6688 4067
>>  87CF 587B 049C 36DA AFE6 [3],
>> * all artifacts to be deployed to the Maven Central Repository [4],
>> * source code tag "v2.18.0-RC1" [5],
>> * website pull request listing the release [6], publishing the API
>> reference manual [7], and the blog post [8].
>> * Java artifacts were built with Maven N/A and OpenJDK
>> 1.8.0_181-google-v7.
>> * Python artifacts are deployed along with the source release to the
>> dist.apache.org [2].
>> * Validation sheet with a tab for 2.18.0 release to help with validation
>> [9].
>> * Docker images published to Docker Hub [10].
>>
>> The vote will be open for at least 72 hours. It is adopted by majority
>> approval, with at least 3 PMC affirmative votes.
>> NOTE: The vote will start once new Dataflow containers are built.
>>
>> Thanks,
>> Release Manager
>>
>> [1]
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12346383=12319527
>> [2] https://dist.apache.org/repos/dist/dev/beam/2.18.0/
>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>> [4]
>> https://repository.apache.org/content/repositories/orgapachebeam-1090/
>> [5] https://github.com/apache/beam/tree/v2.18.0-RC1
>> [6] https://github.com/apache/beam/pull/10574
>> [7] https://github.com/apache/beam-site/pull/595
>> [8] https://github.com/apache/beam/pull/10575
>> [9]
>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1178617819
>> [10] https://hub.docker.com/u/apachebeam
>>
>>
>> On Tue, Jan 14, 2020 at 6:34 PM Udi Meiri  wrote:
>>
>>> Please don't do any Dataflow-based verifications yet, because we'll have
>>> to redo them once new Dataflow containers are built.
>>>
>>> On Tue, Jan 14, 2020 at 6:27 PM Ahmet Altay  wrote:
>>>
 I verified python 2 quickstarts with batch and streaming pipelines,
 wheel files, and reviewed changes to the blog/website.

 Udi, could you send an updated version of the voting text with TODOs,
 template pieces removed? We can discuss changes to the template separately.
 My vote is +1 pending an updated vote text.

 On Tue, Jan 14, 2020 at 4:47 PM Udi Meiri  wrote:

> Sorry about the messiness.
> The links at the bottom should be correct though.
>
> I intentionally did not replace MAVEN_VERSION because I didn't know
> how to get it (I didn't execute mvn for the release).
> As for JDK_VERSION, do we still need that? (If so, what about Python
> versions, such as the ones used for testing?)
> javac -version on my machine is 1.8.0_181-google-v7
>

 I believe we can drop MAVEN_VERSION now that it is no longer used. I do
 not think it is needed to add a Gradle version either because the version
 itself is part of the repo anyway.

 I do not know if java, python etc. versions are helpful. Maybe others
 can comment. I would prefer to reduce the load on the release manager and
 drop this if this is not particularly important.


>
>
> On Mon, Jan 13, 2020 at 7:37 PM Valentyn Tymofieiev <
> valen...@google.com> wrote:
>
>> There are some issues in this message; part of the message is still a
>> template (1.2.3, TODO, MAVEN_VERSION).
>> Before I noticed these issues, I ran a few Batch and Streaming Python
>> 3.7 pipelines using Direct and Dataflow runners, and they all succeeded.
>>
>> On Mon, Jan 13, 2020 at 4:09 PM Udi Meiri  wrote:
>>
>>> Hi everyone,
>>> Please review and vote on the release candidate #3 for the version
>>> 1.2.3, as follows:
>>> [ ] +1, Approve the release
>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>
>>>
>>> The complete staging area is available for your review, which
>>> includes:
>>> * JIRA release notes [1],
>>> * the official Apache source release to be deployed to
>>> dist.apache.org [2], which is signed with the key with fingerprint
>>> 8961 F3EF 8E79 6688 4067  87CF 587B 049C 36DA AFE6 [3],
>>> * all artifacts to be deployed to the Maven Central Repository [4],
>>> * source code tag "v1.2.3-RC3" [5],
>>>
>>
 Tag is "v2.18.0-RC1". This is correct in the referenced link.


> * website pull request listing the release [6], 

Re: [VOTE] Release 2.18.0, release candidate #1

2020-01-15 Thread Udi Meiri
Dataflow containers have been updated. Test away.

On Tue, Jan 14, 2020 at 6:37 PM Udi Meiri  wrote:

> Here's my second take:
>
> Hi everyone,
> Please review and vote on the release candidate #1 for the version 2.18.0,
> as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
> The complete staging area is available for your review, which includes:
> * JIRA release notes [1],
> * the official Apache source release to be deployed to dist.apache.org
> [2], which is signed with the key with fingerprint 8961 F3EF 8E79 6688 4067
>  87CF 587B 049C 36DA AFE6 [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v2.18.0-RC1" [5],
> * website pull request listing the release [6], publishing the API
> reference manual [7], and the blog post [8].
> * Java artifacts were built with Maven N/A and OpenJDK 1.8.0_181-google-v7.
> * Python artifacts are deployed along with the source release to the
> dist.apache.org [2].
> * Validation sheet with a tab for 2.18.0 release to help with validation
> [9].
> * Docker images published to Docker Hub [10].
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
> NOTE: The vote will start once new Dataflow containers are built.
>
> Thanks,
> Release Manager
>
> [1]
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12346383=12319527
> [2] https://dist.apache.org/repos/dist/dev/beam/2.18.0/
> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> [4] https://repository.apache.org/content/repositories/orgapachebeam-1090/
> [5] https://github.com/apache/beam/tree/v2.18.0-RC1
> [6] https://github.com/apache/beam/pull/10574
> [7] https://github.com/apache/beam-site/pull/595
> [8] https://github.com/apache/beam/pull/10575
> [9]
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1178617819
> [10] https://hub.docker.com/u/apachebeam
>
>
> On Tue, Jan 14, 2020 at 6:34 PM Udi Meiri  wrote:
>
>> Please don't do any Dataflow-based verifications yet, because we'll have
>> to redo them once new Dataflow containers are built.
>>
>> On Tue, Jan 14, 2020 at 6:27 PM Ahmet Altay  wrote:
>>
>>> I verified python 2 quickstarts with batch and streaming pipelines,
>>> wheel files, and reviewed changes to the blog/website.
>>>
>>> Udi, could you send an updated version of the voting text with TODOs,
>>> template pieces removed? We can discuss changes to the template separately.
>>> My vote is +1 pending an updated vote text.
>>>
>>> On Tue, Jan 14, 2020 at 4:47 PM Udi Meiri  wrote:
>>>
 Sorry about the messiness.
 The links at the bottom should be correct though.

 I intentionally did not replace MAVEN_VERSION because I didn't know how
 to get it (I didn't execute mvn for the release).
 As for JDK_VERSION, do we still need that? (If so, what about Python
 versions, such as the ones used for testing?)
 javac -version on my machine is 1.8.0_181-google-v7

>>>
>>> I believe we can drop MAVEN_VERSION now that it is no longer used. I do
>>> not think it is needed to add a Gradle version either because the version
>>> itself is part of the repo anyway.
>>>
>>> I do not know if java, python etc. versions are helpful. Maybe others
>>> can comment. I would prefer to reduce the load on the release manager and
>>> drop this if this is not particularly important.
>>>
>>>


 On Mon, Jan 13, 2020 at 7:37 PM Valentyn Tymofieiev <
 valen...@google.com> wrote:

> There are some issues in this message; part of the message is still a
> template (1.2.3, TODO, MAVEN_VERSION).
> Before I noticed these issues, I ran a few Batch and Streaming Python
> 3.7 pipelines using Direct and Dataflow runners, and they all succeeded.
>
> On Mon, Jan 13, 2020 at 4:09 PM Udi Meiri  wrote:
>
>> Hi everyone,
>> Please review and vote on the release candidate #3 for the version
>> 1.2.3, as follows:
>> [ ] +1, Approve the release
>> [ ] -1, Do not approve the release (please provide specific comments)
>>
>>
>> The complete staging area is available for your review, which
>> includes:
>> * JIRA release notes [1],
>> * the official Apache source release to be deployed to
>> dist.apache.org [2], which is signed with the key with fingerprint
>> 8961 F3EF 8E79 6688 4067  87CF 587B 049C 36DA AFE6 [3],
>> * all artifacts to be deployed to the Maven Central Repository [4],
>> * source code tag "v1.2.3-RC3" [5],
>>
>
>>> Tag is "v2.18.0-RC1". This is correct in the referenced link.
>>>
>>>
 * website pull request listing the release [6], publishing the API
>> reference manual [7], and the blog post [8].
>> * Java artifacts were built with Maven MAVEN_VERSION and
>> OpenJDK/Oracle JDK JDK_VERSION.
>> TODO: do these versions matter, and are they stamped 

Re: Ordering of element timestamp change and window function

2020-01-15 Thread Ankur Goenka
The case where a plain vanilla value or a windowed value is emitted behaves as
expected, as the user intent is honored without any surprises.

If I understand correctly, in the case where the timestamp is changed,
applying the window function again can have unintended behavior in the
following cases:
* Custom windows: user code can be executed in an unintended order.
* The user emits a windowed value in a previous transform: timestamping the
value in this case would overwrite the user-assigned window from the earlier
step even when the actual timestamp is the same. Semantically, emitting an
element or a timestamped value with the same timestamp should have the same
behavior.

What do you think?


On Wed, Jan 15, 2020 at 4:04 PM Robert Bradshaw  wrote:

> If an element is emitted with a timestamp, the window assignment is
> re-applied at that time. At least that's how it is in Python. You can
> emit the full windowed value (accepted without checking...), a
> timestamped value (in which case the window will be computed), or a
> plain old element (in which case the window and timestamp will be
> computed (really, propagated)).
>
> On Wed, Jan 15, 2020 at 3:51 PM Ankur Goenka  wrote:
> >
> > Yup, this might result in unintended behavior when the timestamp is changed
> after the window assignment, as elements in windows would no longer have
> timestamps within the window's time range.
> >
> > Shall we start validating at least one window assignment between
> timestamp assignment and GBK/triggers to avoid the unintended behaviors
> mentioned above?
> >
> > On Wed, Jan 15, 2020 at 1:24 PM Luke Cwik  wrote:
> >>
> >> Window assignment happens at the point in the pipeline the WindowInto
> transform was applied. So in this case the window would have been assigned
> using the original timestamp.
> >>
> >> Grouping is by key and window.
> >>
> >> On Tue, Jan 14, 2020 at 7:30 PM Ankur Goenka  wrote:
> >>>
> >>> Hi,
> >>>
> > I am not sure what effect the ordering of an element timestamp change and
> window assignment has on a GroupByKey.
> > More specifically, what would be the behavior if we apply window ->
> change element timestamp -> GroupByKey?
> > I think we should always apply the window function after changing the
> timestamp of elements, though this is neither checked nor a recommended
> practice in Beam.
> >>>
> >>> Example pipeline would look like this:
> >>>
> >>>   def applyTimestamp(value):
> >>> return window.TimestampedValue(('key', value), int(time.time()))
> >>>
> >>> p \
> >>> | 'Create' >> beam.Create(range(0, 10)) \
> >>> | 'Fixed Window' >>
> beam.WindowInto(window.FixedWindows(5)) \
> >>> | 'Apply Timestamp' >> beam.Map(applyTimestamp) \ #
> Timestamp is changed after windowing and before GBK
> >>> | 'Group By Key' >> beam.GroupByKey() \
> >>> | 'Print' >> beam.Map(print)
> >>>
> >>> Thanks,
> >>> Ankur
>


Re: NYC ? (or more generally East Coast)

2020-01-15 Thread Jennifer Melot
I'd be more than happy to contribute in any way to a DC meetup (would be
awesome if that existed!), including a talk on how we've been using Beam at
my org if that would be useful. I'm less confident about making an NYC
meetup but would love to stay in the loop anyway.

Jennifer

On Wed, Jan 15, 2020 at 6:09 PM Austin Bennett 
wrote:

> Awesome; writing directly to get down to specifics.
>
> Anyone else?
>
> On Mon, Jan 13, 2020 at 1:51 PM Suneel Marthi  wrote:
> >
> > I can do talks in either DC or NYC meetups.  I can coordinate with
> CapitalOne to see if they would be willing to host the DC meetup.
> >
> > On Mon, Jan 13, 2020 at 4:02 PM Austin Bennett <
> whatwouldausti...@gmail.com> wrote:
> >>
> >> Hi Devs and Users,
> >>
> >> We are looking for speakers for future Meetups and Events.  Who is
> >> building cool things with Beam?  We are looking at hosting a Meetup at
> >> Spotify in February, and ideally keep some meetups going throughout
> >> the year.  For this to occur, we need to hear about what people are
> >> working on!  Even if only a small/lightning talk, etc, do reach out!
> >> Let's figure something out.
> >>
> >> Cheers,
> >> Austin
> >>
> >> P.S.  https://www.meetup.com/New-York-Apache-Beam/
> >>
> >> P.P.S.  We also have budding communities in DC and Boston, and will
> >> eventually write in separate threads on those.
>




Re: Ordering of element timestamp change and window function

2020-01-15 Thread Robert Bradshaw
If an element is emitted with a timestamp, the window assignment is
re-applied at that time. At least that's how it is in Python. You can
emit the full windowed value (accepted without checking...), a
timestamped value (in which case the window will be computed), or a
plain old element (in which case the window and timestamp will be
computed (really, propagated)).

On Wed, Jan 15, 2020 at 3:51 PM Ankur Goenka  wrote:
>
> Yup, this might result in unintended behavior: since the timestamp is
> changed after the window assignment, elements can end up in windows whose
> time range does not contain their timestamp.
>
> Shall we start requiring at least one window assignment between timestamp
> assignment and GBK/triggers to avoid the unintended behaviors mentioned above?
>
> On Wed, Jan 15, 2020 at 1:24 PM Luke Cwik  wrote:
>>
>> Window assignment happens at the point in the pipeline the WindowInto 
>> transform was applied. So in this case the window would have been assigned 
>> using the original timestamp.
>>
>> Grouping is by key and window.
>>
>> On Tue, Jan 14, 2020 at 7:30 PM Ankur Goenka  wrote:
>>>
>>> Hi,
>>>
>>> I am not sure what effect the order of element timestamp change and
>>> window association has on a group by key.
>>> More specifically, what would the behavior be if we apply window ->
>>> change element timestamp -> group by key?
>>> I think we should always apply the window function after changing the
>>> timestamps of elements, though this is neither checked nor a recommended
>>> practice in Beam.
>>>
>>> Example pipeline would look like this:
>>>
>>>   def applyTimestamp(value):
>>>     return window.TimestampedValue(('key', value), int(time.time()))
>>>
>>>   (p
>>>    | 'Create' >> beam.Create(range(0, 10))
>>>    | 'Fixed Window' >> beam.WindowInto(window.FixedWindows(5))
>>>    # Timestamp is changed after windowing and before GBK
>>>    | 'Apply Timestamp' >> beam.Map(applyTimestamp)
>>>    | 'Group By Key' >> beam.GroupByKey()
>>>    | 'Print' >> beam.Map(print))
>>>
>>> Thanks,
>>> Ankur


Re: Ordering of element timestamp change and window function

2020-01-15 Thread Ankur Goenka
Yup, this might result in unintended behavior: since the timestamp is
changed after the window assignment, elements can end up in windows whose
time range does not contain their timestamp.

Shall we start requiring at least one window assignment between timestamp
assignment and GBK/triggers to avoid the unintended behaviors mentioned above?
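
To make the hazard concrete, here is a minimal pure-Python sketch (not
actual Beam code), assuming FixedWindows(size) semantics where a timestamp
t falls in the window [t - t mod size, t - t mod size + size):

```python
def fixed_window(timestamp, size):
    # Sketch of FixedWindows assignment: return the [start, end) window
    # that contains `timestamp`.
    start = timestamp - (timestamp % size)
    return (start, start + size)

# An element whose timestamp is 3 at WindowInto time lands in window [0, 5).
win = fixed_window(3, 5)
assert win == (0, 5)

# If a later transform rewrites the timestamp to 12 without re-windowing,
# the element still sits in [0, 5), which excludes its new timestamp.
new_ts = 12
assert not (win[0] <= new_ts < win[1])
```

This is exactly the state a GBK would then group on: the element's window
no longer contains its timestamp.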

On Wed, Jan 15, 2020 at 1:24 PM Luke Cwik  wrote:

> Window assignment happens at the point in the pipeline the WindowInto
> transform was applied. So in this case the window would have been assigned
> using the original timestamp.
>
> Grouping is by key and window.
>
> On Tue, Jan 14, 2020 at 7:30 PM Ankur Goenka  wrote:
>
>> Hi,
>>
>> I am not sure what effect the order of element timestamp change and
>> window association has on a group by key.
>> More specifically, what would the behavior be if we apply window ->
>> change element timestamp -> group by key?
>> I think we should always apply the window function after changing the
>> timestamps of elements, though this is neither checked nor a recommended
>> practice in Beam.
>>
>> Example pipeline would look like this:
>>
>>   def applyTimestamp(value):
>>     return window.TimestampedValue(('key', value), int(time.time()))
>>
>>   (p
>>    | 'Create' >> beam.Create(range(0, 10))
>>    | 'Fixed Window' >> beam.WindowInto(window.FixedWindows(5))
>>    # Timestamp is changed after windowing and before GBK
>>    | 'Apply Timestamp' >> beam.Map(applyTimestamp)
>>    | 'Group By Key' >> beam.GroupByKey()
>>    | 'Print' >> beam.Map(print))
>>
>> Thanks,
>> Ankur
>>
>


Re: NYC ? (or more generally East Coast)

2020-01-15 Thread Austin Bennett
Awesome; writing directly to get down to specifics.

Anyone else?

On Mon, Jan 13, 2020 at 1:51 PM Suneel Marthi  wrote:
>
> I can do talks in either DC or NYC meetups.  I can coordinate with CapitalOne 
> to see if they would be willing to host the DC meetup.
>
> On Mon, Jan 13, 2020 at 4:02 PM Austin Bennett  
> wrote:
>>
>> Hi Devs and Users,
>>
>> We are looking for speakers for future Meetups and Events.  Who is
>> building cool things with Beam?  We are looking at hosting a Meetup at
>> Spotify in February, and ideally keep some meetups going throughout
>> the year.  For this to occur, we need to hear about what people are
>> working on!  Even if only a small/lightning talk, etc, do reach out!
>> Let's figure something out.
>>
>> Cheers,
>> Austin
>>
>> P.S.  https://www.meetup.com/New-York-Apache-Beam/
>>
> >> P.P.S.  We also have budding communities in DC and Boston, and will
>> eventually write in separate threads on those.


Re: Please comment on draft comms strategy by Oct 16

2020-01-15 Thread Austin Bennett
Hi Kenn,

We had a workshop on this very topic (how to contribute to Beam) at our
Berlin Summit: https://www.youtube.com/watch?v=PtPslSdAPcM  There is
certainly room for me (or anyone) to clean up and formalize that a bit
more. Though the view count on that is relatively small, which points
either to a lack of appetite and/or to it not being well publicized (I
suspect both).

Cheers,
Austin

On Thu, Jan 9, 2020 at 6:53 PM Kenneth Knowles  wrote:

> Wow, this is great work. I looked at the graphical maps when you sent them
> but forgot to reply on thread. They really distill a lot of possibilities
> and help to think about the current state.
>
> These three action items seem good and doable. Thanks for highlighting
> those. The only one that isn't obvious to me is "workshop on how to
> contribute to Beam". Is there enough appetite / audience to make this a
> workshop? What forms could this take? A live coding demonstration in a
> normal talk slot at an OSS or data conference seems like a possibility.
> Whatever we do, we should record and distribute for sure, because when
> someone wants to contribute, they need to find the resources at that moment.
>
> Kenn
>
> On Wed, Jan 8, 2020 at 1:01 PM María Cruz  wrote:
>
>> Hi everyone,
>> I'm writing to send an update about the communication strategy for Beam.
>> In a nutshell, I have 3 proposed changes (copied from the md file here:
>> https://github.com/macruzbar/beam/blob/master/Communication-strategy-DRAFT.md
>> ).
>>
>> While all the channels are connected to a specific function in the short,
>> medium, and long term, some areas have redundancies, and some other areas
>> could use more exposure. In order to continue to grow the project, there
>> are 3 proposals we need to focus on (click on the link on each section to
>> read more):
>>
>> 1. Blog post categories, frequency and distribution.
>> 
>>  Reduce
>> to one blogging space (on Beam Website). Incorporate 3 categories to the
>> blog: Apache Beam summit, Apache Beam use cases, and Your journey as a
>> contributor.
>>
>> 2. Develop more in-person and digital workshops.
>> 
>>  Two
>> workshop types: how to use Beam, and how to contribute to Beam.
>>
>> 3. Increase distribution of tech talks.
>> 
>>
>>- Embed tech talks in the Beam website, and the Beam Summit website,
>>- Share talks on @ApacheBeam Twitter handle
>>- Curate a list of talks by topic, and write blog posts to share
>>curated talks (1 blog every 3 months),
>>- Distribute copy via email to users@ and dev@ mailing lists.
>>- Create playlists on YouTube channel. Create one view for subscribed
>>users (featuring latest content), and one for non-subscribed users
>>(featuring most popular content). One playlist per event, plus 1 playlist
>>for use cases.
>>
>> The communications plan also has a section about what to do when there is
>> new content about Apache Beam:
>>
>>
>> Where should I share new content about Apache Beam?
>>
>> Main outreach channels: Beam Website, Beam website blog, Twitter handles
>> (x2), and mailing lists (x2).
>>
>> Blogpost: Beam Website ✓, blog ✓, Twitter ✓, mailing lists ✓ (share the
>> first paragraph in the email, and link to the blog)
>> Tech talk: Beam Website ✓ (special section for tech talks), blog ✓ (curate
>> tech talks quarterly in a blog), Twitter ✓, mailing lists ✓ (share blogs on
>> a quarterly basis when published)
>> Trainings: Beam Website ✓ (under section “Get started”), Twitter ✓, mailing
>> lists ✓
>>
>> If you have any questions or comments about this logic model, please add
>> them here .
>>
>> I'm happy to support some of the proposed changes if there is interest in
>> moving forward with them. If no further changes are requested, I will
>> proceed to add these files through a pull request.
>>
>>
>> Pablo, thank you for the feedback. Some comments in line.
>>
>> On Mon, Dec 16, 2019 at 4:31 PM Pablo Estrada  wrote:
>>
>>> Hi Maria,
>>> thanks for putting this together. It's large, so a bit hard to follow -
>>> but that's probably just an indication of the complexity of Beam's
>>> communication channels.
>>> FWIW, I found that reading from the top down was useful (or rather, from
>>> the top, backwards from the arrows).
>>>
>>> It is an interesting logical map. I am curious to see what we can derive
>>> from it : )
>>>
>>> My comments:
>>> - All of the factors that contribute to project transparency, when put
>>> together, add up to increased use of Beam, but I feel like 'increased
>>> knowledge of how to use beam' is the single largest contributor to that. Is
>>> 

Re: Jenkins jobs not running for my PR 10438

2020-01-15 Thread Andrew Pilloud
Done.

Infra shut our .asf.yaml file off for being too large. Updates are here:
https://issues.apache.org/jira/browse/INFRA-19670

On Wed, Jan 15, 2020 at 2:40 PM Tomo Suzuki  wrote:

> Hi Beam committers,
>
> Can somebody trigger the precommit checks for my new PR
> https://github.com/apache/beam/pull/10603 ?
>
> This PR still does not trigger the checks. I confirmed that my account
> is in the .asf.yaml.
>
> On Tue, Jan 14, 2020 at 9:48 PM Ahmet Altay  wrote:
> >
> > Done.
> >
> > +Kenneth Knowles, any updates from INFRA on this?
> >
> > On Tue, Jan 14, 2020 at 6:43 PM Tomo Suzuki  wrote:
> >>
> >> It hit Dataflow quota error again. Can somebody run
> >> Run Dataflow ValidatesRunner
> >> for https://github.com/apache/beam/pull/10554 ?
> >>
> >> On Tue, Jan 14, 2020 at 12:14 PM Tomo Suzuki 
> wrote:
> >> >
> >> > Valentyn, thank you.
> >> >
> >> > On Tue, Jan 14, 2020 at 12:05 PM Valentyn Tymofieiev
> >> >  wrote:
> >> > >
> >> > > Done. If tests still don't trigger, you could try to make a push to
> the branch to reset the test status.
> >> > >
> >> > > On Tue, Jan 14, 2020 at 8:38 AM Tomo Suzuki 
> wrote:
> >> > >>
> >> > >> Hi Beam developers,
> >> > >>
> >> > >> Can somebody run the following to
> https://github.com/apache/beam/pull/10554 ?
> >> > >> Run Dataflow ValidatesRunner
> >> > >> Run Java PreCommit
> >> > >>
> >> > >> On Mon, Jan 13, 2020 at 2:35 PM Tomo Suzuki 
> wrote:
> >> > >> >
> >> > >> > Thank you, Mark and Ismaël.
> >> > >> >
> >> > >> > On Mon, Jan 13, 2020 at 2:34 PM Mark Liu 
> wrote:
> >> > >> > >
> >> > >> > > done
> >> > >> > >
> >> > >> > > On Mon, Jan 13, 2020 at 8:03 AM Tomo Suzuki <
> suzt...@google.com> wrote:
> >> > >> > >>
> >> > >> > >> Thanks Yifan (but Java Precommit is still missing).
> >> > >> > >> Can somebody run "Run Java PreCommit" on
> >> > >> > >> https://github.com/apache/beam/pull/10554?
> >> > >> > >>
> >> > >> > >>
> >> > >> > >> On Mon, Jan 13, 2020 at 2:59 AM Yifan Zou <
> yifan...@google.com> wrote:
> >> > >> > >> >
> >> > >> > >> > done.
> >> > >> > >> >
> >> > >> > >> > On Sun, Jan 12, 2020 at 6:27 PM Tomo Suzuki <
> suzt...@google.com> wrote:
> >> > >> > >> >>
> >> > >> > >> >> Hi Beam committers,
> >> > >> > >> >>
> >> > >> > >> >> Four Jenkins jobs did not report back for this PR
> >> > >> > >> >> https://github.com/apache/beam/pull/10554 .
> >> > >> > >> >> Can somebody trigger them?
> >> > >> > >> >>
> >> > >> > >> >> On Fri, Jan 10, 2020 at 4:51 PM Andrew Pilloud <
> apill...@google.com> wrote:
> >> > >> > >> >> >
> >> > >> > >> >> > Done.
> >> > >> > >> >> >
> >> > >> > >> >> > On Fri, Jan 10, 2020 at 12:59 PM Tomo Suzuki <
> suzt...@google.com> wrote:
> >> > >> > >> >> >>
> >> > >> > >> >> >> Hi Beam developers,
> >> > >> > >> >> >>
> >> > >> > >> >> >> I'd appreciate it if a committer could trigger the precommit build for
> >> > >> > >> >> >> https://github.com/apache/beam/pull/10554.
> >> > >> > >> >> >>
> >> > >> > >> >> >> In addition to the normal precommit checks, I'd like the
> >> > >> > >> >> >> following:
> >> > >> > >> >> >> Run Java PostCommit
> >> > >> > >> >> >> Run Java HadoopFormatIO Performance Test
> >> > >> > >> >> >> Run BigQueryIO Streaming Performance Test Java
> >> > >> > >> >> >> Run Dataflow ValidatesRunner
> >> > >> > >> >> >> Run Spark ValidatesRunner
> >> > >> > >> >> >> Run SQL Postcommit
> >> > >> > >> >> >>
> >> > >> > >> >> >> Regards,
> >> > >> > >> >> >> Tomo
> >> > >> > >> >>
> >> > >> > >> >>
> >> > >> > >> >>
> >> > >> > >> >> --
> >> > >> > >> >> Regards,
> >> > >> > >> >> Tomo
> >> > >> > >>
> >> > >> > >>
> >> > >> > >>
> >> > >> > >> --
> >> > >> > >> Regards,
> >> > >> > >> Tomo
> >> > >> >
> >> > >> >
> >> > >> >
> >> > >> > --
> >> > >> > Regards,
> >> > >> > Tomo
> >> > >>
> >> > >>
> >> > >>
> >> > >> --
> >> > >> Regards,
> >> > >> Tomo
> >> >
> >> >
> >> >
> >> > --
> >> > Regards,
> >> > Tomo
> >>
> >>
> >>
> >> --
> >> Regards,
> >> Tomo
>
>
>
> --
> Regards,
> Tomo
>


Re: Jenkins jobs not running for my PR 10438

2020-01-15 Thread Tomo Suzuki
Hi Beam committers,

Can somebody trigger the precommit checks for my new PR
https://github.com/apache/beam/pull/10603 ?

This PR still does not trigger the checks. I confirmed that my account
is in the .asf.yaml.

On Tue, Jan 14, 2020 at 9:48 PM Ahmet Altay  wrote:
>
> Done.
>
> +Kenneth Knowles, any updates from INFRA on this?
>
> On Tue, Jan 14, 2020 at 6:43 PM Tomo Suzuki  wrote:
>>
>> It hit Dataflow quota error again. Can somebody run
>> Run Dataflow ValidatesRunner
>> for https://github.com/apache/beam/pull/10554 ?
>>
>> On Tue, Jan 14, 2020 at 12:14 PM Tomo Suzuki  wrote:
>> >
>> > Valentyn, thank you.
>> >
>> > On Tue, Jan 14, 2020 at 12:05 PM Valentyn Tymofieiev
>> >  wrote:
>> > >
>> > > Done. If tests still don't trigger, you could try to make a push to the 
>> > > branch to reset the test status.
>> > >
>> > > On Tue, Jan 14, 2020 at 8:38 AM Tomo Suzuki  wrote:
>> > >>
>> > >> Hi Beam developers,
>> > >>
>> > >> Can somebody run the following to 
>> > >> https://github.com/apache/beam/pull/10554 ?
>> > >> Run Dataflow ValidatesRunner
>> > >> Run Java PreCommit
>> > >>
>> > >> On Mon, Jan 13, 2020 at 2:35 PM Tomo Suzuki  wrote:
>> > >> >
>> > >> > Thank you, Mark and Ismaël.
>> > >> >
>> > >> > On Mon, Jan 13, 2020 at 2:34 PM Mark Liu  wrote:
>> > >> > >
>> > >> > > done
>> > >> > >
>> > >> > > On Mon, Jan 13, 2020 at 8:03 AM Tomo Suzuki  
>> > >> > > wrote:
>> > >> > >>
>> > >> > >> Thanks Yifan (but Java Precommit is still missing).
>> > >> > >> Can somebody run "Run Java PreCommit" on
>> > >> > >> https://github.com/apache/beam/pull/10554?
>> > >> > >>
>> > >> > >>
>> > >> > >> On Mon, Jan 13, 2020 at 2:59 AM Yifan Zou  
>> > >> > >> wrote:
>> > >> > >> >
>> > >> > >> > done.
>> > >> > >> >
>> > >> > >> > On Sun, Jan 12, 2020 at 6:27 PM Tomo Suzuki  
>> > >> > >> > wrote:
>> > >> > >> >>
>> > >> > >> >> Hi Beam committers,
>> > >> > >> >>
>> > >> > >> >> Four Jenkins jobs did not report back for this PR
>> > >> > >> >> https://github.com/apache/beam/pull/10554 .
>> > >> > >> >> Can somebody trigger them?
>> > >> > >> >>
>> > >> > >> >> On Fri, Jan 10, 2020 at 4:51 PM Andrew Pilloud 
>> > >> > >> >>  wrote:
>> > >> > >> >> >
>> > >> > >> >> > Done.
>> > >> > >> >> >
>> > >> > >> >> > On Fri, Jan 10, 2020 at 12:59 PM Tomo Suzuki 
>> > >> > >> >> >  wrote:
>> > >> > >> >> >>
>> > >> > >> >> >> Hi Beam developers,
>> > >> > >> >> >>
>> > >> > >> >> >> I'd appreciate it if a committer could trigger the precommit build for
>> > >> > >> >> >> https://github.com/apache/beam/pull/10554.
>> > >> > >> >> >>
>> > >> > >> >> >> In addition to the normal precommit checks, I'd like the
>> > >> > >> >> >> following:
>> > >> > >> >> >> Run Java PostCommit
>> > >> > >> >> >> Run Java HadoopFormatIO Performance Test
>> > >> > >> >> >> Run BigQueryIO Streaming Performance Test Java
>> > >> > >> >> >> Run Dataflow ValidatesRunner
>> > >> > >> >> >> Run Spark ValidatesRunner
>> > >> > >> >> >> Run SQL Postcommit
>> > >> > >> >> >>
>> > >> > >> >> >> Regards,
>> > >> > >> >> >> Tomo
>> > >> > >> >>
>> > >> > >> >>
>> > >> > >> >>
>> > >> > >> >> --
>> > >> > >> >> Regards,
>> > >> > >> >> Tomo
>> > >> > >>
>> > >> > >>
>> > >> > >>
>> > >> > >> --
>> > >> > >> Regards,
>> > >> > >> Tomo
>> > >> >
>> > >> >
>> > >> >
>> > >> > --
>> > >> > Regards,
>> > >> > Tomo
>> > >>
>> > >>
>> > >>
>> > >> --
>> > >> Regards,
>> > >> Tomo
>> >
>> >
>> >
>> > --
>> > Regards,
>> > Tomo
>>
>>
>>
>> --
>> Regards,
>> Tomo



-- 
Regards,
Tomo


Re: [PROPOSAL] Transition released containers to the official ASF dockerhub organization

2020-01-15 Thread Ahmet Altay
+1 - Thank you for driving this!

On Wed, Jan 15, 2020 at 1:55 PM Thomas Weise  wrote:

> +1 for the namespace proposal.
>
> It is similar to github repos. Top-level is the org, then single level for
> repo (beam-abc, beam-xzy, ..)
>
>
>
> On Wed, Jan 15, 2020 at 1:45 PM Robert Bradshaw 
> wrote:
>
>> Various tags of the same image should at least logically be the same
>> thing, so I agree that we should not be trying to share a single
>> repository in that way. A full suite of apache/beam-{image_desc}
>> repositories, if apache is fine with that, seems like the best
>> approach.
>>
>> On Wed, Jan 15, 2020 at 1:32 PM Kyle Weaver  wrote:
>> >
>> > +1, agree that moving current image name to tags is a non-starter.
>> Thanks for driving this Hannah. Let us know what they say about repo
>> creation.
>> >
>> > On Wed, Jan 15, 2020 at 1:16 PM Udi Meiri  wrote:
>> >>
>> >> SG +1
>> >>
>> >> On Wed, Jan 15, 2020 at 12:59 PM Hannah Jiang 
>> wrote:
>> >>>
>> >>> I have done some research about images released under apache
>> namespace at docker hub, and here is my proposal.
>> >>>
>> >>> Currently, we are using apachebeam as our namespace and each image
>> has its own repository. Version number is used to tag the images.
>> >>> ie: apachebeam/python2.7_sdk:2.19.0,
>> apachebeam/flink1.9_job_server:2.19.0
>> >>>
>> >>> Now we are migrating to apache namespace and docker hub doesn't
>> support nested repository names, so we cannot use
>> apache/beam/{image-desc}:{version}.
>> >>> Instead, I propose to use apache/beam-{image_desc}:{version} as our
>> repository name.
>> >>> ie: apache/beam-python2.7_sdk:2.19.0,
>> apache/beam-flink1.9_job_server:2.19.0
>> >>> => When a user searches for apache/beam at docker hub, it will list
>> all the repositories we deployed with apache/beam-, so no concerns that
>> some released images are missed by users.
>> >>> => Repository names give users insight into which repositories they
>> should use.
>> >>> => A downside of this approach is that we need to create a new
>> repository whenever we release a new image; the time and effort needed for
>> this is still pending, as I am contacting the Apache Docker Hub management
>> team.
>> >>>
>> >>> I have considered using beam as repository name and moving image name
>> and version to tags, (ie: apache/beam:python3.7_sdk_2.19.0), which means
>> put all images to a single repository, however, this approach has some
>> downsides.
>> >>> => When a user searches for apache/beam, only one repository is
>> returned. Users need to use tags to identify which images they should use.
>> Since we release images with new tags for each version, it will overwhelm
>> the users and give them an impression that the images are not organized
>> well. It's also difficult to know what kind of images we deployed.
>> >>> => With both image name and version included at tags, it is a little
>> bit more complicated to maintain the code.
>> >>> => There is no correct answer which image the latest tag should point
>> to.
>> >>>
>> >>> Are there any concerns with this proposal?
>> >>>
>> >>> Thanks,
>> >>> Hannah
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Fri, Jan 10, 2020 at 4:19 PM Ahmet Altay  wrote:
>> 
>> 
>> 
>>  On Fri, Jan 10, 2020 at 3:33 PM Ahmet Altay 
>> wrote:
>> >
>> >
>> >
>> > On Fri, Jan 10, 2020 at 3:32 PM Ankur Goenka 
>> wrote:
>> >>
>> >> Also curious to know if Apache provides any infra support for
>> projects under the Apache umbrella, and any quota limits they might have.
>> 
>> 
>>  Maybe Hannah can ask with an infra ticket?
>> 
>> >>
>> >>
>> >> On Fri, Jan 10, 2020, 2:26 PM Robert Bradshaw 
>> wrote:
>> >>>
>> >>> One downside is that, unlike many of these projects, we release a
>> >>> dozen or so containers. Is there exactly (and only) one level of
>> >>> namespacing/nesting we can leverage here? (This isn't a blocker,
>> but
>> >>> something to consider.)
>> >
>> >
>> > After a quick search, I could not find a way to use more than one
>> level of repositories. We can use the naming scheme we currently use to
>> help with. Our repositories are named as apachebeam/X, we could start using
>> apache/beam/X.
>> >
>> >>>
>> >>>
>> >>> On Fri, Jan 10, 2020 at 2:06 PM Hannah Jiang <
>> hannahji...@google.com> wrote:
>> >>> >
>> >>> > Thanks Ahmet for proposing it.
>> >>> > I will take it and work towards v2.19.
>> 
>> 
>>  Missed this part. Thank you Hannah!
>> 
>> >>>
>> >>> >
>> >>> > Hannah
>> >>> >
>> >>> > On Fri, Jan 10, 2020 at 1:50 PM Kyle Weaver <
>> kcwea...@google.com> wrote:
>> >>> >>
>> >>> >> It'd be nice to have the clout/official sheen of apache
>> attached to our containers. Although getting the required permissions might
>> add some small overhead to the release process. For example, yesterday,
>> when we needed to create new repositories (not just update existing ones),
>> since we have 

Re: [PROPOSAL] Transition released containers to the official ASF dockerhub organization

2020-01-15 Thread Thomas Weise
+1 for the namespace proposal.

It is similar to github repos. Top-level is the org, then single level for
repo (beam-abc, beam-xzy, ..)



On Wed, Jan 15, 2020 at 1:45 PM Robert Bradshaw  wrote:

> Various tags of the same image should at least logically be the same
> thing, so I agree that we should not be trying to share a single
> repository in that way. A full suite of apache/beam-{image_desc}
> repositories, if apache is fine with that, seems like the best
> approach.
>
> On Wed, Jan 15, 2020 at 1:32 PM Kyle Weaver  wrote:
> >
> > +1, agree that moving current image name to tags is a non-starter.
> Thanks for driving this Hannah. Let us know what they say about repo
> creation.
> >
> > On Wed, Jan 15, 2020 at 1:16 PM Udi Meiri  wrote:
> >>
> >> SG +1
> >>
> >> On Wed, Jan 15, 2020 at 12:59 PM Hannah Jiang 
> wrote:
> >>>
> >>> I have done some research about images released under apache namespace
> at docker hub, and here is my proposal.
> >>>
> >>> Currently, we are using apachebeam as our namespace and each image has
> its own repository. Version number is used to tag the images.
> >>> ie: apachebeam/python2.7_sdk:2.19.0,
> apachebeam/flink1.9_job_server:2.19.0
> >>>
> >>> Now we are migrating to apache namespace and docker hub doesn't
> support nested repository names, so we cannot use
> apache/beam/{image-desc}:{version}.
> >>> Instead, I propose to use apache/beam-{image_desc}:{version} as our
> repository name.
> >>> ie: apache/beam-python2.7_sdk:2.19.0,
> apache/beam-flink1.9_job_server:2.19.0
> >>> => When a user searches for apache/beam at docker hub, it will list
> all the repositories we deployed with apache/beam-, so no concerns that
> some released images are missed by users.
> >>> => Repository names give users insight into which repositories they
> should use.
> >>> => A downside of this approach is that we need to create a new repository
> whenever we release a new image; the time and effort needed for this is
> still pending, as I am contacting the Apache Docker Hub management team.
> >>>
> >>> I have considered using beam as repository name and moving image name
> and version to tags, (ie: apache/beam:python3.7_sdk_2.19.0), which means
> put all images to a single repository, however, this approach has some
> downsides.
> >>> => When a user searches for apache/beam, only one repository is
> returned. Users need to use tags to identify which images they should use.
> Since we release images with new tags for each version, it will overwhelm
> the users and give them an impression that the images are not organized
> well. It's also difficult to know what kind of images we deployed.
> >>> => With both image name and version included at tags, it is a little
> bit more complicated to maintain the code.
> >>> => There is no correct answer which image the latest tag should point
> to.
> >>>
> >>> Are there any concerns with this proposal?
> >>>
> >>> Thanks,
> >>> Hannah
> >>>
> >>>
> >>>
> >>>
> >>> On Fri, Jan 10, 2020 at 4:19 PM Ahmet Altay  wrote:
> 
> 
> 
>  On Fri, Jan 10, 2020 at 3:33 PM Ahmet Altay  wrote:
> >
> >
> >
> > On Fri, Jan 10, 2020 at 3:32 PM Ankur Goenka 
> wrote:
> >>
> >> Also curious to know if Apache provides any infra support for
> projects under the Apache umbrella, and any quota limits they might have.
> 
> 
>  Maybe Hannah can ask with an infra ticket?
> 
> >>
> >>
> >> On Fri, Jan 10, 2020, 2:26 PM Robert Bradshaw 
> wrote:
> >>>
> >>> One downside is that, unlike many of these projects, we release a
> >>> dozen or so containers. Is there exactly (and only) one level of
> >>> namespacing/nesting we can leverage here? (This isn't a blocker,
> but
> >>> something to consider.)
> >
> >
> > After a quick search, I could not find a way to use more than one
> level of repositories. We can use the naming scheme we currently use to
> help with. Our repositories are named as apachebeam/X, we could start using
> apache/beam/X.
> >
> >>>
> >>>
> >>> On Fri, Jan 10, 2020 at 2:06 PM Hannah Jiang <
> hannahji...@google.com> wrote:
> >>> >
> >>> > Thanks Ahmet for proposing it.
> >>> > I will take it and work towards v2.19.
> 
> 
>  Missed this part. Thank you Hannah!
> 
> >>>
> >>> >
> >>> > Hannah
> >>> >
> >>> > On Fri, Jan 10, 2020 at 1:50 PM Kyle Weaver 
> wrote:
> >>> >>
> >>> >> It'd be nice to have the clout/official sheen of apache
> attached to our containers. Although getting the required permissions might
> add some small overhead to the release process. For example, yesterday,
> when we needed to create new repositories (not just update existing ones),
> since we have top-level ownership of the apachebeam organization, it was
> quick and easy to add them. I imagine we'd have had to get approval from
> someone outside the project to do that under the apache org. But this won't
> need to happen very often, so it's 

Re: [BEAM-9015] Adding pyXX-cloud instead of pyXX-gcp and pyXX-aws

2020-01-15 Thread Kyle Weaver
Just now seeing this -- are we sure we want to mix the signal from what are
logically two totally separate test suites?
Or from the opposite perspective, what's the motivation for wanting one
test suite instead of two?
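
For reference, the merge discussed in this thread might look roughly like
the tox.ini sketch below. The environment name, extras, and test command
here are assumptions for illustration only, not Beam's actual configuration:

```ini
# Hypothetical merged environment replacing py37-gcp and py37-aws.
[testenv:py37-cloud]
extras = gcp,aws
commands = pytest apache_beam/io/gcp apache_beam/io/aws
```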

On Tue, Jan 14, 2020 at 3:25 PM Pablo Estrada  wrote:

> now back from the holidays, I intend to do this - one of these days.
>
> On Thu, Dec 26, 2019 at 12:51 PM Udi Meiri  wrote:
>
>> +1
>>
>> On Mon, Dec 23, 2019, 17:28 Robert Bradshaw  wrote:
>>
>>> Makes sense to me.
>>>
>>> On Mon, Dec 23, 2019 at 3:33 PM Pablo Estrada 
>>> wrote:
>>> >
>>> > Hi all,
>>> > a couple of contributors [1][2] have been kind enough to add support
>>> for s3 filesystem[3] for the Python SDK. Part of this involved adding a tox
>>> task called py37-aws, to install the relevant dependencies and run unit
>>> tests for it (in a mocked-out environment).
>>> >
>>> > To avoid running a full extra test suite, I thought we could add the
>>> new aws-related dependencies to the current pyXX-gcp suites, and perhaps
>>> rename to pyXX-cloud, to include all unit tests that require cloud-specific
>>> dependencies. What do others think?
>>> >
>>> > This is tracked here: https://jira.apache.org/jira/browse/BEAM-9015
>>> >
>>> > [1] https://github.com/tamera-lanham
>>> > [2] https://github.com/MattMorgis
>>> > [3] https://github.com/apache/beam/pull/9955
>>>
>>


Re: [PROPOSAL] Transition released containers to the official ASF dockerhub organization

2020-01-15 Thread Robert Bradshaw
Various tags of the same image should at least logically be the same
thing, so I agree that we should not be trying to share a single
repository in that way. A full suite of apache/beam-{image_desc}
repositories, if apache is fine with that, seems like the best
approach.
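
As a quick sketch of the flat naming scheme (image names and version are
taken from Hannah's examples; the helper itself is illustrative only, not
part of any release tooling):

```python
def beam_repo(image_desc, version, namespace="apache"):
    # Docker Hub allows only a single namespace/repository level, so the
    # project is folded into the repository name as "beam-{image_desc}".
    return f"{namespace}/beam-{image_desc}:{version}"

print(beam_repo("python2.7_sdk", "2.19.0"))
# apache/beam-python2.7_sdk:2.19.0
print(beam_repo("flink1.9_job_server", "2.19.0"))
# apache/beam-flink1.9_job_server:2.19.0
```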

On Wed, Jan 15, 2020 at 1:32 PM Kyle Weaver  wrote:
>
> +1, agree that moving current image name to tags is a non-starter. Thanks for 
> driving this Hannah. Let us know what they say about repo creation.
>
> On Wed, Jan 15, 2020 at 1:16 PM Udi Meiri  wrote:
>>
>> SG +1
>>
>> On Wed, Jan 15, 2020 at 12:59 PM Hannah Jiang  wrote:
>>>
>>> I have done some research about images released under apache namespace at 
>>> docker hub, and here is my proposal.
>>>
>>> Currently, we are using apachebeam as our namespace and each image has its 
>>> own repository. Version number is used to tag the images.
>>> ie: apachebeam/python2.7_sdk:2.19.0, apachebeam/flink1.9_job_server:2.19.0
>>>
>>> Now we are migrating to apache namespace and docker hub doesn't support 
>>> nested repository names, so we cannot use 
>>> apache/beam/{image-desc}:{version}.
>>> Instead, I propose to use apache/beam-{image_desc}:{version} as our 
>>> repository name.
>>> ie: apache/beam-python2.7_sdk:2.19.0, apache/beam-flink1.9_job_server:2.19.0
>>> => When a user searches for apache/beam at docker hub, it will list all the 
>>> repositories we deployed with apache/beam-, so no concerns that some 
>>> released images are missed by users.
>>> => Repository names give users insight into which repositories they
>>> should use.
>>> => A downside of this approach is that we need to create a new repository
>>> whenever we release a new image; the time and effort needed for this is
>>> still pending, as I am contacting the Apache Docker Hub management team.
>>>
>>> I have considered using beam as repository name and moving image name and 
>>> version to tags, (ie: apache/beam:python3.7_sdk_2.19.0), which means put 
>>> all images to a single repository, however, this approach has some 
>>> downsides.
>>> => When a user searches for apache/beam, only one repository is returned. 
>>> Users need to use tags to identify which images they should use. Since we 
>>> release images with new tags for each version, it will overwhelm the users 
>>> and give them an impression that the images are not organized well. It's 
>>> also difficult to know what kind of images we deployed.
>>> => With both image name and version included at tags, it is a little bit 
>>> more complicated to maintain the code.
>>> => There is no correct answer which image the latest tag should point to.
>>>
>>> Are there any concerns with this proposal?
>>>
>>> Thanks,
>>> Hannah
>>>
>>>
>>>
>>>
>>> On Fri, Jan 10, 2020 at 4:19 PM Ahmet Altay  wrote:



 On Fri, Jan 10, 2020 at 3:33 PM Ahmet Altay  wrote:
>
>
>
> On Fri, Jan 10, 2020 at 3:32 PM Ankur Goenka  wrote:
>>
>> Also curious to know if Apache provides any infra support for projects
>> under the Apache umbrella and what quota limits they might have.


 Maybe Hannah can ask via an infra ticket?

>>
>>
>> On Fri, Jan 10, 2020, 2:26 PM Robert Bradshaw  
>> wrote:
>>>
>>> One downside is that, unlike many of these projects, we release a
>>> dozen or so containers. Is there exactly (and only) one level of
>>> namespacing/nesting we can leverage here? (This isn't a blocker, but
>>> something to consider.)
>
>
> After a quick search, I could not find a way to use more than one level
> of repositories. We can use the naming scheme we currently use to help
> with this. Our repositories are named apachebeam/X; we could start using
> apache/beam/X.
>
>>>
>>>
>>> On Fri, Jan 10, 2020 at 2:06 PM Hannah Jiang  
>>> wrote:
>>> >
>>> > Thanks Ahmet for proposing it.
>>> > I will take it and work towards v2.19.


 Missed this part. Thank you Hannah!

>>>
>>> >
>>> > Hannah
>>> >
>>> > On Fri, Jan 10, 2020 at 1:50 PM Kyle Weaver  
>>> > wrote:
>>> >>
>>> >> It'd be nice to have the clout/official sheen of apache attached to 
>>> >> our containers. Although getting the required permissions might add 
>>> >> some small overhead to the release process. For example, yesterday, 
>>> >> when we needed to create new repositories (not just update existing 
>>> >> ones), since we have top-level ownership of the apachebeam 
>>> >> organization, it was quick and easy to add them. I imagine we'd have 
>>> >> had to get approval from someone outside the project to do that 
>>> >> under the apache org. But this won't need to happen very often, so 
>>> >> it's probably not that big a deal.
>>> >>
>>> >> On Fri, Jan 10, 2020 at 1:40 PM Ahmet Altay  wrote:
>>> >>>
>>> >>> Hi all,
>>> >>>
>>> >>> I saw recent progress on the containers and wanted to bring this 

Re: Apache community contact point

2020-01-15 Thread Hannah Jiang
Thanks Andrew, I will try with Jira.

On Wed, Jan 15, 2020 at 1:13 PM Andrew Pilloud  wrote:

> I'm not sure you have the right contact point. Have you tried filing a
> JIRA ticket with the INFRA project and Docker component? JIRA is
> usually the best way to get changes made to Apache infrastructure.
>
> Andrew
>
> On Wed, Jan 15, 2020 at 1:03 PM Hannah Jiang 
> wrote:
>
>> I am trying to contact the Apache community about deploying Beam images to
>> their organization at Docker Hub. I wrote an email to
>> *d...@community.apache.org
>> * and it has been almost 48 hours, but I haven't
>> received any response.
>>
>> To the people who have experience working with them, is this the correct
>> contact point? Is there any advice I can follow?
>>
>> Thanks,
>> Hannah
>>
>


Re: [PROPOSAL] Transition released containers to the official ASF dockerhub organization

2020-01-15 Thread Kyle Weaver
+1, agree that moving current image name to tags is a non-starter. Thanks
for driving this Hannah. Let us know what they say about repo creation.

On Wed, Jan 15, 2020 at 1:16 PM Udi Meiri  wrote:

> SG +1
>
> On Wed, Jan 15, 2020 at 12:59 PM Hannah Jiang 
> wrote:
>
>> I have done some research about images released under apache namespace at
>> docker hub, and here is my proposal.
>>
>> Currently, we are using apachebeam as our namespace and each image has
>> its own repository. The version number is used to tag the images.
>> ie: apachebeam/python2.7_sdk:2.19.0, apachebeam/flink1.9_job_server:2.19.0
>>
>> Now we are migrating to the apache namespace, and Docker Hub doesn't support
>> nested repository names, so we cannot use
>> apache/beam/{image-desc}:{version}.
>> Instead, I propose to use *apache/beam-{image_desc}:{version}* as our
>> repository name.
>> ie: apache/beam-python2.7_sdk:2.19.0,
>> apache/beam-flink1.9_job_server:2.19.0
>> => When a user searches for *apache/beam* at Docker Hub, it will list
>> all the repositories we deployed with apache/beam-, so there is no concern
>> that users will miss some released images.
>> => Repository names give users insight into which repositories they
>> should use.
>> => A downside of this approach is that we need to create a new repository
>> whenever we release a new image; the time and effort needed for this is
>> still unknown. I am contacting the Apache Docker Hub management team.
>>
>> I have also considered using beam as the repository name and moving the
>> image name and version into tags (ie: apache/beam:python3.7_sdk_2.19.0),
>> which means putting all images into a single repository; however, this
>> approach has some downsides.
>> => When a user searches for apache/beam, only one repository is returned.
>> Users need to use tags to identify which images they should use. Since we
>> release images with new tags for each version, it will overwhelm users and
>> give them the impression that the images are not well organized. It's also
>> difficult to know what kinds of images we have deployed.
>> => With both the image name and version included in tags, the code is a
>> little more complicated to maintain.
>> => There is no correct answer as to which image the latest tag should point to.
>>
>> Are there any concerns with this proposal?
>>
>> Thanks,
>> Hannah
>>
>>
>>
>>
>> On Fri, Jan 10, 2020 at 4:19 PM Ahmet Altay  wrote:
>>
>>>
>>>
>>> On Fri, Jan 10, 2020 at 3:33 PM Ahmet Altay  wrote:
>>>


 On Fri, Jan 10, 2020 at 3:32 PM Ankur Goenka  wrote:

> Also curious to know if Apache provides any infra support for projects
> under the Apache umbrella and what quota limits they might have.
>

>>> Maybe Hannah can ask via an infra ticket?
>>>
>>>

> On Fri, Jan 10, 2020, 2:26 PM Robert Bradshaw 
> wrote:
>
>> One downside is that, unlike many of these projects, we release a
>> dozen or so containers. Is there exactly (and only) one level of
>> namespacing/nesting we can leverage here? (This isn't a blocker, but
>> something to consider.)
>>
>
 After a quick search, I could not find a way to use more than one level
 of repositories. We can use the naming scheme we currently use to help
 with this. Our repositories are named apachebeam/X; we could start using
 apache/beam/X.


>
>> On Fri, Jan 10, 2020 at 2:06 PM Hannah Jiang 
>> wrote:
>> >
>> > Thanks Ahmet for proposing it.
>> > I will take it and work towards v2.19.
>>
>
>>> Missed this part. Thank you Hannah!
>>>
>>>
 >
>> > Hannah
>> >
>> > On Fri, Jan 10, 2020 at 1:50 PM Kyle Weaver 
>> wrote:
>> >>
>> >> It'd be nice to have the clout/official sheen of apache attached
>> to our containers. Although getting the required permissions might add 
>> some
>> small overhead to the release process. For example, yesterday, when we
>> needed to create new repositories (not just update existing ones), since 
>> we
>> have top-level ownership of the apachebeam organization, it was quick and
>> easy to add them. I imagine we'd have had to get approval from someone
>> outside the project to do that under the apache org. But this won't need 
>> to
>> happen very often, so it's probably not that big a deal.
>> >>
>> >> On Fri, Jan 10, 2020 at 1:40 PM Ahmet Altay 
>> wrote:
>> >>>
>> >>> Hi all,
>> >>>
>> >>> I saw recent progress on the containers and wanted to bring this
>> question to the attention of the dev list.
>> >>>
>> >>> Would it be possible to use the official ASF dockerhub
>> organization for new Beam container releases? Concretely, starting from
>> 2.19 could we release Beam containers to
>> https://hub.docker.com/u/apache instead of
>> https://hub.docker.com/u/apachebeam ?
>> >>>
>> >>> Ahmet
>>
>


Re: Ordering of element timestamp change and window function

2020-01-15 Thread Luke Cwik
Window assignment happens at the point in the pipeline where the WindowInto
transform was applied, so in this case the window would have been assigned
using the original timestamp.

Grouping is by key and window.
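
As a simplified, hypothetical model of the behavior described above (this is
not Beam's actual implementation, just the arithmetic of fixed windows): the
window an element lands in is a pure function of whatever timestamp the
element carries at the moment WindowInto is applied, so changing the
timestamp afterwards does not move it to another window.

```python
# Hypothetical sketch of fixed-window assignment. Beam's real logic lives in
# apache_beam.transforms.window.FixedWindows; this only models the arithmetic.
def assign_fixed_window(timestamp, size):
    """Return the [start, end) interval of the fixed window containing timestamp."""
    start = timestamp - (timestamp % size)
    return (start, start + size)

# An element carrying timestamp 7 when WindowInto(FixedWindows(5)) runs is
# assigned to window [5, 10); re-stamping it later does not reassign it.
print(assign_fixed_window(7, 5))  # (5, 10)
```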

On Tue, Jan 14, 2020 at 7:30 PM Ankur Goenka  wrote:

> Hi,
>
> I am not sure what effect the order of element timestamp change and
> window association has on a GroupByKey.
> More specifically, what would the behavior be if we apply window -> change
> element timestamp -> GroupByKey?
> I think we should always apply the window function after changing the
> timestamp of elements, though this is neither checked nor a recommended
> practice in Beam.
>
> An example pipeline would look like this:
>
>   def applyTimestamp(value):
>     return window.TimestampedValue(('key', value), int(time.time()))
>
>   # Timestamp is changed after windowing and before GBK
>   p \
>       | 'Create' >> beam.Create(range(0, 10)) \
>       | 'Fixed Window' >> beam.WindowInto(window.FixedWindows(5)) \
>       | 'Apply Timestamp' >> beam.Map(applyTimestamp) \
>       | 'Group By Key' >> beam.GroupByKey() \
>       | 'Print' >> beam.Map(print)
>
> Thanks,
> Ankur
>


Re: GSOC announced!

2020-01-15 Thread Rui Wang
Hi Xinbin,

I assume you want to contribute to Apache Beam while you are less
experienced, and thus want to seek some mentorship?

This topic was discussed before. I don't think we decided to build a formal
mentorship program for Beam. Instead, would you share your interests first,
and then we could ask if there are people who know the topic and can
actually mentor?


-Rui

On Wed, Jan 15, 2020 at 9:30 AM Xinbin Huang  wrote:

> Hi community,
>
> I am pretty new to the Apache Beam community and want to contribute to the
> project. I think GSoC is a great opportunity for people to learn and
> contribute, but I am not eligible for it because I am not a student. That
> being said, would there be opportunities for non-students to participate
> in this, or other opportunities suitable for less experienced people who
> want to contribute?
>
> Thanks!
> Bin
>
> On Wed, Jan 15, 2020 at 8:52 AM Ismaël Mejía  wrote:
>
>> Thanks for bringing this info. +1 on the Nexmark + Python + Portability
>> project.
>> Let's sync on that one Pablo. I am interested on co-mentoring it.
>>
>>
>> On Tue, Jan 14, 2020 at 7:55 PM Rui Wang  wrote:
>>
>>> Great! I will try to propose something for BeamSQL.
>>>
>>>
>>> -Rui
>>>
>>> On Tue, Jan 14, 2020 at 10:40 AM Pablo Estrada 
>>> wrote:
>>>
 Hello everyone,

 As with every year, the Google Summer of Code has been announced[1], so
 we can start preparing for it if anyone is interested. It's early in the
 process for now, but it's good to prepare early : )

 Here are the ASF mentor guidelines[2]. For now, the thing to do is to
 file JIRA issues for your projects, and apply the labels "mentor", "gsoc",
 "gsoc2020".

 When the time comes, the next steps are to join the
 ment...@community.apache.org list, and request the PMC for approval of
 a project.

 My current plan is to have these projects, though these are subject to
 change:
 - Build Nexmark pipelines for Python SDK (Ismael FYI)
 - Azure Blobstore File System for Java & Python

 I'll try to keep the dev@ list updated with other steps of the process.
 Thanks!
 -P.

 [1] https://summerofcode.withgoogle.com/
 [2]
 https://community.apache.org/gsoc.html#prospective-asf-mentors-read-this

>>>


Re: [PROPOSAL] Transition released containers to the official ASF dockerhub organization

2020-01-15 Thread Udi Meiri
SG +1

On Wed, Jan 15, 2020 at 12:59 PM Hannah Jiang 
wrote:

> I have done some research about images released under apache namespace at
> docker hub, and here is my proposal.
>
> Currently, we are using apachebeam as our namespace and each image has its
> own repository. The version number is used to tag the images.
> ie: apachebeam/python2.7_sdk:2.19.0, apachebeam/flink1.9_job_server:2.19.0
>
> Now we are migrating to the apache namespace, and Docker Hub doesn't support
> nested repository names, so we cannot use
> apache/beam/{image-desc}:{version}.
> Instead, I propose to use *apache/beam-{image_desc}:{version}* as our
> repository name.
> ie: apache/beam-python2.7_sdk:2.19.0,
> apache/beam-flink1.9_job_server:2.19.0
> => When a user searches for *apache/beam* at Docker Hub, it will list all
> the repositories we deployed with apache/beam-, so there is no concern that
> users will miss some released images.
> => Repository names give users insight into which repositories they
> should use.
> => A downside of this approach is that we need to create a new repository
> whenever we release a new image; the time and effort needed for this is
> still unknown. I am contacting the Apache Docker Hub management team.
>
> I have also considered using beam as the repository name and moving the
> image name and version into tags (ie: apache/beam:python3.7_sdk_2.19.0),
> which means putting all images into a single repository; however, this
> approach has some downsides.
> => When a user searches for apache/beam, only one repository is returned.
> Users need to use tags to identify which images they should use. Since we
> release images with new tags for each version, it will overwhelm users and
> give them the impression that the images are not well organized. It's also
> difficult to know what kinds of images we have deployed.
> => With both the image name and version included in tags, the code is a
> little more complicated to maintain.
> => There is no correct answer as to which image the latest tag should point to.
>
> Are there any concerns with this proposal?
>
> Thanks,
> Hannah
>
>
>
>
> On Fri, Jan 10, 2020 at 4:19 PM Ahmet Altay  wrote:
>
>>
>>
>> On Fri, Jan 10, 2020 at 3:33 PM Ahmet Altay  wrote:
>>
>>>
>>>
>>> On Fri, Jan 10, 2020 at 3:32 PM Ankur Goenka  wrote:
>>>
 Also curious to know if Apache provides any infra support for projects
 under the Apache umbrella and what quota limits they might have.

>>>
>> Maybe Hannah can ask via an infra ticket?
>>
>>
>>>
 On Fri, Jan 10, 2020, 2:26 PM Robert Bradshaw 
 wrote:

> One downside is that, unlike many of these projects, we release a
> dozen or so containers. Is there exactly (and only) one level of
> namespacing/nesting we can leverage here? (This isn't a blocker, but
> something to consider.)
>

>>> After a quick search, I could not find a way to use more than one level
>>> of repositories. We can use the naming scheme we currently use to help
>>> with this. Our repositories are named apachebeam/X; we could start using
>>> apache/beam/X.
>>>
>>>

> On Fri, Jan 10, 2020 at 2:06 PM Hannah Jiang 
> wrote:
> >
> > Thanks Ahmet for proposing it.
> > I will take it and work towards v2.19.
>

>> Missed this part. Thank you Hannah!
>>
>>
>>> >
> > Hannah
> >
> > On Fri, Jan 10, 2020 at 1:50 PM Kyle Weaver 
> wrote:
> >>
> >> It'd be nice to have the clout/official sheen of apache attached to
> our containers. Although getting the required permissions might add some
> small overhead to the release process. For example, yesterday, when we
> needed to create new repositories (not just update existing ones), since 
> we
> have top-level ownership of the apachebeam organization, it was quick and
> easy to add them. I imagine we'd have had to get approval from someone
> outside the project to do that under the apache org. But this won't need 
> to
> happen very often, so it's probably not that big a deal.
> >>
> >> On Fri, Jan 10, 2020 at 1:40 PM Ahmet Altay 
> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> I saw recent progress on the containers and wanted to bring this
> question to the attention of the dev list.
> >>>
> >>> Would it be possible to use the official ASF dockerhub
> organization for new Beam container releases? Concretely, starting from
> 2.19 could we release Beam containers to
> https://hub.docker.com/u/apache instead of
> https://hub.docker.com/u/apachebeam ?
> >>>
> >>> Ahmet
>





Re: Apache community contact point

2020-01-15 Thread Andrew Pilloud
I'm not sure you have the right contact point. Have you tried filing a JIRA
ticket with the INFRA project and Docker component? JIRA is usually the
best way to get changes made to Apache infrastructure.

Andrew

On Wed, Jan 15, 2020 at 1:03 PM Hannah Jiang  wrote:

> I am trying to contact the Apache community about deploying Beam images to
> their organization at Docker Hub. I wrote an email to *d...@community.apache.org
> * and it has been almost 48 hours, but I haven't
> received any response.
>
> To the people who have experience working with them, is this the correct
> contact point? Is there any advice I can follow?
>
> Thanks,
> Hannah
>


Apache community contact point

2020-01-15 Thread Hannah Jiang
I am trying to contact the Apache community about deploying Beam images to
their organization at Docker Hub. I wrote an email to *d...@community.apache.org
* and it has been almost 48 hours, but I haven't
received any response.

To the people who have experience working with them, is this the correct
contact point? Is there any advice I can follow?

Thanks,
Hannah


Re: [PROPOSAL] Transition released containers to the official ASF dockerhub organization

2020-01-15 Thread Hannah Jiang
I have done some research about images released under apache namespace at
docker hub, and here is my proposal.

Currently, we are using apachebeam as our namespace and each image has its
own repository. The version number is used to tag the images.
ie: apachebeam/python2.7_sdk:2.19.0, apachebeam/flink1.9_job_server:2.19.0

Now we are migrating to the apache namespace, and Docker Hub doesn't support
nested repository names, so we cannot use
apache/beam/{image-desc}:{version}.
Instead, I propose to use *apache/beam-{image_desc}:{version}* as our
repository name.
ie: apache/beam-python2.7_sdk:2.19.0, apache/beam-flink1.9_job_server:2.19.0
=> When a user searches for *apache/beam* at Docker Hub, it will list all
the repositories we deployed with apache/beam-, so there is no concern that
users will miss some released images.
=> Repository names give users insight into which repositories they
should use.
=> A downside of this approach is that we need to create a new repository
whenever we release a new image; the time and effort needed for this is
still unknown. I am contacting the Apache Docker Hub management team.

I have also considered using beam as the repository name and moving the
image name and version into tags (ie: apache/beam:python3.7_sdk_2.19.0),
which means putting all images into a single repository; however, this
approach has some downsides.
=> When a user searches for apache/beam, only one repository is returned.
Users need to use tags to identify which images they should use. Since we
release images with new tags for each version, it will overwhelm users and
give them the impression that the images are not well organized. It's also
difficult to know what kinds of images we have deployed.
=> With both the image name and version included in tags, the code is a
little more complicated to maintain.
=> There is no correct answer as to which image the latest tag should point to.

Are there any concerns with this proposal?

Thanks,
Hannah
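
The naming scheme proposed above can be captured in a tiny helper
(illustrative only; the beam_repo function is hypothetical and not part of
any release tooling):

```python
def beam_repo(image_desc, version, namespace="apache"):
    # Docker Hub does not allow nested names such as apache/beam/python2.7_sdk,
    # so the image description is folded into the repository name itself.
    return f"{namespace}/beam-{image_desc}:{version}"

print(beam_repo("python2.7_sdk", "2.19.0"))        # apache/beam-python2.7_sdk:2.19.0
print(beam_repo("flink1.9_job_server", "2.19.0"))  # apache/beam-flink1.9_job_server:2.19.0
```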




On Fri, Jan 10, 2020 at 4:19 PM Ahmet Altay  wrote:

>
>
> On Fri, Jan 10, 2020 at 3:33 PM Ahmet Altay  wrote:
>
>>
>>
>> On Fri, Jan 10, 2020 at 3:32 PM Ankur Goenka  wrote:
>>
>>> Also curious to know if Apache provides any infra support for projects
>>> under the Apache umbrella and what quota limits they might have.
>>>
>>
> Maybe Hannah can ask via an infra ticket?
>
>
>>
>>> On Fri, Jan 10, 2020, 2:26 PM Robert Bradshaw 
>>> wrote:
>>>
 One downside is that, unlike many of these projects, we release a
 dozen or so containers. Is there exactly (and only) one level of
 namespacing/nesting we can leverage here? (This isn't a blocker, but
 something to consider.)

>>>
>> After a quick search, I could not find a way to use more than one level
>> of repositories. We can use the naming scheme we currently use to help
>> with this. Our repositories are named apachebeam/X; we could start using
>> apache/beam/X.
>>
>>
>>>
 On Fri, Jan 10, 2020 at 2:06 PM Hannah Jiang 
 wrote:
 >
 > Thanks Ahmet for proposing it.
 > I will take it and work towards v2.19.

>>>
> Missed this part. Thank you Hannah!
>
>
>> >
 > Hannah
 >
 > On Fri, Jan 10, 2020 at 1:50 PM Kyle Weaver 
 wrote:
 >>
 >> It'd be nice to have the clout/official sheen of apache attached to
 our containers. Although getting the required permissions might add some
 small overhead to the release process. For example, yesterday, when we
 needed to create new repositories (not just update existing ones), since we
 have top-level ownership of the apachebeam organization, it was quick and
 easy to add them. I imagine we'd have had to get approval from someone
 outside the project to do that under the apache org. But this won't need to
 happen very often, so it's probably not that big a deal.
 >>
 >> On Fri, Jan 10, 2020 at 1:40 PM Ahmet Altay 
 wrote:
 >>>
 >>> Hi all,
 >>>
 >>> I saw recent progress on the containers and wanted to bring this
 question to the attention of the dev list.
 >>>
 >>> Would it be possible to use the official ASF dockerhub organization
 for new Beam container releases? Concretely, starting from 2.19 could we
 release Beam containers to https://hub.docker.com/u/apache instead of
 https://hub.docker.com/u/apachebeam ?
 >>>
 >>> Ahmet

>>>


Re: [DISCUSS] BIP reloaded

2020-01-15 Thread Kenneth Knowles
Focusing this thread on the BIP process seems wise, without changing much
else in the same thread. I don't think the BIP process has to do with
exactly how design docs are written or archived, but the ability to *at a
glance* understand:

 - what are the high level proposals
 - status of the proposals
 - who to contact
 - how to get to more info (links to design docs, thread, Jiras, etc)

A page with a table on cwiki is common and seems good for this. How we
manage such a table would be a possible next step. I think BIPs should
focus on large changes that need a heavyweight process, so we should keep
their creation lightweight. I think adding heavy process to smaller changes
would be bad.



I have looked multiple times at other projects (linked in prior thread and
in this thread too but gathering them here)

Spark: https://spark.apache.org/improvement-proposals.html
 - Jira is not good for "at a glance" reading. Each title should be
accompanied by a short, easy-to-understand paragraph.

Kafka:
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
 - Quite a lot of content; I would prefer 10s of proposals. But it is
readable enough. The table lacks important content such as links and summaries.
 - Blends the table with a bunch of header material that IMO gets in the way

Flink:
https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
 - Looks very similar to Kafka
 - Target Release is too specific, and actual status is missing

Airflow:
https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals
 - seems best organized, and the table has more info
 - having sections for the different status proposals in different tables
is great
 - "InRelease" column is left blank

It seems there is a lot of redundancy with Jira fields - owner, release,
etc. I think that redundancy is good. If it is too much effort to manage
writing it redundantly in the table, then it probably is not appropriate
for the heavyweight process. Anything that is one simple task that fits in
a Jira and can be passed around from person to person shouldn't be a BIP.
Probably anything where we can guess the target version isn't big enough
for a BIP.

Kenn

On Thu, Jan 9, 2020 at 7:59 AM Jan Lukavský  wrote:

> I think that, besides ownership of a feature, a BIP (or whatever document
> or process) should contain the following:
>
>  * description of the problem that the improvement addresses - this is
> currently often part of the design doc
>
>  * description of multiple possible solutions (if multiple exist, which is
> probably mostly the case)
>
>  * justifying choice of a particular solution
>
>  * result of a vote - the vote should cover both (a) do we want this
> feature in the first place and (b) do we accept the proposed solution
>
> This would probably be iterative process involving multiple people,
> mailing list communication, etc. Pretty much what we do now, just there
> would be a place to keep track of decisions made throughout the process. I
> pretty much think that voting on complicated solutions is vital, the soft
> consensus approach is good for "simple" features (what that means might be
> subjective), but might fail for features where multiple more or less
> complex solutions exist. After successful PMC vote, the problem simplifies
> to reviewing code, the reviewer doesn't have to think about "do we want
> this feature?". That is given in advance. After we agree on the process and
> the form it should have I can volunteer to test it by letting proposal of
> ordered stateful processing pass through it.
> On 1/9/20 9:11 AM, Alex Van Boxel wrote:
>
> Maybe tweaking the current process a bit is enough. I like the Docs for
> having discussions, but they're no good as a *proper design document*, for
> the following reasons:
>
> I see design documents full of discussions and wonder:
>
>- Who will be the *main owner* and the *co-owners* (meaning people
>that are invested in bringing this forward and can *act* as *reviewers*)?
>I think a proposal needs especially this: ownership
>- Lack of visibility of the final state. Or is it superseded by another
>proposal? A final state could include the votes...
>- Does the proposal need amendments? For example, while implementing
>the proposal, we see that something in the design was lacking and needs to
>be added.
>
> So the Docs are great, but maybe we should add a few mandatory blocks and a
> few rules:
>
>- *Resolve all discussions* before switching to final state.
>- If new discussions pop up, maybe an amendment needs to be made (or a
>correction). Corrections could be added to a *changelog* at the beginning.
>- If a new proposal supersedes one, both should be linked
>- Most importantly: who can act as *owner* and reviewers for this
>proposal.
>
>
>
>  _/
> _/ Alex Van Boxel
>
>
> On Thu, Jan 9, 2020 at 7:59 AM Kenneth Knowles  wrote:
>
>> It does seem that the community would find this useful. I agree with
>> 

Re: GSOC announced!

2020-01-15 Thread Xinbin Huang
Hi community,

I am pretty new to the Apache Beam community and want to contribute to the
project. I think GSoC is a great opportunity for people to learn and
contribute, but I am not eligible for it because I am not a student. That
being said, would there be opportunities for non-students to participate in
this, or other opportunities suitable for less experienced people who want
to contribute?

Thanks!
Bin

On Wed, Jan 15, 2020 at 8:52 AM Ismaël Mejía  wrote:

> Thanks for bringing this info. +1 on the Nexmark + Python + Portability
> project.
> Let's sync on that one Pablo. I am interested on co-mentoring it.
>
>
> On Tue, Jan 14, 2020 at 7:55 PM Rui Wang  wrote:
>
>> Great! I will try to propose something for BeamSQL.
>>
>>
>> -Rui
>>
>> On Tue, Jan 14, 2020 at 10:40 AM Pablo Estrada 
>> wrote:
>>
>>> Hello everyone,
>>>
>>> As with every year, the Google Summer of Code has been announced[1], so
>>> we can start preparing for it if anyone is interested. It's early in the
>>> process for now, but it's good to prepare early : )
>>>
>>> Here are the ASF mentor guidelines[2]. For now, the thing to do is to
>>> file JIRA issues for your projects, and apply the labels "mentor", "gsoc",
>>> "gsoc2020".
>>>
>>> When the time comes, the next steps are to join the
>>> ment...@community.apache.org list, and request the PMC for approval of
>>> a project.
>>>
>>> My current plan is to have these projects, though these are subject to
>>> change:
>>> - Build Nexmark pipelines for Python SDK (Ismael FYI)
>>> - Azure Blobstore File System for Java & Python
>>>
>>> I'll try to keep the dev@ list updated with other steps of the process.
>>> Thanks!
>>> -P.
>>>
>>> [1] https://summerofcode.withgoogle.com/
>>> [2]
>>> https://community.apache.org/gsoc.html#prospective-asf-mentors-read-this
>>>
>>


Re: [DISCUSS] Integrate Google Cloud AI functionalities

2020-01-15 Thread Ahmet Altay
On Wed, Jan 15, 2020 at 8:12 AM Kamil Wasilewski <
kamil.wasilew...@polidea.com> wrote:

> Based on your feedback, I think it'd be fine to deal with the problem as
> follows:
> * for Python: put the transforms into `sdks/python/apache_beam/io/gcp/ai`
> * for Java: create a `google-cloud-platform-ai` module in
> `sdks/java/extensions` folder
>
> As for cross language, we expect those transforms to be quite simple, so
> the cost of implementing them twice is not that high.
>

One option would be to implement inference in a library like tfx_bsl [1].
It comes with a generalized Beam transform that can do inference either
from a saved model file or by using a service endpoint. The service
endpoint API option is there and could support cloud AI APIs. If we utilize
tfx_bsl, we will leverage the existing TFX integration and would avoid
creating a parallel set of transforms. Then for Java, we could enable the
same interface with cross language transform and offer a unified inference
API for both languages.

[1]
https://github.com/tensorflow/tfx-bsl/blob/a9f5b6128309595570cc6212f8076e7a20063ac2/tfx_bsl/beam/run_inference.py#L78
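
To make the "unified inference" idea above concrete, here is a purely
illustrative sketch of a single entry point that dispatches to either a
local saved model or a remote service endpoint. This is not tfx_bsl's
actual API; every name below (run_inference, the spec dictionary keys) is
invented for illustration.

```python
# Illustrative dispatch between two inference backends behind one interface.
def run_inference(batch, spec):
    if spec["kind"] == "saved_model":
        model_fn = spec["model_fn"]  # stand-in for a loaded saved model
        return [model_fn(example) for example in batch]
    if spec["kind"] == "remote_endpoint":
        client = spec["client"]      # stand-in for a Cloud AI API client call
        return [client(example) for example in batch]
    raise ValueError(f"unknown inference spec kind: {spec['kind']}")

# Both specs expose the same calling convention to the pipeline author:
local = {"kind": "saved_model", "model_fn": lambda x: x * 2}
print(run_inference([1, 2, 3], local))  # [2, 4, 6]
```

A cross-language wrapper could then expose this single interface to Java
pipelines as well, as suggested above.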



>
> Thanks for your input,
> Kamil
>
> On Wed, Jan 15, 2020 at 7:58 AM Alex Van Boxel  wrote:
>
>> If it's in Java, also be careful to align with the current Google Cloud
>> IOs, especially their dependencies. The Google IOs do not depend on the
>> newest client libraries, and that's something we sometimes struggle
>> with when we depend on our own client libraries. So make sure to align them.
>>
>> Also note that although gRPC is vendored, the Google IOs still have
>> their own dependency on gRPC, and this is the biggest source of trouble.
>>
>>  _/
>> _/ Alex Van Boxel
>>
>>
>> On Wed, Jan 15, 2020 at 1:18 AM Luke Cwik  wrote:
>>
>>> It depends on what language the client libraries are exposed in. For
>>> example, if the client libraries are in Java, sdks/java/extensions makes
>>> sense while if its Python then integrating it within the gcp extension
>>> within sdks/python/apache_beam makes sense.
>>>
>>> Adding additional dependencies is ok depending on the licensing and the
>>> process is slightly different for each language.
>>>
>>> For transforms that are complicated, there is a cross-language effort
>>> going on so that one can execute one language's transforms within another
>>> language's pipeline, which may remove the need to write the transforms
>>> more than once.
>>>
>>> On Tue, Jan 14, 2020 at 7:43 AM Ismaël Mejía  wrote:
>>>
 Nice idea, IO looks like a good place for them but there is another
 path that could fit this case: `sdks/java/extensions`, some module like
 `google-cloud-platform-ai` in that folder or something like that, no?

 In any case great initiative. +1



 On Tue, Jan 14, 2020 at 4:22 PM Kamil Wasilewski <
 kamil.wasilew...@polidea.com> wrote:

> Hi all,
>
> We’d like to implement a set of PTransforms that would allow users to
> use some of the Google Cloud AI services in Beam pipelines.
>
> Here's the full list of services and functionalities we’d like to
> integrate Beam with:
>
> * Video Intelligence [1]
>
> * Cloud Natural Language [2]
>
> * Cloud AI Platform Prediction [3]
>
> * Data Masking/Tokenization [4]
>
> * Inspecting image data for sensitive information using Cloud Vision
> [5]
>
> However, we're not sure whether to put those transforms directly into
> Beam, because they would require some additional GCP dependencies. One of
> our ideas is a separate library that depends on Beam, can be installed
> optionally, and is stored somewhere in the Beam repository (e.g. in the
> BEAM_ROOT/extras directory). Do you think it is a reasonable approach? Or
> maybe it is totally fine to put them into SDKs, just like other IOs?
>
> If you have any other thoughts, do not hesitate to let us know.
>
> Best,
>
> Kamil
>
> [1] https://cloud.google.com/video-intelligence/
>
> [2] https://cloud.google.com/natural-language/
>
> [3] https://cloud.google.com/ml-engine/docs/prediction-overview
>
> [4]
> https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming#dlptexttobigquerystreaming
>
> [5] https://cloud.google.com/vision/
>



Re: GSOC announced!

2020-01-15 Thread Ismaël Mejía
Thanks for bringing this info. +1 on the Nexmark + Python + Portability
project.
Let's sync on that one, Pablo. I am interested in co-mentoring it.


On Tue, Jan 14, 2020 at 7:55 PM Rui Wang  wrote:

> Great! I will try to propose something for BeamSQL.
>
>
> -Rui
>
> On Tue, Jan 14, 2020 at 10:40 AM Pablo Estrada  wrote:
>
>> Hello everyone,
>>
>> As with every year, the Google Summer of Code has been announced[1], so
>> we can start preparing for it if anyone is interested. It's early in the
>> process for now, but it's good to prepare early : )
>>
>> Here are the ASF mentor guidelines[2]. For now, the thing to do is to
>> file JIRA issues for your projects, and apply the labels "mentor", "gsoc",
>> "gsoc2020".
>>
>> When the time comes, the next steps are to join the
>> ment...@community.apache.org list, and request the PMC for approval of a
>> project.
>>
>> My current plan is to have these projects, though these are subject to
>> change:
>> - Build Nexmark pipelines for Python SDK (Ismael FYI)
>> - Azure Blobstore File System for Java & Python
>>
>> I'll try to keep the dev@ list updated with other steps of the process.
>> Thanks!
>> -P.
>>
>> [1] https://summerofcode.withgoogle.com/
>> [2]
>> https://community.apache.org/gsoc.html#prospective-asf-mentors-read-this
>>
>


[CVE-2020-1929] Apache Beam MongoDB IO connector disables certificate trust verification

2020-01-15 Thread Ismaël Mejía
CVE-2020-1929 Apache Beam MongoDB IO connector disables certificate trust
verification

Severity: Major
Vendor: The Apache Software Foundation

Versions Affected:
Apache Beam 2.10.0 to 2.16.0

Description:
The Apache Beam MongoDB connector in versions 2.10.0 to 2.16.0 has an option
to disable SSL trust verification. However, this configuration is not
respected and the certificate verification disables trust verification in
every case. This exclusion also gets registered globally, which disables
trust checking for any code running in the same JVM.

Mitigation:
Users of the affected versions should apply one of the following
mitigations:
- Upgrade to Apache Beam 2.17.0 or later

Acknowledgements:
This issue was reported (and fixed) by Colm Ó hÉigeartaigh.
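
The failure mode described above - a trust-verification exclusion registered
globally rather than per-connection - can be illustrated with a short,
self-contained sketch. Note the CVE itself is in Beam's Java MongoDB
connector; Python's standard `ssl` module is used here purely as an analogy:

```python
import ssl

# An "unverified" context skips both certificate and hostname checks.
insecure = ssl._create_unverified_context()
assert insecure.verify_mode == ssl.CERT_NONE
assert insecure.check_hostname is False

# The dangerous pattern: swapping the *process-wide* default factory, so
# every later HTTPS client in the same process silently loses trust
# checking -- analogous to the JVM-wide registration described above.
saved = ssl._create_default_https_context
ssl._create_default_https_context = ssl._create_unverified_context
try:
    # Anything that now asks for the default HTTPS context (e.g.
    # urllib.request over HTTPS) gets a non-verifying one.
    ctx = ssl._create_default_https_context()
    assert ctx.verify_mode == ssl.CERT_NONE
finally:
    ssl._create_default_https_context = saved  # undo the global change

# A properly constructed default context verifies certs and hostnames.
safe = ssl.create_default_context()
assert safe.verify_mode == ssl.CERT_REQUIRED and safe.check_hostname
```

The fix in Beam 2.17.0 restores per-connection scoping, which is why
upgrading is the listed mitigation.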


Re: [DISCUSS] Integrate Google Cloud AI functionalities

2020-01-15 Thread Kamil Wasilewski
Based on your feedback, I think it'd be fine to deal with the problem as
follows:
* for Python: put the transforms into `sdks/python/apache_beam/io/gcp/ai`
* for Java: create a `google-cloud-platform-ai` module in
`sdks/java/extensions` folder
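
To make that concrete, here is a rough, standalone sketch of the shape one of
these transforms could take. All names in it (AnnotateText, _StubNlpClient)
are hypothetical, invented for illustration; a real implementation under
`sdks/python/apache_beam/io/gcp/ai` would subclass beam.DoFn and call the
actual Cloud Natural Language client:

```python
class _StubNlpClient:
    """Stands in for a google.cloud.language client in this sketch."""

    def analyze_sentiment(self, text):
        # Pretend score: positive iff the text contains "good".
        return {"text": text, "score": 1.0 if "good" in text else -1.0}


class AnnotateText:
    """DoFn-shaped wrapper: create the client once, reuse per element."""

    def __init__(self, client_factory=_StubNlpClient):
        self._client_factory = client_factory
        self._client = None

    def setup(self):
        # In a real DoFn, setup() runs once per worker instance, so the
        # (expensive) client is not recreated for every element.
        self._client = self._client_factory()

    def process(self, element):
        yield self._client.analyze_sentiment(element)


fn = AnnotateText()
fn.setup()
results = [r for text in ["good day", "bad day"] for r in fn.process(text)]
assert [r["score"] for r in results] == [1.0, -1.0]
```

The setup()/process() split mirrors the DoFn lifecycle, which is the main
thing such thin service wrappers need to get right.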

As for cross-language, we expect those transforms to be quite simple, so
the cost of implementing them twice is not that high.

Thanks for your input,
Kamil

On Wed, Jan 15, 2020 at 7:58 AM Alex Van Boxel  wrote:

> If it's in Java, also be careful to align with the current Google Cloud
> IOs, particularly their dependencies. The Google IOs don't depend on the
> newest client libraries, and that's something we sometimes struggle with
> when we depend on our own client libraries. So make sure to align them.
>
> Also note that although gRPC is vendored, the Google IOs still have
> their own dependency on gRPC, and this is the biggest source of trouble.
>
>  _/
> _/ Alex Van Boxel
>
>
> On Wed, Jan 15, 2020 at 1:18 AM Luke Cwik  wrote:
>
>> It depends on what language the client libraries are exposed in. For
>> example, if the client libraries are in Java, sdks/java/extensions makes
>> sense, while if it's Python then integrating it within the gcp extension
>> within sdks/python/apache_beam makes sense.
>>
>> Adding additional dependencies is ok depending on the licensing and the
>> process is slightly different for each language.
>>
>> For transforms that are complicated, there is a cross-language effort
>> going on so that one can execute one language's transforms within another
>> language's pipeline, which may remove the need to write the transforms more
>> than once.
>>
>> On Tue, Jan 14, 2020 at 7:43 AM Ismaël Mejía  wrote:
>>
>>> Nice idea. IO looks like a good place for them, but there is another path
>>> that could fit this case: `sdks/java/extensions`, some module like
>>> `google-cloud-platform-ai` in that folder or something like that, no?
>>>
>>> In any case great initiative. +1
>>>
>>>
>>>
>>> On Tue, Jan 14, 2020 at 4:22 PM Kamil Wasilewski <
>>> kamil.wasilew...@polidea.com> wrote:
>>>
 Hi all,

 We’d like to implement a set of PTransforms that would allow users to
 use some of the Google Cloud AI services in Beam pipelines.

 Here's the full list of services and functionalities we’d like to
 integrate Beam with:

 * Video Intelligence [1]

 * Cloud Natural Language [2]

 * Cloud AI Platform Prediction [3]

 * Data Masking/Tokenization [4]

 * Inspecting image data for sensitive information using Cloud Vision [5]

 However, we're not sure whether to put those transforms directly into
 Beam, because they would require some additional GCP dependencies. One of
 our ideas is a separate library that depends on Beam, can be installed
 optionally, and is stored somewhere in the Beam repository (e.g. in the
 BEAM_ROOT/extras directory). Do you think it is a reasonable approach? Or
 maybe it is totally fine to put them into SDKs, just like other IOs?

 If you have any other thoughts, do not hesitate to let us know.

 Best,

 Kamil

 [1] https://cloud.google.com/video-intelligence/

 [2] https://cloud.google.com/natural-language/

 [3] https://cloud.google.com/ml-engine/docs/prediction-overview

 [4]
 https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming#dlptexttobigquerystreaming

 [5] https://cloud.google.com/vision/

>>>


Re: Jenkins job execution policy

2020-01-15 Thread Michał Walenia
Thanks for adding the whitelist!
I have the same issue as Kirill: the tests run when I push commits, but
phrase triggering works in a strange way - the jobs don't run after a
comment, but only after a push following the comment. Is there a ghprb
config that was changed, limiting the range of GitHub triggers for the jobs?
Michal

On Wed, Jan 15, 2020 at 1:55 AM Kirill Kozlov 
wrote:

> Thanks for working on this!
>
> I have noticed that tests run for new PRs and force-pushed commits, but if
> a test fails due to a flake I am unable to re-run it (ex: "Run Java
> PreCommit").
> PR that has this issue: https://github.com/apache/beam/pull/10369.
> Is this intended behaviour?
>
> -
> Kirill
>
> On Tue, Jan 14, 2020 at 3:20 PM Luke Cwik  wrote:
>
>> Does the approval list live beyond the lifetime of the jenkins machine
>> (my initial impression is that the approval list disappears on Jenkins
>> machine restart)?
>>
>> Also, I imagine that ASF wants an explicit way to see who is approved and
>> who is denied which the plugin doesn't seem to allow.
>>
>> On Tue, Jan 14, 2020 at 3:11 PM Pablo Estrada  wrote:
>>
>>> I've merged https://github.com/apache/beam/pull/10582 to unblock
>>> existing contributors that are having trouble getting their PRs tested
>>> without committer help. We can discuss Kai's suggestion.
>>>
>>> Looking at https://github.com/jenkinsci/ghprb-plugin, it seems like the
>>> 'add to whitelist' comment adds contributors permanently to a whitelist.
>>> This would have more immediate results than the .asf.yaml file. It would be
>>> harder to track who has the privilege, but it doesn't sound like that
>>> concerns us, right?
>>>
>>> Thoughts from others?
>>> -P.
>>>
>>> On Tue, Jan 14, 2020 at 1:43 PM Kai Jiang  wrote:
>>>
 Nice! I took a look at the Beam Jenkins job properties
 (CommonJobProperties.groovy#L108-L111) and they use jenkinsci/ghprb-plugin.
 It should support the "add to whitelist" comment from a committer on a PR
 for adding new contributors to the whitelist. Adding GitHub accounts to the
 .asf.yaml might be a little heavy if this approach works. Could we also
 test this method?

 Best,
 Kai


 On Tue, Jan 14, 2020 at 1:16 PM Pablo Estrada 
 wrote:

> I've added all the PR authors for the last 1000 merged PRs. I will
> merge in a few minutes. I'll have a follow up change to document this on
> the website.
>
> On Tue, Jan 14, 2020 at 11:29 AM Luke Cwik  wrote:
>
>> Should we scrape all past contributors and add them to the file?
>>
>> On Tue, Jan 14, 2020 at 11:18 AM Kenneth Knowles 
>> wrote:
>>
>>> Nice! This will help at least temporarily. We can see if it grows
>>> too unwieldy. It is still unfriendly to newcomers.
>>>
>>> Kenn
>>>
>>> On Tue, Jan 14, 2020 at 11:06 AM Pablo Estrada 
>>> wrote:
>>>
 Hi all,
 ASF INFRA gave us a middle-ground sort of workaround for this by
 using .asf.yaml files. Here's a change to implement it[1], and
 documentation for the .asf.yaml file[2], as well as the relevant section
 for our case[3].

 I'll check the docs in [2] well before pushing to merge, just to be
 sure we're not breaking anything.

 [1] https://github.com/apache/beam/pull/10582
 [2]
 https://cwiki.apache.org/confluence/display/INFRA/.asf.yaml+features+for+git+repositories

 [3]
 https://cwiki.apache.org/confluence/display/INFRA/.asf.yaml+features+for+git+repositories#id-.asf.yamlfeaturesforgitrepositories-JenkinsPRWhitelisting
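
 For reference, the Jenkins PR whitelisting stanza described in [3] takes
 roughly this shape (field names as documented by INFRA; the usernames
 below are placeholders, not real entries from the Beam change):

```yaml
# .asf.yaml at the repository root (Jenkins PR whitelisting, per [3]).
jenkins:
  github_whitelist:
    - some-contributor      # GitHub usernames whose PRs Jenkins may build
    - another-contributor
```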

 On Mon, Jan 13, 2020 at 3:29 PM Luke Cwik  wrote:

> I'm for going back to the status quo where anyone's PR ran the
> tests automatically, or to the suggestion where users marked as
> contributors had their tests run automatically (with the documentation
> update about how to link your GitHub/JIRA accounts).
>
> On Mon, Jan 13, 2020 at 2:45 AM Michał Walenia <
> michal.wale...@polidea.com> wrote:
>
>> Hi,
>> I wanted to decouple the conversation about solutions to the
>> issue from job execution requests.
>> We have 131 open PRs right now and 64 committers with job running
>> privileges. From what I counted, more than 80 of those PRs are not 
>> authored
>> by committers.
>> I think that having committers answer testing and retesting
>> requests is a temporary solution and a permanent one should be 
>> decided upon
>> soon. While it's an inconvenience for contributors familiar with the
>> workings of the project and the community,