Re: [VOTE] Sign a pledge to discontinue support of Python 2 in 2020.

2019-10-01 Thread Lukasz Cwik
+1

On Tue, Oct 1, 2019 at 10:39 AM Ning Kang  wrote:

> +1
>
> On Tue, Oct 1, 2019 at 10:17 AM Pablo Estrada  wrote:
>
>> +1
>>
>> I guess it was http://python3statement.org : )
>>
>> On Tue, Oct 1, 2019 at 10:14 AM Mark Liu  wrote:
>>
>>> +1
>>>
>>> btw, the link (http://python3stament.org) you provided is broken.
>>>
>>> On Tue, Oct 1, 2019 at 9:44 AM Udi Meiri  wrote:
>>>
 +1

 On Tue, Oct 1, 2019 at 3:22 AM Łukasz Gajowy 
 wrote:

> +1
>
> On Tue, Oct 1, 2019 at 11:29 AM Maximilian Michels 
> wrote:
>
>> +1
>>
>> On 30.09.19 23:03, Reza Rokni wrote:
>> > +1
>> >
>> > On Tue, 1 Oct 2019 at 13:54, Tanay Tummalapalli <
>> ttanay...@gmail.com
>> > > wrote:
>> >
>> > +1
>> >
>> > On Tue, Oct 1, 2019 at 8:19 AM Suneel Marthi <
>> smar...@apache.org
>> > > wrote:
>> >
>> > +1
>> >
>> > On Mon, Sep 30, 2019 at 10:33 PM Manu Zhang
>> > mailto:owenzhang1...@gmail.com>>
>> wrote:
>> >
>> > +1
>> >
>> > On Tue, Oct 1, 2019 at 9:44 AM Austin Bennett
>> > > > > wrote:
>> >
>> > +1
>> >
>> > On Mon, Sep 30, 2019 at 5:22 PM Valentyn Tymofieiev
>> > mailto:valen...@google.com>>
>> wrote:
>> >
>> > Hi everyone,
>> >
>> > Please vote whether to sign a pledge on behalf
>> of
>> > Apache Beam to sunset Beam Python 2 offering
>> (in new
>> > releases) in 2020 on http://python3stament.org
>>  as
>> > follows:
>> >
>> > [ ] +1: Sign a pledge to discontinue support of
>> > Python 2 in Beam in 2020.
>> > [ ] -1: Do not sign a pledge to discontinue
>> support
>> > of Python 2 in Beam in 2020.
>> >
>> > The motivation and details for this vote were
>> > discussed in [1, 2]. Please follow up in [2] if
>> you
>> > have any questions.
>> >
>> > This is a procedural vote [3] that will follow
>> the
>> > majority approval rules and will be open for at
>> > least 72 hours.
>> >
>> > Thanks,
>> > Valentyn
>> >
>> > [1]
>> >
>> https://lists.apache.org/thread.html/eba6caa58ea79a7ecbc8560d1c680a366b44c531d96ce5c699d41535@%3Cdev.beam.apache.org%3E
>> > [2]
>> >
>> https://lists.apache.org/thread.html/456631fe1a696c537ef8ebfee42cd3ea8121bf7c639c52da5f7032e7@%3Cdev.beam.apache.org%3E
>> > [3]
>> https://www.apache.org/foundation/voting.html
>> >
>> >
>> >
>>
>


Re: Introduction + Support in Comms for Beam!

2019-09-30 Thread Lukasz Cwik
Welcome to the community.

On Mon, Sep 30, 2019 at 3:15 PM María Cruz  wrote:

> Hi everyone,
> my name is María Cruz, I am from Buenos Aires but I live in the Bay Area.
> I recently became acquainted with the Apache Beam project, and I got a chance
> to meet some of the Beam community at ApacheCon North America this past
> September. I'm testing out a communications framework
> 
> for Open Source communities. I'm emailing the list now because I'd like to
> work on a communications strategy for Beam, to make the most of the content
> you produce during Beam Summits.
>
> A little bit more about me. I am a communications strategist with 11 years
> of experience in the field, 8 of which are in the non-profit sector. I
> started working in Open Source in 2013, when I joined Wikimedia, the social
> movement behind Wikipedia. I now work to support Google Open Source
> projects, and I also volunteer in the communications team of the Apache
> Software Foundation, working closely with Sally (for those of you who know
> her).
>
> I will be sending the list a proposal in the coming days. Looking forward
> to hearing from you!
>
> Best,
>
> María
>


Re: Shuffling on shardnum, is it necessary?

2019-09-27 Thread Lukasz Cwik
Using a state variable to store the shard key introduces a GroupByKey
within Dataflow to ensure that there is a strict ordering on state. Other
runners insert similar materializations to guarantee this as well.

Also, a sufficiently powerful execution engine could do state
processing for the same key in parallel, as long as it were able to
resolve state write conflicts.

On Fri, Sep 27, 2019 at 8:47 AM Shannon Duncan 
wrote:

> Yes, Specifically TextIO withNumShards().
>
> On Fri, Sep 27, 2019 at 10:45 AM Reuven Lax  wrote:
>
>> I'm not sure what you mean by "write out to a specific shard number." Are
>> you talking about FileIO sinks?
>>
>> Reuven
>>
>> On Fri, Sep 27, 2019 at 7:41 AM Shannon Duncan <
>> joseph.dun...@liveramp.com> wrote:
>>
>>> So when beam writes out to a specific shard number, as I understand it
>>> does a few things:
>>>
>>> - Assigns a shard key to each record (reduces parallelism)
>>> - Shuffles and Groups by the shard key to colocate all records
>>> - Writes out to each shard file within a single DoFn per key...
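The three steps listed above can be sketched as a toy model in plain Python (this is only an illustration of the pattern, not actual Beam code; `write_with_fixed_shards` is a made-up name):

```python
import random
from collections import defaultdict

def write_with_fixed_shards(records, num_shards):
    """Toy model of a fixed-shard write: assign shard keys, group, write."""
    # Step 1: assign a shard key to each record (this caps parallelism).
    keyed = [(random.randrange(num_shards), record) for record in records]
    # Step 2: shuffle/group by shard key to colocate each shard's records.
    groups = defaultdict(list)
    for shard, record in keyed:
        groups[shard].append(record)
    # Step 3: "write" one output per shard (returned here instead of files).
    return dict(groups)
```

In a real pipeline, step 2 is the GroupByKey (and, on Dataflow, the shuffle-service call) whose cost the rest of the thread discusses eliminating.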
>>>
>>> When thinking about this, I believe we might be able to eliminate the
>>> GroupByKey and write each file's records directly with only a DoFn after
>>> the shard key is assigned.
>>>
>>> As long as the shard key is the actual key of the PCollection, then
>>> could we use a state variable to force all keys that are the same to
>>> process to share state with each other?
>>>
>>> On a DoFn can we use the setup to hold a Map of files being written to
>>> within bundles on that instance, and on teardown can we close all files
>>> within the map?
>>>
>>> If this is the case does it reduce the need for a shuffle and allow a
>>> DoFn to safely write out in append mode to a file, batch, etc held in
>>> state?
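The proposal above (per-key writers held in a map opened in setup and flushed in teardown, with no GroupByKey) can be sketched roughly like this; plain Python stands in for a Beam DoFn, and in-memory buffers stand in for open files:

```python
class ShardWriterDoFn:
    """Rough sketch of the proposal: buffer records per shard key in a map
    opened in setup() and flushed in teardown(), with no GroupByKey."""

    def setup(self):
        self.buffers = {}  # shard key -> buffered records ("open files")

    def process(self, element):
        key, record = element
        # Append-mode "write" to the file associated with this shard key.
        self.buffers.setdefault(key, []).append(record)

    def teardown(self):
        # Close/flush every open "file" held by this instance.
        flushed, self.buffers = self.buffers, {}
        return flushed
```

As Lukasz's reply notes, real runners may still insert a materialization to guarantee ordered state access, so this sketch does not by itself remove the shuffle.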
>>>
>>> It doesn't really decrease parallelism after the key is assigned, since
>>> it can parallelize over each key within its state window, which is the
>>> same level of parallelism we achieve by doing a GroupByKey and looping
>>> over the result. So performance shouldn't be impacted if this holds true.
>>>
>>> It's kind of like combining both the shuffle and the data write in the
>>> same step?
>>>
>>> This does however have a significant cost reduction by eliminating a
>>> compute based shuffle and also eliminating a Dataflow shuffle service call
>>> if shuffle service is enabled.
>>>
>>> Thoughts?
>>>
>>> Thanks,
>>> Shannon Duncan
>>>
>>


Re: Do we know why gradle scans are not working?

2019-09-26 Thread Lukasz Cwik
The summary so far is that the spotless plugin is somehow interfering with
what is being published, in a way that breaks the build scan.

I tried an older version of the spotless plugin and also the latest version
and neither worked. I then tried to remove the spotless plugin[1] and was
able to successfully publish a build scan[2].

Still waiting on an update in the forum about how to keep both spotless and
build scans working.

1:
https://github.com/lukecwik/incubator-beam/commit/1f026c9b94e3f8afc0cd472350b77c1eeef30a83
2: https://gradle.com/s/665zarmdoixh6


On Wed, Sep 25, 2019 at 8:57 AM Lukasz Cwik  wrote:

> I reached out on the Gradle forum
> https://discuss.gradle.org/t/your-build-scan-could-not-be-displayed-what-does-this-mean/33302
>
> On Wed, Sep 25, 2019 at 8:49 AM Łukasz Gajowy  wrote:
>
>> FWIW, I tried doing it locally and observed the same behavior. It works
>> in my other private projects. This is all I know for now.
>>
>> Łukasz
>>
>> On Tue, Sep 24, 2019 at 10:43 PM Lukasz Cwik  wrote:
>>
>>> Not to my knowledge. Maybe something is down.
>>>
>>> Have you tried running a gradle build locally with --scan?
>>>
>>> On Tue, Sep 24, 2019 at 1:03 PM Valentyn Tymofieiev 
>>> wrote:
>>>
>>>> For example, https://gradle.com/s/mpfu3wpz2xfwe  says: Your build scan
>>>> could not be displayed.
>>>>
>>>


Re: Possible Python SDK performance regression

2019-09-25 Thread Lukasz Cwik
My environment has gotten all the dependencies installed/setup/maintained
organically over time as the project has evolved.


On Wed, Sep 25, 2019 at 9:56 AM Thomas Weise  wrote:

> The issue was related to how we build our custom packages.
>
> However, what might help users is documentation about the Cython setup,
> which is currently missing from the Python SDK docs.
>
> I'm also wondering how folks setup their environment for releases. Is it
> manual? Or is there a container that has all dependencies preinstalled?
>
> Thanks,
> Thomas
>
>
> On Wed, Sep 25, 2019 at 7:14 AM Valentyn Tymofieiev 
> wrote:
>
>> Thank you. In case there are details that would be relevant for others in
>> the community to avoid similar regressions, feel free to share them. We
>> also have Cython experts here who may be able to advise.
>>
>>
>> On Wed, Sep 25, 2019 at 6:58 AM Thomas Weise  wrote:
>>
>>> After running through the entire bisect based on the 2.16 release branch
>>> I found that the regression was caused by our own Cython setup. So green
>>> light for the 2.16.0 release.
>>>
>>> Thomas
>>>
>>> On Tue, Sep 17, 2019 at 1:21 PM Thomas Weise  wrote:
>>>
 Hi Valentyn,

 Thanks for the reminder. The bisect is on my TODO list.

 Hopefully this week.

 I saw the discussion about declaring 2.16 LTS. We probably need to sort
 these performance concerns out prior to doing so.

 Thomas


 On Tue, Sep 17, 2019 at 12:02 PM Valentyn Tymofieiev <
 valen...@google.com> wrote:

> Hi Thomas,
>
> Just a reminder that 2.16.0 was cut and soon the voting may start, so
> to avoid the regression that you reported blocking the vote, it would be
> great to start investigating it if it is reproducible.
>
> Thanks,
> Valentyn
>
> On Tue, Sep 10, 2019 at 1:53 PM Valentyn Tymofieiev <
> valen...@google.com> wrote:
>
>> Thomas, did you have a chance to open a Jira for the streaming
>> regression you observe? If not, could you please do so and cc +Ankur
>> Goenka  ? I talked with Ankur offline and he is
>> also interested in this regression.
>>
>> I opened:
>> - https://issues.apache.org/jira/browse/BEAM-8198 for batch
>> regression.
>> - https://issues.apache.org/jira/browse/BEAM-8199 to improve tooling
>> around performance monitoring.
>> - https://issues.apache.org/jira/browse/BEAM-8200 to add benchmarks
>> for streaming.
>>
>> I cc'ed some folks, however not everyone. Manisha, I could not find
>> your username in Jira, feel free to cc or assign BEAM-8199
>>   to yourself if
>> that is something you are actively working on.
>>
>> Thanks,
>> Valentyn
>>
>> On Mon, Sep 9, 2019 at 9:59 AM Mark Liu  wrote:
>>
>>> +Alan Myrvold  +Yifan Zou  It
 would be good to have alerts on benchmarks. Do we have such an ability
 today?

>>>
>>> As for regression detection, we have a Jenkins job
>>> beam_PerformanceTests_Analysis
>>> 
>>>  which
> >>> analyzes metrics on BigQuery and reports a summary to the job console
> >>> output. However, not all jobs are registered on this analyzer, and
> >>> currently no further alerting (e.g. email/Slack) is integrated with it.
>>>
>>> There is ongoing work to add alerting to benchmarks. Kasia and
>>> Kamil are investigating Prometheus + Grafana, and Manisha and I are
>>> looking into mako.dev.
>>>
>>> Mark
>>>
>>> On Fri, Sep 6, 2019 at 7:21 PM Ahmet Altay  wrote:
>>>
 I agree, let's investigate. Thomas, could you file JIRAs once you
 have additional information?

 Valentyn, I think the performance regression could be investigated
 now, by running whatever benchmarks are available against 2.14, 2.15,
 and head and seeing if the same regression can be reproduced.

 On Fri, Sep 6, 2019 at 7:11 PM Valentyn Tymofieiev <
 valen...@google.com> wrote:

> Sounds like these regressions need to be investigated ahead of
> 2.16.0 release.
>
> On Fri, Sep 6, 2019 at 6:44 PM Thomas Weise 
> wrote:
>
>>
>>
>> On Fri, Sep 6, 2019 at 6:23 PM Ahmet Altay 
>> wrote:
>>
>>>
>>>
>>> On Fri, Sep 6, 2019 at 6:17 PM Thomas Weise 
>>> wrote:
>>>


 On Fri, Sep 6, 2019 at 2:24 PM Valentyn Tymofieiev >>> google.com> wrote:

> +Mark Liu  has added some benchmarks
> running across multiple Python versions. Specifically we run 1 GB 
> wordcount
> job on Dataflow runner on Python 2.7, 3.5-3.7. The benchmarks do 

Re: Do we know why gradle scans are not working?

2019-09-25 Thread Lukasz Cwik
I reached out on the Gradle forum
https://discuss.gradle.org/t/your-build-scan-could-not-be-displayed-what-does-this-mean/33302

On Wed, Sep 25, 2019 at 8:49 AM Łukasz Gajowy  wrote:

> FWIW, I tried doing it locally and observed the same behavior. It works in
> my other private projects. This is all I know for now.
>
> Łukasz
>
> On Tue, Sep 24, 2019 at 10:43 PM Lukasz Cwik  wrote:
>
>> Not to my knowledge. Maybe something is down.
>>
>> Have you tried running a gradle build locally with --scan?
>>
>> On Tue, Sep 24, 2019 at 1:03 PM Valentyn Tymofieiev 
>> wrote:
>>
>>> For example, https://gradle.com/s/mpfu3wpz2xfwe  says: Your build scan
>>> could not be displayed.
>>>
>>


Re: Collecting feedback for Beam usage

2019-09-24 Thread Lukasz Cwik
One of the options could be to just display the URL and not to phone home.
I would like it so that users can integrate this into their deployment
solution so we get regular stats instead of only when a user decides to run
a pipeline manually.

On Tue, Sep 24, 2019 at 11:13 AM Robert Bradshaw 
wrote:

> I think the goal is to lower the barrier of entry. Displaying a URL to
> click on while waiting for your pipeline to start up, that contains
> all the data explicitly visible, is about as easy as it gets.
> Remembering to run a new (probably not as authentic) pipeline with
> that flag is less so.
>
> On Tue, Sep 24, 2019 at 11:04 AM Mikhail Gryzykhin 
> wrote:
> >
> > I'm with Luke on this. We can add a set of flags to send home stats and
> crash dumps if the user agrees. If we keep the code isolated, it will be easy
> enough for the user to check what is being sent.
> >
> > One more heavy-weight option is to also allow the user to configure and
> persist what information they are OK with sharing.
> >
> > --Mikhail
> >
> >
> > On Tue, Sep 24, 2019 at 10:02 AM Lukasz Cwik  wrote:
> >>
> >> Why not add a flag to the SDK that would do the phone home when
> specified?
> >>
> >> From a support perspective it would be useful to know:
> >> * SDK version
> >> * Runner
> >> * SDK provided PTransforms that are used
> >> * Features like user state/timers/side inputs/splittable dofns/...
> >> * Graph complexity (# nodes, # branches, ...)
> >> * Pipeline failed or succeeded
> >>
> >> On Mon, Sep 23, 2019 at 3:18 PM Robert Bradshaw 
> wrote:
> >>>
> >>> On Mon, Sep 23, 2019 at 3:08 PM Brian Hulette 
> wrote:
> >>> >
> >>> > Would people actually click on that link though? I think Kyle has a
> point that in practice users would only find and click on that link when
> they're having some kind of issue, especially if the link has "feedback" in
> it.
> >>>
> >>> I think the idea is that we would make the link very light-weight,
> >>> kind of like a survey (but even easier as it's pre-populated).
> >>> Basically an opt-in phone-home. If we don't collect any personal data
> >>> (not even IP/geo, just (say) version + runner, all visible in the
> >>> URL), no need to guard/anonymize (and this may be sufficient--I don't
> >>> think we have to worry about spammers and ballot stuffers given the
> >>> target audience). If we can catch people while they wait for their
> >>> pipeline to start up (and/or complete), this is a great time to get
> >>> some feedback.
> >>>
> >>> > I agree usage data would be really valuable, but I'm not sure that
> this approach would get us good data. Is there a way to get download
> statistics for the different runner artifacts? Maybe that could be a better
> metric to compare usage.
> >>>
> >>> This'd be useful too, but hard to get and very noisy.
> >>>
> >>> >
> >>> > On Mon, Sep 23, 2019 at 2:57 PM Ankur Goenka 
> wrote:
> >>> >>
> >>> >> I agree, these are the questions that need to be answered.
> >>> >> The data can be anonymized and stored as public data in BigQuery or
> some other place.
> >>> >>
> >>> >> The intent is to get usage statistics so that we can learn whether
> people are using Flink or Spark etc.; it is not intended as a discussion
> or a help channel.
> >>> >> I also think that we don't need to monitor this actively as it's
> more like a survey rather than active channel to get issues resolved.
> >>> >>
> >>> >> If we think its useful for the community then we come up with the
> solution as to how can we do this (similar to how we released the container
> images).
> >>> >>
> >>> >>
> >>> >>
> >>> >> On Fri, Sep 20, 2019 at 4:38 PM Kyle Weaver 
> wrote:
> >>> >>>
> >>> >>> There are some logistics that would need to be worked out. For example,
> Where would the data go? Who would own it?
> >>> >>>
> >>> >>> Also, I'm not convinced we need yet another place to discuss Beam
> when we already have discussed the challenge of simultaneously monitoring
> mailing lists, Stack Overflow, Slack, etc. While "how do you use Beam" is
> certainly an interesting question, and I'd be curious to know that >= X
> many people use a certain runner, I'm not sure answers to these questions
> 

Re: Do we know why gradle scans are not working?

2019-09-24 Thread Lukasz Cwik
Not to my knowledge. Maybe something is down.

Have you tried running a gradle build locally with --scan?

On Tue, Sep 24, 2019 at 1:03 PM Valentyn Tymofieiev 
wrote:

> For example, https://gradle.com/s/mpfu3wpz2xfwe  says: Your build scan
> could not be displayed.
>


Re: Jenkins queue times steadily increasing for a few months now

2019-09-24 Thread Lukasz Cwik
We can get the per-task Gradle profile with the --profile flag:
https://jakewharton.com/static/files/trace/profile.html
This information also appears within the build scans that are sent to
Gradle.

Integrating with either of these sources of information would allow us to
figure out whether its new tasks or old tasks taking longer.

On Tue, Sep 24, 2019 at 12:23 PM Robert Bradshaw 
wrote:

> Does anyone know how to gather stats on where the time is being spent?
> Several times the idea of consolidating many of the (expensive)
> validates runner integration tests into a single pipeline, and then
> running things individually only if that fails, has come up. I think
> that'd be a big win if indeed this is where our time is being spent.
>
> On Tue, Sep 24, 2019 at 12:13 PM Daniel Oliveira 
> wrote:
> >
> > Those ideas all sound good. I especially agree with trying to reduce
> tests first and then if we've done all we can there and latency is still
> too high, it means we need more workers. In addition to reducing the
> number of tests, we could also run less important tests less frequently,
> particularly when it comes to postcommits since many of those are resource
> intensive. That would require people with good context around what our many
> postcommits are used for.
> >
> > Another idea I thought of is trying to move automated test runs to
> outside of peak coding times. Ideally, during the times when we get the
> greatest amounts of PRs (and therefore precommits) we shouldn't have any
> postcommits running. If we have both pre and postcommits going at the same
> time during peak hours, our queue times will shoot up even if the total
> amount of work doesn't change much.
> >
> > Btw, you mentioned that this was a problem last year. Do you have any
> links to discussions about that? It seems like it could be useful.
> >
> > On Thu, Sep 19, 2019 at 1:10 PM Mikhail Gryzykhin 
> wrote:
> >>
> >> Hi Daniel,
> >>
> >> Generally this looks feasible, since jobs wait for a new worker to be
> available before starting.
> >>
> >> Over time we added more tests and did not deprecate enough, which
> increases the load on workers. I wonder if we can add something like the
> total runtime of all running jobs? This would be a safeguard metric showing
> the amount of time we actually run jobs. If it increases with the same
> number of workers, that will prove that we are overloading them (the
> inverse is not necessarily correct).
> >>
> >> On addressing this, we can review the approaches we took last year and
> see if any of them apply. Brainstorming, the following ideas come to mind:
> add more workers, reduce the number of tests, do a better job of filtering
> out irrelevant tests, and cancel irrelevant jobs (i.e. cancel tests if the
> linter fails) and/or add an option for cancelling irrelevant jobs. One more
> big point could be an effort on deflaking, but we seem to be decent in this
> area.
> >>
> >> Regards,
> >> Mikhail.
> >>
> >>
> >> On Thu, Sep 19, 2019 at 12:22 PM Daniel Oliveira <
> danolive...@google.com> wrote:
> >>>
> >>> Hi everyone,
> >>>
> >>> A little while ago I was taking a look at the Precommit Latency
> metrics on Grafana (link) and saw that the monthly 90th percentile metric
> has been really increasing the past few months, from around 10 minutes to
> currently around 30 minutes.
> >>>
> >>> After doing some light digging I was shown this page (beam load
> statistics) which seems to imply that queue times are shooting up when all
> the test executors are occupied, and it seems this is happening for longer
> and more often recently. I also took a look at the commit history for our
> Jenkins tests and I see that new tests have steadily been added.
> >>>
> >>> I wanted to bring this up with the dev@ to ask:
> >>>
> >>> 1. Is this accurate? Can anyone provide insight into the metrics? Does
> anyone know how to double check my assumptions with more concrete metrics?
> >>>
> >>> 2. Does anyone have ideas on how to address this?
> >>>
> >>> Thanks,
> >>> Daniel Oliveira
>
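For reference, the monthly 90th-percentile queue latency cited in this thread can be computed from raw per-job queue times; a nearest-rank sketch (not necessarily the exact method the Grafana dashboard uses) looks like this:

```python
import math

def nearest_rank_percentile(values, p):
    """Nearest-rank p-th percentile of a non-empty list of queue times."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank into the sorted list
    return ordered[rank - 1]
```

Tracking this alongside Mikhail's proposed "total runtime of all running jobs" metric would help separate worker overload from genuinely slower tasks.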


Re: Collecting feedback for Beam usage

2019-09-24 Thread Lukasz Cwik
Why not add a flag to the SDK that would do the phone home when specified?

From a support perspective it would be useful to know:
* SDK version
* Runner
* SDK provided PTransforms that are used
* Features like user state/timers/side inputs/splittable dofns/...
* Graph complexity (# nodes, # branches, ...)
* Pipeline failed or succeeded
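One of the stats listed above, graph complexity, could be derived from the pipeline DAG roughly as follows. This is a hypothetical sketch: `graph_stats` and its adjacency-list input are made up for illustration and are not a Beam API.

```python
def graph_stats(edges):
    """Count nodes and branch points in a DAG given as {node: [downstream]}."""
    nodes = set(edges)
    for downstream in edges.values():
        nodes.update(downstream)
    # A "branch" here is any node with more than one downstream consumer.
    branches = sum(1 for downstream in edges.values() if len(downstream) > 1)
    return {"nodes": len(nodes), "branches": branches}
```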

On Mon, Sep 23, 2019 at 3:18 PM Robert Bradshaw  wrote:

> On Mon, Sep 23, 2019 at 3:08 PM Brian Hulette  wrote:
> >
> > Would people actually click on that link though? I think Kyle has a
> point that in practice users would only find and click on that link when
> they're having some kind of issue, especially if the link has "feedback" in
> it.
>
> I think the idea is that we would make the link very light-weight,
> kind of like a survey (but even easier as it's pre-populated).
> Basically an opt-in phone-home. If we don't collect any personal data
> (not even IP/geo, just (say) version + runner, all visible in the
> URL), no need to guard/anonymize (and this may be sufficient--I don't
> think we have to worry about spammers and ballot stuffers given the
> target audience). If we can catch people while they wait for their
> pipeline to start up (and/or complete), this is a great time to get
> some feedback.
>
> > I agree usage data would be really valuable, but I'm not sure that this
> approach would get us good data. Is there a way to get download statistics
> for the different runner artifacts? Maybe that could be a better metric to
> compare usage.
>
> This'd be useful too, but hard to get and very noisy.
>
> >
> > On Mon, Sep 23, 2019 at 2:57 PM Ankur Goenka  wrote:
> >>
> >> I agree, these are the questions that need to be answered.
> >> The data can be anonymized and stored as public data in BigQuery or some
> other place.
> >>
> >> The intent is to get usage statistics so that we can learn whether people
> are using Flink or Spark etc.; it is not intended as a discussion or a
> help channel.
> >> I also think that we don't need to monitor this actively as it's more
> like a survey rather than active channel to get issues resolved.
> >>
> >> If we think its useful for the community then we come up with the
> solution as to how can we do this (similar to how we released the container
> images).
> >>
> >>
> >>
> >> On Fri, Sep 20, 2019 at 4:38 PM Kyle Weaver 
> wrote:
> >>>
> >>> There are some logistics that would need to be worked out. For example,
> Where would the data go? Who would own it?
> >>>
> >>> Also, I'm not convinced we need yet another place to discuss Beam when
> we already have discussed the challenge of simultaneously monitoring
> mailing lists, Stack Overflow, Slack, etc. While "how do you use Beam" is
> certainly an interesting question, and I'd be curious to know that >= X
> many people use a certain runner, I'm not sure answers to these questions
> are as useful for guiding the future of Beam as discussions on the
> dev/users lists, etc. as the latter likely result in more depth/specific
> feedback.
> >>>
> >>> However, I do think it could be useful in general to include links
> directly in the console output. For example, maybe something along the
> lines of "Oh no, your Flink pipeline crashed! Check Jira/file a bug/ask the
> mailing list."
> >>>
> >>> Kyle Weaver | Software Engineer | github.com/ibzib |
> kcwea...@google.com
> >>>
> >>>
> >>> On Fri, Sep 20, 2019 at 4:14 PM Ankur Goenka 
> wrote:
> 
>  Hi,
> 
>  At the moment we don't really have a good way to collect any usage
> statistics for Apache Beam, like which runner is used, etc., as many users
> don't really have a way to report their use case.
>  How about we create a feedback page where users can add their
> pipeline details and use case?
>  Also, we can start printing the link to this page when user launch
> the pipeline in the command line.
>  Example:
>  $ python my_pipeline.py --runner DirectRunner --input /tmp/abc
> 
>  Starting pipeline
>  Please use
> http://feedback.beam.org?args=runner=DirectRunner,input=/tmp/abc
>  Pipeline started
>  ..
> 
>  Using a link and not publishing the data automatically will give users
> control over what they publish and what they don't. We can enhance the text
> and usage further, but the basic idea is to ask for user feedback at each
>  run of the pipeline.
>  Let me know what you think.
> 
> 
>  Thanks,
>  Ankur
>
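The pre-populated link Ankur describes could be built with a small helper that whitelists non-sensitive options. This is a sketch: `feedback.beam.org` comes from the example in the thread, and the particular whitelist is an assumption.

```python
from urllib.parse import urlencode

def feedback_url(base, options):
    """Pre-populate a feedback link from a whitelist of pipeline options."""
    allowed = {"runner", "sdk_version"}  # never include paths, IPs, etc.
    safe = {k: v for k, v in options.items() if k in allowed}
    return base + "?" + urlencode(sorted(safe.items()))
```

Keeping every published field visible in the URL itself matches Robert's point that there would then be nothing to guard or anonymize.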


Re: Next LTS?

2019-09-23 Thread Lukasz Cwik
I agree with what Lukasz Gajowy mentioned.

I find that Jenkins is fine when you're developing at HEAD, but as soon as
you cut a branch, the Jenkins configuration starts to drift, as it keeps
getting updated to HEAD with the seed job. I was always thinking that the Jenkins
configurations would be as thin as possible to invoke a single gradle
target that would be in the repo. This would allow for less drift since
most of the content would be coming from the repo itself.

On Fri, Sep 20, 2019 at 3:38 AM Łukasz Gajowy  wrote:

> And, yes, the fact that Jenkins jobs are separately evolving but pretty
> tightly coupled to the repo contents is a serious problem that I wish we
> had fixed. So verification of each PR was manual.
>
> Could you tell a little bit more about what exactly were the problems back
> then? Was that due to incompatible docker images used in portability tests
> maybe? Any other issues?
>
> My thoughts about Jenkins revolved around decoupling from Jenkins plugins,
> Groovy DSL, etc., and replacing as much as possible with more universal tools
> (bash, Gradle). The main drivers were to: (1) be able to run the same thing
> that runs on Jenkins using bash/Gradle (same scripts), (2) potentially be
> able to replace Jenkins more easily with some more modern/better CI/CD tool
> in the future (Github Actions/Gitlab or simply newer Jenkins with
> Jenkinsfiles). I don't understand yet what was the problem cited above (I
> didn't work on the LTS back then) so I'm not sure it would help with
> releasing LTS versions with backports.
>
> Łukasz
>
> On Fri, Sep 20, 2019 at 2:01 AM Ahmet Altay  wrote:
>
>> I agree with retiring 2.7 as the LTS family. Based on my experience with
>> users, 2.7 does not have particularly high adoption and, as pointed out, has
>> known critical issues. Declaring another LTS pending demand sounds
>> reasonable but how are we going to gauge this demand?
>>
>> +Yifan Zou  +Alan Myrvold  on
>> the tooling question as well. Unless we address the tooling problem it
>> seems difficult to feasibly maintain LTS versions over time.
>>
>> On Thu, Sep 19, 2019 at 3:45 PM Austin Bennett <
>> whatwouldausti...@gmail.com> wrote:
>>
>>> To be clear, I was picking on - or reminding us of - the promise: I
>>> don't have a strong personal need/desire (at least currently) for LTS to
>>> exist.  Though, worth ensuring we live up to what we keep on the website.
>>> And, without an active LTS, probably something we should take off the site?
>>>
>>>
>>> On Thu, Sep 19, 2019 at 1:33 PM Pablo Estrada 
>>> wrote:
>>>
 +Łukasz Gajowy  had at some point thought
 of setting up jenkins jobs without coupling them to the state of the repo
 during the last Seed Job. It may be that that improvement can help test
 older LTS-type releases?

 On Thu, Sep 19, 2019 at 1:11 PM Robert Bradshaw 
 wrote:

> In many ways the 2.7 LTS was trying to flesh out the process. I think
> we learned some valuable lessons. It would have been good to push out
> something (even if it didn't have everything we wanted) but that is
> unlikely to be worth pursuing now (and 2.7 should probably be retired
> as LTS and no longer recommended).
>
> I agree that it does not seem there is strong demand for an LTS at
> this point. I would propose that we keep 2.16, etc. as potential
> candidates, but only declare one as LTS pending demand. The question
> of how to keep our tooling stable (or backwards/forwards compatible)
> is a good one, especially as we move to drop Python 2.7 in 2020 (which
> could itself be a driver for an LTS).
>
> On Thu, Sep 19, 2019 at 12:27 PM Kenneth Knowles 
> wrote:
> >
> > Yes, I pretty much dropped the 2.7.1 release process due to lack of
> interest.
> >
> > There are known problems, so I cannot recommend that anyone use
> 2.7.0, yet 2.7 is the current LTS family. So my work on 2.7.1 was
> philosophical. I did not like the fact that we had a designated LTS family
> with no usable releases.
> >
> > But many backports were proposed to block 2.7.1 and took a very long
> time to get contributors to implement the backports. I ended up doing many
> of them just to move it along. This indicates a lack of interest to me. 
> The
> problem is that we cannot really use a strict cut off date as a way to
> ensure people do the important things and skip the unimportant things,
> because we do know that the issues are critical.
> >
> > And, yes, the fact that Jenkins jobs are separately evolving but
> pretty tightly coupled to the repo contents is a serious problem that I
> wish we had fixed. So verification of each PR was manual.
> >
> > Altogether, I still think LTS is valuable to have as a promise to
> users that we will backport critical fixes. I would like to keep that
> promise and continue to try. Things that are rapidly changing (which
> 

Re: New contributor to BEAM SQL

2019-09-16 Thread Lukasz Cwik
Welcome Kirill, I have granted you the JIRA permissions you requested.

On Mon, Sep 16, 2019 at 10:59 AM Kirill Kozlov 
wrote:

> Hello everyone!
>
> My name is Kirill Kozlov. I recently joined a Dataflow team at Google and
> will be working on SQL filter pushdown.
> Can I get permission to work on issues in Jira? My username is: kirillkozlov
> Looking forward to developing Beam together!
>
> Thank you,
> Kirill Kozlov
>


Re: portableWordCountBatch and portableWordCountStreaming failing in Python PreCommit

2019-09-16 Thread Lukasz Cwik
I'm also being impacted by this on my PR[1]. I found BEAM-6316[2] that has
a similar error but it was resolved Dec 2018.

1: https://github.com/apache/beam/pull/9583
2: https://issues.apache.org/jira/browse/BEAM-6316

On Mon, Sep 16, 2019 at 12:43 PM Ning Kang  wrote:

> A new check renders a clearer message:
>
> Unable to find image '
> jenkins-docker-apache.bintray.io/beam/flink-job-server:latest' locally
> docker: Error response from daemon: unknown: Repo 'apache' was not found.
> See 'docker run --help'.
> ERROR:root:Starting job service with ['docker', 'run', '-v',
> u'/usr/bin/docker:/bin/docker', '-v',
> '/var/run/docker.sock:/var/run/docker.sock', '--network=host', '
> jenkins-docker-apache.bintray.io/beam/flink-job-server:latest',
> '--job-host', 'localhost', '--job-port', '45687', '--artifact-port',
> '39407', '--expansion-port', '43893']
> ERROR:root:Error bringing up job service
>
>
>
> On Mon, Sep 16, 2019 at 12:39 PM Ning Kang  wrote:
>
>> To Ahmet, these are warnings, I'm not able to identify the errors yet.
>>
>> Thanks everyone! I'm watching the Jira now.
>>
>> On Mon, Sep 16, 2019 at 12:07 PM Chad Dombrova  wrote:
>>
>>> Ning, if you're having trouble making sense of the preCommit errors, you
>>> may be interested in this Jira:
>>> https://issues.apache.org/jira/browse/BEAM-8213#
>>>
>>> On Mon, Sep 16, 2019 at 12:02 PM Kyle Weaver 
>>> wrote:
>>>
 Python 2 isn't the reason the test is failing, that's just a warning.
 The actual error is at the very end of the log (it looks familiar to me,
 though I don't see a JIRA for it):

 <_Rendezvous of RPC that terminated with:
 status = StatusCode.UNIMPLEMENTED
 details = "Method
 org.apache.beam.model.job_management.v1.JobService/GetJobMetrics is
 unimplemented"
 debug_error_string =
 "{"created":"@1568424715.449291418","description":"Error received from peer
 ipv4:127.0.0.1:46117","file":"src/core/lib/surface/call.cc","file_line":1052,"grpc_message":"Method
 org.apache.beam.model.job_management.v1.JobService/GetJobMetrics is
 unimplemented","grpc_status":12}"
 >

 Kyle Weaver | Software Engineer | github.com/ibzib |
 kcwea...@google.com


 On Mon, Sep 16, 2019 at 11:34 AM Ning Kang  wrote:

> Hi! I've been seeing some errors during "Python PreCommit".
> I'm seeing "UserWarning: You are using Apache Beam with Python 2. New
> releases of Apache Beam will soon support Python 3 only. 'You are using
> Apache Beam with Python 2. '"
> Is there any plan to remove py2 tests from the pre-commit check once
> we stop supporting Python2?
> The scan link is:
> https://scans.gradle.com/s/vujoeo62uyfpi/console-log?task=:sdks:python:test-suites:portable:py2:portableWordCountBatch
>
> Thanks!
>
> Ning.
>
>


Re: How do you write portable runner pipeline on separate python code ?

2019-09-13 Thread Lukasz Cwik
And/or update the wiki/website with some how to's...

On Fri, Sep 13, 2019 at 7:51 AM Thomas Weise  wrote:

> I agree that loopback would be preferable for this purpose. I just wasn't
> aware this even works with the portable Flink runner. Is it one of the best
> guarded secrets? ;-)
>
> Kyle, can you please post the pipeline options you would use for Flink?
>
>
> On Thu, Sep 12, 2019 at 5:57 PM Kyle Weaver  wrote:
>
>> I prefer loopback because a) it writes output files to the local
>> filesystem, as the user expects, and b) you don't have to pull or build
>> docker images, or even have docker installed on your system -- which is one
>> less point of failure.
>>
>> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
>>
>>
>> On Thu, Sep 12, 2019 at 5:48 PM Thomas Weise  wrote:
>>
>>> This should become much better with 2.16 when we have the Docker images
>>> prebuilt.
>>>
>>> Docker is probably still the best option for Python on a JVM based
>>> runner in a local environment that does not have a development setup.
>>>
>>>
>>> On Thu, Sep 12, 2019 at 1:09 PM Kyle Weaver  wrote:
>>>
>>>> +dev  I think we should probably point new users
>>>> of the portable Flink/Spark runners to use loopback or some other
>>>> non-docker environment, as Docker adds some operational complexity that
>>>> isn't really needed to run a word count example. For example, Yu's pipeline
>>>> errored here because the expected Docker container wasn't built before
>>>> running.
>>>>
>>>> Kyle Weaver | Software Engineer | github.com/ibzib |
>>>> kcwea...@google.com
>>>>
>>>>
>>>> On Thu, Sep 12, 2019 at 11:27 AM Robert Bradshaw 
>>>> wrote:
>>>>
>>>>> On this note, making local files easy to read is something we'd
>>>>> definitely like to improve, as the current behavior is quite surprising.
>>>>> This could be useful not just for running with docker and the portable
>>>>> runner locally, but more generally when running on a distributed system
>>>>> (e.g. a Flink/Spark cluster or Dataflow). It would be very convenient if 
>>>>> we
>>>>> could automatically stage local files to be read as artifacts that could 
>>>>> be
>>>>> consumed by any worker (possibly via external directory mounting in the
>>>>> local docker case rather than an actual copy), and conversely copy small
>>>>> outputs back to the local machine (with the similar optimization for local
>>>>> docker).
>>>>>
>>>>> At the very least, however, obvious messaging should be added when the
>>>>> local filesystem is used from within Docker, which is often a (non-obvious
>>>>> and hard-to-debug) mistake.
>>>>>
>>>>>
>>>>> On Thu, Sep 12, 2019 at 10:34 AM Lukasz Cwik  wrote:
>>>>>
>>>>>> When you use a local filesystem path and a docker environment, "/tmp"
>>>>>> is written inside the container. You can solve this issue by:
>>>>>> * Using a "remote" filesystem such as HDFS/S3/GCS/...
>>>>>> * Mounting an external directory into the container so that any
>>>>>> "local" writes appear outside the container
>>>>>> * Using a non-docker environment such as external or process.
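[Editor's note] The non-Docker alternative in the last bullet can be sketched as a small helper that builds the pipeline flags for a loopback run. This is a sketch only: the `localhost:8099` job-server endpoint is an assumption (use whatever address your Flink job server actually prints on startup).

```python
# Sketch of the options for running a Python pipeline against a local
# portable Flink job server without Docker, using the LOOPBACK environment.
def portable_flink_loopback_args(job_endpoint="localhost:8099"):
    return [
        "--runner=PortableRunner",
        "--job_endpoint=%s" % job_endpoint,
        # LOOPBACK makes the runner call back into the submitting process
        # for execution, so no worker Docker image is pulled or built and
        # "local" output files land on the user's own filesystem.
        "--environment_type=LOOPBACK",
    ]

print(" ".join(portable_flink_loopback_args()))
```

These strings can be passed straight to `PipelineOptions(...)` or appended to a `python wordcount.py ...` invocation.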
>>>>>>
>>>>>> On Thu, Sep 12, 2019 at 4:51 AM Yu Watanabe 
>>>>>> wrote:
>>>>>>
>>>>>>> Hello.
>>>>>>>
>>>>>>> I would like to ask for help with my sample code using portable
>>>>>>> runner using apache flink.
>>>>>>> I was able to work out the wordcount.py using this page.
>>>>>>>
>>>>>>> https://beam.apache.org/roadmap/portability/
>>>>>>>
>>>>>>> I got below two files under /tmp.
>>>>>>>
>>>>>>> -rw-r--r-- 1 ywatanabe ywatanabe 185 Sep 12 19:56
>>>>>>> py-wordcount-direct-1-of-2
>>>>>>> -rw-r--r-- 1 ywatanabe ywatanabe 190 Sep 12 19:56
>>>>>>> py-wordcount-direct-0-of-2
>>>>>>>
>>>>>>> Then I wrote sample code with below steps.
>>>>>>>
>>>>>>> 1.Install apach

Fwd: Gradle Training Roundup

2019-09-12 Thread Lukasz Cwik
Forwarding some events to improve your Gradle-fu so that we can all improve
and maintain our build system.

-- Forwarded message -
From: Alex Leventer 
Date: Thu, Sep 12, 2019 at 10:55 AM
Subject: Gradle Training Roundup
To: 


Hi there,

Join us for these upcoming free, online, instructor-led training sessions
that will improve your overall team productivity when working with both
Gradle and Maven:

   - Sept 25: Hands-on workshop: Developer productivity engineering:
   Maintain performance as the codebase grows
   

-
   Use data and acceleration technologies to speed up the build/test/CI
   process and improve your developer experience. This session includes labs
   that optimize Maven and Gradle builds using Gradle Enterprise.

   - Sept 25: Build Cache Deep Dive
   

   - Tackle build performance issues using the Gradle local and Gradle
   Enterprise remote build cache which now supports both Gradle and Maven
   builds. This session covers build cache basics, task requirements,
   techniques for maximizing cache hits, troubleshooting and more.

   - Nov 12-13: Introduction to Gradle
   

   - Get hands-on training in Gradle fundamental features in less than one
   day. Topics include Gradle building blocks, best practices, plugins,
   dependency management, multi-project builds and more.


Best,
Alex

-- 
*Alex Leventer*
Customer Success
Gradle
325 9th Street, San Francisco, CA 94103



Would you like to opt-out?



Re: Feature request - file metadata AVP

2019-09-11 Thread Lukasz Cwik
In Java, each of the file system operations take an "options" class[1] such
as CreateOptions/MoveOptions.
In Python, there is an explicit field as a parameter[2].
Go doesn't seem to have those options available[3].

There was originally a plan to make those options classes be like
PipelineOptions where everything has a "view" and can be converted between
different types. Whether you decide to continue this path or try something
different, you will want to answer questions such as:
* How to provide an extensible way for providing file system specific
options (S3, GCS, ... all have different "options")?
* How do you plumb this through from PTransforms such as FileIO/Read/...?
* Which options are "filesystem" level while others are specific to an
individual file?
* Where do you provide hooks for people to set these options (pipeline
creation time or pipeline execution time)?

These design docs about filesystems may help provide additional background
[4, 5, 6].

1:
https://github.com/apache/beam/tree/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/fs
2:
https://github.com/apache/beam/blob/cb100199d5586414a642d6eaa72800c7463018ca/sdks/python/apache_beam/io/filesystem.py#L717
3:
https://github.com/apache/beam/blob/cb100199d5586414a642d6eaa72800c7463018ca/sdks/go/pkg/beam/io/filesystem/filesystem.go#L60
4:
https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M
5:
https://docs.google.com/document/d/1-7vo9nLRsEEzDGnb562PuL4q9mUiq_ZVpCAiyyJw8p8
6: http://s.apache.org/fileio-write
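[Editor's note] To make the extensibility question above concrete, here is a hypothetical sketch — not Beam's actual API — of a per-file options class where common fields are first-class and filesystem-specific metadata rides along in a generic mapping. The `content_encoding` key is an assumed GCS-style example, not an existing Beam parameter.

```python
# Hypothetical sketch only: an extensible "create options" bag. Common
# options (mime_type) are explicit attributes; filesystem-specific options
# (e.g. content-encoding for GCS) are collected in a generic dict that
# filesystems which do not understand them can simply ignore.
class CreateOptions:
    def __init__(self, mime_type="application/octet-stream", **fs_specific):
        self.mime_type = mime_type
        # Filesystem-specific key/values, keyed by option name.
        self.fs_specific = dict(fs_specific)

opts = CreateOptions(mime_type="text/plain", content_encoding="gzip")
print(opts.mime_type, opts.fs_specific)
```

This shape illustrates the open question above: which options are filesystem-level versus specific to an individual file.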

On Mon, Sep 9, 2019 at 4:24 PM C.J. Collier  wrote:

> Hey folks,
>
> I have a client requesting both content-type and content-encoding for
> files.  The only thing currently supported is MIME-TYPE, which is not quite
> either.
>
> Could I get some help figuring out how best to allow arbitrary metadata to
> be associated with files?
>
> Cheers,
>
> C.J.
>
>
> https://issues.apache.org/jira/browse/BEAM-8180
>
> --
> C.J. Collier | Technical Solutions Engineer | c...@google.com | +1 (206)
> 531-1853 <(206)%20531-1853>
>


Re: Request to join group

2019-09-11 Thread Lukasz Cwik
Thanks for reaching out, I have added your Google account to the
apache-beam-testing project with Viewer permissions.

On Tue, Sep 10, 2019 at 6:19 PM Hubert Theodore 
wrote:

> Hi Beam Dev Team!
>
> I am a Google engineer working on Cloud Console Dataflow. I have been
> assigned a bug related to a job in apache-beam-testing project. Would you
> add me to the project so that I can view the job and resolve the bug?
>
> Thanks!
>
> --
>
> Hubert Theodore
>
> htheod...@google.com
>
> Software Engineer
>


Re: clickhouse tests failing

2019-09-08 Thread Lukasz Cwik
It is passing at head on Jenkins:
https://builds.apache.org/job/beam_PreCommit_Java_Cron/1771/testReport/org.apache.beam.sdk.io.clickhouse/

What are the failures you're seeing at initialization? (the tests do rely on
setting up zookeeper and other stuff that could fail)

On Fri, Sep 6, 2019 at 12:36 PM Elliotte Rusty Harold 
wrote:

> At head I noticed the following:
>
>
> $ ./gradlew -p sdks/java/io/ check
> Configuration on demand is an incubating feature.
>
> > Task :sdks:java:io:clickhouse:test
>
> org.apache.beam.sdk.io.clickhouse.ClickHouseIOTest > classMethod FAILED
> java.lang.IllegalStateException
>
> org.apache.beam.sdk.io.clickhouse.ClickHouseIOTest > classMethod FAILED
> java.lang.NullPointerException
>
> org.apache.beam.sdk.io.clickhouse.AtomicInsertTest > classMethod FAILED
> java.lang.IllegalStateException
>
> org.apache.beam.sdk.io.clickhouse.AtomicInsertTest > classMethod FAILED
> java.lang.NullPointerException
>
> 29 tests completed, 4 failed
>
> > Task :sdks:java:io:clickhouse:test FAILED
>
> FAILURE: Build failed with an exception.
>
>
> Is anyone else seeing this? Are the tests expected to pass, or is
> there some requirement (e.g. Java 11) that I might be missing?
>
> --
> Elliotte Rusty Harold
> elh...@ibiblio.org
>


Re: [DISCUSS] Supporting multiple Flink versions vs. tech debt

2019-09-07 Thread Lukasz Cwik
When we import the Beam code into Google, we also run into issues where
sometimes we need to transform parts of the code. During import we use
copybara[1] to do these transformations to the source, which are more than
just copying file X from some other path, since most of the time we want to
change only a few lines, and this really helps reduce the maintenance pain.
Unfortunately I don't see a Gradle plugin for copybara, but I do imagine
there is a plugin that allows one to run sed-like expressions or other
transformations instead of just maintaining duplicate copies of files.

1: https://github.com/google/copybara

On Sat, Sep 7, 2019 at 3:37 AM David Morávek  wrote:

> Hello,
>
> we currently have an opened PR for Flink 1.9
> , which greatly improves the
> runner for batch use-case. In case the PR gets merged, we would be
> supporting 5 latest major versions of Flink, which obviously come with high
> maintenance price and makes future development harder (there are already a
> sub-optimal parts due to compatibility with previous versions). Thomas and
> Max expressed needs for addressing the issue with the current release.
>
> Let's break down possible solution for the problem.
>
> *1) Current solution*
>
> Currently we maintain separate build for each version. The project
> structure looks as follows:
>
> *flink/*
> +
> *1.5/*
>+ *src/** # implementation of classes that differ between versions*
>- build.gradle
> + *1.6/*
>+ build.gradle #* the version is backward compatible, so it can reuse
> "overrides" from 1.5*
> + *1.7/*
>+ build.gradle #* the version is backward compatible, so it can reuse
> "overrides" from 1.5*
> + *1.8/*
>+ *src/ **# implementation of classes that differ between versions*
>- build.gradle
> + *1.9/*
>+ *src/ **# implementation of classes that differ between versions*
>- build.gradle
> + *src/*
> * # common source, shared among runner versions*
> - flink_runner.gradle
> * # included by  each /build.gradle*
>
> The problem with this structure is that we always need to copy all of the
> version-specific classes between backward-incompatible versions, which
> results in *duplicate files* (we cannot simply override a single file,
> because it wouldn't compile due to duplicate classes).
>
> *2) Symlink duplicates*
>
> Maybe we can simply symlink duplicates between versions and only override
> the files that need to be changed?
>
> *3) Adjusting the gradle build*
>
> Currently a version build looks something like this (this one is for 1.7.x
> version):
>
> project.ext {
>   // Set the version of all Flink-related dependencies here.
>   flink_version = '1.7.2'
>   // Main source directory and Flink version specific code.
>   main_source_dirs = ["$basePath/src/main/java", "../1.5/src/main/java"]
>   test_source_dirs = ["$basePath/src/test/java", "../1.5/src/test/java"]
>   main_resources_dirs = ["$basePath/src/main/resources"]
>   test_resources_dirs = ["$basePath/src/test/resources"]
>   archives_base_name = 'beam-runners-flink-1.7'
> }
>
> // Load the main build script which contains all build logic.
> apply from: "$basePath/flink_runner.gradle"
>
> It basically says: take the common source and append version-specific
> implementations from the 1.5 version. Let's say we want to override a single
> file for 1.8. We need to copy everything from 1.5/src and the build file
> would look as follows:
>
> /* All properties required for loading the Flink build script */
> project.ext {
>   // Set the version of all Flink-related dependencies here.
>   flink_version = '1.8.0'
>   // Main source directory and Flink version specific code.
>   main_source_dirs = ["$basePath/src/main/java", "./src/main/java"]
>   test_source_dirs = ["$basePath/src/test/java", "./src/test/java"]
>   main_resources_dirs = ["$basePath/src/main/resources"]
>   test_resources_dirs = ["$basePath/src/test/resources"]
>   archives_base_name = 'beam-runners-flink-1.8'
> }
>
> // Load the main build script which contains all build logic.
> apply from: "$basePath/flink_runner.gradle"
>
> For simplicity, let's only focus on *main_source_dirs*. What we really want
> to do is to tell the build, to use everything from 1.5 and override a
> single class (e.g. CoderTypeSerializer).
>
> def copyOverrides = tasks.register('copyOverrides', Copy) {
>   it.from '../1.5/src/', './src'
>   it.into "${project.buildDir}/flink-overrides/src"
>   it.duplicatesStrategy DuplicatesStrategy.INCLUDE // The last duplicate
> file 'wins'.
> }
>
> compileJava.dependsOn copyOverrides
>
> projext.ext {
>   main_source_dirs = ["$basePath/src/main/java",
> "${project.buildDir}/flink-overrides/src/main/java"]
> }
>
> This would copy all overrides into the build directory, and in case of
> duplicates it picks the latest one. Then the build would simply compile
> classes from the newly created Java files in the build directory.
>
> *4) Maintaining last 3 major versions only*
>
> I recall that Flink 

Re: [VOTE] Vendored Dependencies Release

2019-09-05 Thread Lukasz Cwik
LGTM

On Wed, Sep 4, 2019 at 4:24 PM Rui Wang  wrote:

> Thanks Pablo for jumping in for help.
>
> Now the sources are moved to [1]. Please let me know if it is ok.
>
> [1]:
> https://dist.apache.org/repos/dist/release/beam/vendor/calcite/1_20_0/
>
> -Rui
>
> On Wed, Sep 4, 2019 at 4:15 PM Pablo Estrada  wrote:
>
>> I can help.
>>
>> On Wed, Sep 4, 2019 at 1:09 PM Rui Wang  wrote:
>>
>>> There is a step of releasing requires PMC permission:
>>>
>>> """
>>>
>>> Copy the source release from the dev repository to the release
>>> repository at dist.apache.org using Subversion.
>>> Move last release artifacts from dist.apache.org to archive.apache.org
>>> using Subversion. """
>>>
>>> Is there a PMC member could help on this operation to move [1] to
>>> "release" repo?
>>>
>>> [1]: [1]
>>> https://dist.apache.org/repos/dist/dev/beam/vendor/calcite/1_20_0
>>>
>>> -Rui
>>>
>>> On Wed, Sep 4, 2019 at 10:16 AM Rui Wang  wrote:
>>>
>>>> I'm happy to announce that we have unanimously approved this release.
>>>>
>>>> There are 5 approving votes, 3 of which are binding:
>>>>
>>>> * Lukasz Cwik
>>>>
>>>> * Kenneth Knowles
>>>>
>>>> * Ahmet Altay
>>>>
>>>> There are no disapproving votes.
>>>>
>>>> Thanks everyone!
>>>>
>>>> On Tue, Sep 3, 2019 at 1:29 PM Lukasz Cwik  wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Tue, Sep 3, 2019 at 1:22 PM Kenneth Knowles 
>>>>> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Tue, Sep 3, 2019 at 11:00 AM Ahmet Altay  wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> On Tue, Sep 3, 2019 at 10:52 AM Andrew Pilloud 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>> Inspected the jar it looked reasonable.
>>>>>>>>
>>>>>>>> Andrew
>>>>>>>>
>>>>>>>> On Tue, Sep 3, 2019 at 9:06 AM Rui Wang  wrote:
>>>>>>>>
>>>>>>>>> Friendly ping.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -Rui
>>>>>>>>>
>>>>>>>>> On Thu, Aug 29, 2019 at 9:50 AM Rui Wang 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Kai and Andrew. Now orgapachebeam-1083 is publicly exposed.
>>>>>>>>>>
>>>>>>>>>> I also found a useful link[1] to explain staging repos in Apache
>>>>>>>>>> Nexus
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [1]:
>>>>>>>>>> https://help.sonatype.com/repomanager2/staging-releases/managing-staging-repositories#ManagingStagingRepositories-ClosinganOpenRepository
>>>>>>>>>>
>>>>>>>>>> -Rui
>>>>>>>>>>
>>>>>>>>>> On Wed, Aug 28, 2019 at 9:19 PM Andrew Pilloud <
>>>>>>>>>> apill...@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> You need to close the release for it to be published to the
>>>>>>>>>>> staging server. I can help if you still have questions.
>>>>>>>>>>>
>>>>>>>>>>> Andrew
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Aug 28, 2019, 8:48 PM Rui Wang 
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I can see orgapachebeam-1083 is in open status in the staging
>>>>>>>>>>>> repository. I am not sure why it is not publicly exposed. I probably
>>>>>>>>>>>> need some guidance on it.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> -Rui
>>&

Re: Save state on tear down

2019-09-03 Thread Lukasz Cwik
On Fri, Aug 30, 2019 at 6:52 PM Chad Dombrova  wrote:

> +dev
>
> I read the document on Drain, and it sounds very promising.  I have a few
> questions, starting with this statement from the doc:
>
>"This document proposes a new pipeline action called Drain. Drain can
> be implemented by runners by manipulating the watermark of the pipeline."
>
> What is a pipeline "action" and how would this be exposed to the user?  I
> assume this is externally and manually initiated.  Is this something that
> would be invoked from a PipelineResult object (i.e. akin to "cancel")?
>

The term is alluding to job management (run, cancel, get metrics, ...) and
yes this should be exposed as part of PipelineResult in addition to any
tooling that performs job management.


>
> Once a Drain is initiated on a pipeline, does this trigger a loop over all
> unbounded sources to set their watermark to infinity, or only certain
> ones?
>

Conceptually yes, the runner sets the watermark to infinity for all
"sources" preventing new data from being produced. The watermark then
progresses downstream throughout the execution graph causing windows to
close, timers to fire, state to be emitted and then garbage collected, and
so forth.
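[Editor's note] The drain mechanics described above can be illustrated with a toy model — purely illustrative, not actual runner code: draining advances the watermark to infinity, which causes every window whose end time is now behind the watermark to close and emit.

```python
import math

# Toy model of drain semantics: a pipeline holds a watermark and a set of
# pending windows keyed by their end time. Advancing the watermark releases
# every window whose end time it has passed; drain = advance to infinity.
class ToyPipeline:
    def __init__(self):
        self.watermark = 0
        self.pending_windows = {10: "window-a", 20: "window-b"}  # end_time -> window
        self.emitted = []

    def advance(self, new_watermark):
        self.watermark = new_watermark
        for end in sorted(self.pending_windows):
            if end <= self.watermark:
                self.emitted.append(self.pending_windows.pop(end))

    def drain(self):
        # Setting the watermark to infinity flushes everything still pending.
        self.advance(math.inf)

p = ToyPipeline()
p.advance(12)      # closes window-a only
p.drain()          # closes the rest
print(p.emitted)
```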


>
> thanks,
> -chad
>
>
> On Fri, Aug 16, 2019 at 2:47 PM Jose Delgado 
> wrote:
>
>> I see,  thank you  Lukasz.
>>
>>
>>
>> Regards,
>> Jose
>>
>> *From: *Lukasz Cwik 
>> *Reply-To: *"u...@beam.apache.org" 
>> *Date: *Monday, August 5, 2019 at 11:11 AM
>> *To: *user 
>> *Subject: *Re: Save state on tear down
>>
>>
>>
>> This is not possible today.
>>
>>
>>
>> There have been discussions about pipeline drain, snapshot and update [1,
>> 2] which may provide additional details of what is planned and could use
>> your feedback.
>>
>>
>>
>> 1:
>> https://docs.google.com/document/d/1NExwHlj-2q2WUGhSO4jTu8XGhDPmm3cllSN8IMmWci8
>>
>> 2:
>> https://docs.google.com/document/d/1UWhnYPgui0gUYOsuGcCjLuoOUlGA4QaY91n8p3wz9MY
>>
>>
>>
>> On Thu, Aug 1, 2019 at 3:44 PM Jose Delgado 
>> wrote:
>>
>> Hello All,
>>
>>
>>
>> I wondering if there is a way or pattern to save state on tear down ?
>>
>>
>>
>> In case of a failure or a pipeline we cannot update(due to significant
>> changes ) we would like to save the state and re-load it on the next
>> creation of the pipeline.
>>
>>
>>
>> Note: we are currently using Google Dataflow runner
>>
>>
>>
>> Regards,
>>
>> Jose
>>
>>


Re: [VOTE] Vendored Dependencies Release

2019-09-03 Thread Lukasz Cwik
+1

On Tue, Sep 3, 2019 at 1:22 PM Kenneth Knowles  wrote:

> +1
>
> On Tue, Sep 3, 2019 at 11:00 AM Ahmet Altay  wrote:
>
>> +1
>>
>> On Tue, Sep 3, 2019 at 10:52 AM Andrew Pilloud 
>> wrote:
>>
>>> +1
>>>
>>> Inspected the jar it looked reasonable.
>>>
>>> Andrew
>>>
>>> On Tue, Sep 3, 2019 at 9:06 AM Rui Wang  wrote:
>>>
 Friendly ping.


 -Rui

 On Thu, Aug 29, 2019 at 9:50 AM Rui Wang  wrote:

> Thanks Kai and Andrew. Now orgapachebeam-1083 is publicly exposed.
>
> I also found a useful link[1] to explain staging repos in Apache Nexus
>
>
> [1]:
> https://help.sonatype.com/repomanager2/staging-releases/managing-staging-repositories#ManagingStagingRepositories-ClosinganOpenRepository
>
> -Rui
>
> On Wed, Aug 28, 2019 at 9:19 PM Andrew Pilloud 
> wrote:
>
>> You need to close the release for it to be published to the staging
>> server. I can help if you still have questions.
>>
>> Andrew
>>
>> On Wed, Aug 28, 2019, 8:48 PM Rui Wang  wrote:
>>
>>> I can see orgapachebeam-1083 is in open status in the staging
>>> repository. I am not sure why it is not publicly exposed. I probably need
>>> some guidance on it.
>>>
>>>
>>> -Rui
>>>
>>> On Wed, Aug 28, 2019 at 3:50 PM Kai Jiang 
>>> wrote:
>>>
 Hi Rui,

 For accessing artifacts [1] in Maven Central Repository, is this
 intent to be not public exposed?

 Best,
 Kai

 [1]
 https://repository.apache.org/content/repositories/orgapachebeam-1083/

 On Wed, Aug 28, 2019 at 11:57 AM Kai Jiang 
 wrote:

> +1 (non-binding)Thanks Rui!
>
> On Tue, Aug 27, 2019 at 10:46 PM Rui Wang 
> wrote:
>
>> Please review the release of the following artifacts that we
>> vendor:
>>
>>  * beam-vendor-calcite-1_20_0
>>
>> Hi everyone,
>>
>> Please review and vote on the release candidate #1 for the
>> org.apache.beam:beam-vendor-calcite-1_20_0:0.1, as follows:
>>
>> [ ] +1, Approve the release
>>
>> [ ] -1, Do not approve the release (please provide specific
>> comments)
>>
>>
>> The complete staging area is available for your review, which
>> includes:
>>
>> * the official Apache source release to be deployed to
>> dist.apache.org [1], which is signed with the key with
>> fingerprint 0D7BE1A252DBCEE89F6491BBDFA64862B703F5C8 [2],
>>
>> * all artifacts to be deployed to the Maven Central Repository
>> [3],
>>
>> * commit hash "664e25019fc1977e7041e4b834e8d9628b912473" [4],
>>
>> The vote will be open for at least 72 hours. It is adopted by
>> majority approval, with at least 3 PMC affirmative votes.
>>
>> Thanks,
>>
>> Rui
>>
>> [1]
>> https://dist.apache.org/repos/dist/dev/beam/vendor/calcite/1_20_0
>>
>> [2] https://dist.apache.org/repos/dist/release/beam/KEYS
>>
>> [3]
>> https://repository.apache.org/content/repositories/orgapachebeam-1083/
>>
>> [4]
>> https://github.com/apache/beam/commit/664e25019fc1977e7041e4b834e8d9628b912473
>>
>>


Re: How to Implement a Runner for C++ Streaming Processing Engine?

2019-08-30 Thread Lukasz Cwik
Sorry, meant to say that some re-usable parts exist in Python and Go (not
C++).

On Fri, Aug 30, 2019 at 8:36 AM Lukasz Cwik  wrote:

> There is an ongoing portability effort which is attempting to enable any
> runner execute any Beam SDK written in any language.
>
> A good starting point would be to take a look at:
> https://beam.apache.org/roadmap/portability/
> http://s.apache.org/beam-runner-api
> http://s.apache.org/beam-fn-api
>
> The idea behind portability is that you can create a language agnostic
> representation of the pipeline in proto and use gRPC based services to
> launch, manage and execute that pipeline. Since Proto and gPRC both have
> C++ bindings, you should be able to integrate your execution engine but one
> issue you'll run into is that most of the code that allows one to run
> portable pipelines is written in Java and some re-usable parts exist in
> Python and C++.
>
> On Fri, Aug 30, 2019 at 7:40 AM 邵凯  wrote:
>
>> I have a streaming processing engine implemented in C++. Do you have any
>> good plans and suggestions to implement a runner?
>> I looked at the Beam Runner implementation guide and felt that I couldn't
>> implement a Runner for a C++ streaming data processing engine, because I
>> can't let the C++ streaming data processing engine execute a Beam pipeline
>> if I implement the Runner in Java.
>>
>


Re: [PROPOSAL] Preparing for Beam 2.16.0 release

2019-08-29 Thread Lukasz Cwik
+1

On Thu, Aug 29, 2019 at 7:42 PM Alan Myrvold  wrote:

> +1 Thanks for keeping to the schedule, Mark.
>
> On Thu, Aug 29, 2019 at 6:21 PM jincheng sun 
> wrote:
>
>> Hi Mark,
>>
>> +1 and thank you for keeping the cadence!
>>
>> BTW I have marked the Fix Version for some of the issues as 2.17, since
>> they cannot be merged into 2.16.
>>
>> Best,
>> Jincheng
>>
>> Mark Liu  于2019年8月29日周四 上午6:14写道:
>>
>>> Hi all,
>>>
>>> Beam 2.16 release branch cut is scheduled on Sep 11 according to the
>>> release calendar [1]. I would like to volunteer myself to do this
>>> release. The plan is to cut the branch on that date, and cherrypick
>>> release-blocking fixes afterwards if any.
>>>
>>> If you have release blocking issues for 2.16 please mark their "Fix
>>> Version" as 2.16.0 [2]. This tag is already created in JIRA in case you
>>> would like to move any non-blocking issues to that version.
>>>
>>> Any thoughts, comments, objections?
>>>
>>> Regards.
>>> Mark Liu
>>>
>>> [1]
>>> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
>>> [2]
>>> https://issues.apache.org/jira/browse/BEAM-8105?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Reopened%2C%20Open%2C%20%22In%20Progress%22%2C%20%22Under%20Discussion%22%2C%20%22In%20Implementation%22%2C%20%22Triage%20Needed%22)%20AND%20fixVersion%20%3D%202.16.0
>>>
>>


Re: Improve container support

2019-08-28 Thread Lukasz Cwik
Google locks down docs created with @google.com addresses. Hannah, please
recreate the doc using a non-@google.com address and share it with the
community. You'll want to replace the Google short link with an Apache short
link (s.apache.org).

On Wed, Aug 28, 2019 at 5:40 AM Gleb Kanterov  wrote:

> Google Doc doesn't seem to be shared with dev@. Can anybody double-check?
>
> On Wed, Aug 28, 2019 at 7:36 AM Hannah Jiang 
> wrote:
>
>> add dev@
>>
>> On Tue, Aug 27, 2019 at 9:29 PM Hannah Jiang 
>> wrote:
>>
>>> Thanks for commenting and discussions.
>>> I created a Google Doc for
>>> easy commenting and reviewing. From this moment, all changes will be
>>> updated in the Google Doc, and I will sync to the wiki after finalizing all plans.
>>>
>>> Thanks,
>>> Hannah
>>>
>>> On Tue, Aug 27, 2019 at 9:24 PM Ahmet Altay  wrote:
>>>
 Hi datapls-engprod,

 I have a question. Do you know what would it take to create a new gcp
 project similar to apache-beam-testing for purposes of distributing gcr
 packages? We can use the same billing account.

 Hannah, Robert, depending on the complexity of creating another gcp
 project we can go with that, or simply create a new bintray account. Either
 way would give us a clean new project to publish artifacts.

 Ahmet

 -- Forwarded message -
 From: Robert Bradshaw 
 Date: Tue, Aug 27, 2019 at 6:48 PM
 Subject: Re: Improve container support
 To: dev 


 On Tue, Aug 27, 2019 at 6:20 PM Ahmet Altay  wrote:
 >
 > On Tue, Aug 27, 2019 at 5:50 PM Robert Bradshaw 
 wrote:
 >>
 >> On Tue, Aug 27, 2019 at 3:35 PM Hannah Jiang 
 wrote:
 >> >
 >> > Hi team
 >> >
 >> > I am working on improving docker container support for Beam. We
 would like to publish prebuilt containers for each release version and
 daily snapshot. Current work focuses on release images only and it would be
 part of the release process.
 >>
 >> This would be great!
 >>
 >> > The release images will be pushed to GCR which is publicly
>> > accessible (pullable). We will use the following locations.
 >> > Repository: gcr.io/beam
 >> > Project: apache-beam-testing
 >>
 >> Given that these are release artifacts, we should use a project with
 >> more restricted access than "anyone who opens a PR on github."
 >
 >
 > We have two options:
 > -  gcr.io works based on the permissions of the GCS bucket that is
 backing it. GCS supports bucket-only permissions. These permissions need
 to be explicitly granted, and the service accounts used by Jenkins jobs do
 not have these explicit permissions today.
 > - we can create a new project in gcr, bintray or anything else that
 offers the same service.

 I think the cleanest is to simply have a new project whose membership
 consists of (interested) PMC members. If we have to populate this
 manually I think that'd still be OK as the churn is quite low.

>>>
>
> --
> Cheers,
> Gleb
>


Re: Write-through-cache in State logic

2019-08-27 Thread Lukasz Cwik
Open up a PR for the proto changes and we can work through any minor
comments there.

On Tue, Aug 27, 2019 at 11:33 AM Maximilian Michels  wrote:

> Thanks. Updated:
>
> message ProcessBundleRequest {
>   // (Required) A reference to the process bundle descriptor that must be
>   // instantiated and executed by the SDK harness.
>   string process_bundle_descriptor_reference = 1;
>
>   // A cache token which can be used by an SDK to check for the validity
>   // of cached elements which have a cache token associated.
>   message CacheToken {
>
>     // A flag to indicate a cache token is valid for user state.
>     message UserState {}
>
>     // A flag to indicate a cache token is valid for a side input.
>     message SideInput {
>       // The id of a side input.
>       string side_input = 1;
>     }
>
>     // The scope of a cache token.
>     oneof type {
>       UserState user_state = 1;
>       SideInput side_input = 2;
>     }
>
>     // The cache token identifier which should be globally unique.
>     bytes token = 10;
>   }
>
>   // (Optional) A list of cache tokens that can be used by an SDK to reuse
>   // cached data returned by the State API across multiple bundles.
>   repeated CacheToken cache_tokens = 2;
> }
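[Editor's note: to make the `oneof` scope in the CacheToken message above concrete, here is a plain-Python sketch of the token structure. This is a hypothetical mirror for illustration only; the real classes would be generated from the proto by protoc.]

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical plain-Python mirror of the CacheToken proto above.

@dataclass(frozen=True)
class UserState:
    """Marker: the token is valid for user state."""

@dataclass(frozen=True)
class SideInput:
    """Marker: the token is valid for one side input."""
    side_input: str  # the id of the side input

@dataclass(frozen=True)
class CacheToken:
    type: Union[UserState, SideInput]  # models the proto's oneof
    token: bytes

def side_input_tokens(tokens):
    """Index the side-input tokens of a bundle request by side input id."""
    return {t.type.side_input: t.token
            for t in tokens if isinstance(t.type, SideInput)}

# A bundle request would carry one user-state token plus one per side input.
tokens = [
    CacheToken(UserState(), b"user-1"),
    CacheToken(SideInput("view0"), b"side-7"),
]
assert side_input_tokens(tokens) == {"view0": b"side-7"}
```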
>
> On 27.08.19 19:22, Lukasz Cwik wrote:
>
> SideInputState -> SideInput (side_input_state -> side_input)
> + more comments around the messages and the fields.
>
>
> On Tue, Aug 27, 2019 at 10:18 AM Maximilian Michels 
> wrote:
>
>> We would have to differentiate cache tokens for user state and side
>> inputs. How about something like this?
>>
>> message ProcessBundleRequest {
>>   // (Required) A reference to the process bundle descriptor that must be
>>   // instantiated and executed by the SDK harness.
>>   string process_bundle_descriptor_reference = 1;
>>
>>   message CacheToken {
>>
>>     message UserState {
>>     }
>>
>>     message SideInputState {
>>       string side_input_id = 1;
>>     }
>>
>>     oneof type {
>>       UserState user_state = 1;
>>       SideInputState side_input_state = 2;
>>     }
>>
>>     bytes token = 10;
>>   }
>>
>>   // (Optional) A list of cache tokens that can be used by an SDK to reuse
>>   // cached data returned by the State API across multiple bundles.
>>   repeated CacheToken cache_tokens = 2;
>> }
>>
>> -Max
>>
>> On 27.08.19 18:43, Lukasz Cwik wrote:
>>
>> The bundles view of side inputs should never change during processing and
>> should have a point in time snapshot.
>>
>> I was just trying to say that the cache token for side inputs being
>> deferred till side input request time simplified the runners implementation
>> since that is conclusively when the runner would need to take a look at the
>> side input. Putting them as part of the ProcessBundleRequest complicates
>> that but does make the SDK implementation significantly simpler which is a
>> win.
>>
>> On Tue, Aug 27, 2019 at 9:14 AM Maximilian Michels 
>> wrote:
>>
>>> Thanks for the quick response.
>>>
>>> Just to clarify, the issue with versioning side input is also present
>>> when supplying the cache tokens on a request basis instead of per
>>> bundle. The SDK never knows when the Runner receives a new version of
>>> the side input. Like you pointed out, it needs to mark side inputs as
>>> stale and generate new cache tokens for the stale side inputs.
>>>
>>> The difference between per-request tokens and per-bundle tokens would be
>>> that the side input can only change after a bundle completes vs. during
>>> the bundle. Side inputs are always fuzzy in that regard because there is
>>> no precise instance where side inputs are atomically updated, other than
>>> the assumption that they eventually will be updated. In that regard
>>> per-bundle tokens for side input seem to be fine.
>>>
>>> All of the above is not an issue for user state, as its cache can remain
>>> valid for the lifetime of a Runner<=>SDK Harness connection. A simple
>>> solution would be to not cache side input because there are many cases
>>> where the caching just adds additional overhead. However, I can also
>>> imagine cases where side input is valid forever and caching would be
>>> very beneficial.
>>>
>>> For the first version I want to focus on user state because that's where
>>> I see the most benefit for caching. I don't see a problem though for the
>>> Runner to detect new side 

Re: Write-through-cache in State logic

2019-08-27 Thread Lukasz Cwik
SideInputState -> SideInput (side_input_state -> side_input)
+ more comments around the messages and the fields.


On Tue, Aug 27, 2019 at 10:18 AM Maximilian Michels  wrote:

> We would have to differentiate cache tokens for user state and side
> inputs. How about something like this?
>
> message ProcessBundleRequest {
>   // (Required) A reference to the process bundle descriptor that must be
>   // instantiated and executed by the SDK harness.
>   string process_bundle_descriptor_reference = 1;
>
>   message CacheToken {
>
>     message UserState {
>     }
>
>     message SideInputState {
>       string side_input_id = 1;
>     }
>
>     oneof type {
>       UserState user_state = 1;
>       SideInputState side_input_state = 2;
>     }
>
>     bytes token = 10;
>   }
>
>   // (Optional) A list of cache tokens that can be used by an SDK to reuse
>   // cached data returned by the State API across multiple bundles.
>   repeated CacheToken cache_tokens = 2;
> }
>
> -Max
>
> On 27.08.19 18:43, Lukasz Cwik wrote:
>
> The bundles view of side inputs should never change during processing and
> should have a point in time snapshot.
>
> I was just trying to say that the cache token for side inputs being
> deferred till side input request time simplified the runners implementation
> since that is conclusively when the runner would need to take a look at the
> side input. Putting them as part of the ProcessBundleRequest complicates
> that but does make the SDK implementation significantly simpler which is a
> win.
>
> On Tue, Aug 27, 2019 at 9:14 AM Maximilian Michels  wrote:
>
>> Thanks for the quick response.
>>
>> Just to clarify, the issue with versioning side input is also present
>> when supplying the cache tokens on a request basis instead of per
>> bundle. The SDK never knows when the Runner receives a new version of
>> the side input. Like you pointed out, it needs to mark side inputs as
>> stale and generate new cache tokens for the stale side inputs.
>>
>> The difference between per-request tokens and per-bundle tokens would be
>> that the side input can only change after a bundle completes vs. during
>> the bundle. Side inputs are always fuzzy in that regard because there is
>> no precise instance where side inputs are atomically updated, other than
>> the assumption that they eventually will be updated. In that regard
>> per-bundle tokens for side input seem to be fine.
>>
>> All of the above is not an issue for user state, as its cache can remain
>> valid for the lifetime of a Runner<=>SDK Harness connection. A simple
>> solution would be to not cache side input because there are many cases
>> where the caching just adds additional overhead. However, I can also
>> imagine cases where side input is valid forever and caching would be
>> very beneficial.
>>
>> For the first version I want to focus on user state because that's where
>> I see the most benefit for caching. I don't see a problem though for the
>> Runner to detect new side input and reflect that in the cache tokens
>> supplied for a new bundle.
>>
>> -Max
>>
>> On 26.08.19 22:27, Lukasz Cwik wrote:
>> > Your summary below makes sense to me. I can see that recovery from
>> > rolling back doesn't need to be a priority and simplifies the solution
>> > for user state caching down to one token.
>> >
>> > Providing cache tokens upfront does require the Runner to know what
>> > "version" of everything it may supply to the SDK upfront (instead of on
>> > request) which would mean that the Runner may need to have a mapping
>> > from cache token to internal version identifier for things like side
>> > inputs which are typically broadcast. The Runner would also need to
>> poll
>> > to see if the side input has changed in the background to not block
>> > processing bundles with "stale" side input data.
>> >
>> > Ping me once you have the Runner PR updated and I'll take a look again.
>> >
>> > On Mon, Aug 26, 2019 at 12:20 PM Maximilian Michels > > <mailto:m...@apache.org>> wrote:
>> >
>> > Thank you for the summary Luke. I really appreciate the effort you
>> put
>> > into this!
>> >
>> >  > Based upon your discussion you seem to want option #1
>> >
>> > I'm actually for option #2. The option to cache/invalidate side
>> inputs
>> > is important, and we should incorporate this in the design. That's
>> why
>> > option #1 is 

Re: Write-through-cache in State logic

2019-08-27 Thread Lukasz Cwik
The bundles view of side inputs should never change during processing and
should have a point in time snapshot.

I was just trying to say that the cache token for side inputs being
deferred till side input request time simplified the runners implementation
since that is conclusively when the runner would need to take a look at the
side input. Putting them as part of the ProcessBundleRequest complicates
that but does make the SDK implementation significantly simpler which is a
win.

On Tue, Aug 27, 2019 at 9:14 AM Maximilian Michels  wrote:

> Thanks for the quick response.
>
> Just to clarify, the issue with versioning side input is also present
> when supplying the cache tokens on a request basis instead of per
> bundle. The SDK never knows when the Runner receives a new version of
> the side input. Like you pointed out, it needs to mark side inputs as
> stale and generate new cache tokens for the stale side inputs.
>
> The difference between per-request tokens and per-bundle tokens would be
> that the side input can only change after a bundle completes vs. during
> the bundle. Side inputs are always fuzzy in that regard because there is
> no precise instance where side inputs are atomically updated, other than
> the assumption that they eventually will be updated. In that regard
> per-bundle tokens for side input seem to be fine.
>
> All of the above is not an issue for user state, as its cache can remain
> valid for the lifetime of a Runner<=>SDK Harness connection. A simple
> solution would be to not cache side input because there are many cases
> where the caching just adds additional overhead. However, I can also
> imagine cases where side input is valid forever and caching would be
> very beneficial.
>
> For the first version I want to focus on user state because that's where
> I see the most benefit for caching. I don't see a problem though for the
> Runner to detect new side input and reflect that in the cache tokens
> supplied for a new bundle.
>
> -Max
>
> On 26.08.19 22:27, Lukasz Cwik wrote:
> > Your summary below makes sense to me. I can see that recovery from
> > rolling back doesn't need to be a priority and simplifies the solution
> > for user state caching down to one token.
> >
> > Providing cache tokens upfront does require the Runner to know what
> > "version" of everything it may supply to the SDK upfront (instead of on
> > request) which would mean that the Runner may need to have a mapping
> > from cache token to internal version identifier for things like side
> > inputs which are typically broadcast. The Runner would also need to poll
> > to see if the side input has changed in the background to not block
> > processing bundles with "stale" side input data.
> >
> > Ping me once you have the Runner PR updated and I'll take a look again.
> >
> > On Mon, Aug 26, 2019 at 12:20 PM Maximilian Michels  > <mailto:m...@apache.org>> wrote:
> >
> > Thank you for the summary Luke. I really appreciate the effort you
> put
> > into this!
> >
> >  > Based upon your discussion you seem to want option #1
> >
> > I'm actually for option #2. The option to cache/invalidate side
> inputs
> > is important, and we should incorporate this in the design. That's
> why
> > option #1 is not flexible enough. However, a first implementation
> could
> > defer caching of side inputs.
> >
> > Option #3 was my initial thinking and the first version of the PR,
> but I
> > think we agreed that there wouldn't be much gain from keeping a cache
> > token per state id.
> >
> > Option #4 is what is specifically documented in the reference doc and
> > already part of the Proto, where valid tokens are provided for each
> new
> > bundle and also as part of the response of a get/put/clear. We
> mentioned
> > that the reply does not have to be waited on synchronously (I
> mentioned
> > it even), but it complicates the implementation. The idea Thomas and
> I
> > expressed was that a response is not even necessary if we assume
> > validity of the upfront provided cache tokens for the lifetime of a
> > bundle and that cache tokens will be invalidated as soon as the
> Runner
> > fails in any way. This is naturally the case for Flink because it
> will
> > simply "forget" its current cache tokens.
> >
> > I currently envision the following schema:
> >
> > Runner
> > ==
> >
> > - Runner generates a globally unique cache token, one for user state
> and
> > one for each sid

Re: Help triaging Jira issues

2019-08-27 Thread Lukasz Cwik
It still requests that we upgrade to Jira Pro.

For those who are interested, you can select "Project automation" from the
Beam Project settings page or use this direct link
https://issues.apache.org/jira/secure/AutomationProjectAdminAction!default.jspa?projectKey=BEAM

On Tue, Aug 27, 2019 at 5:20 AM Ismaël Mejía  wrote:

> Apache's JIRA was updated recently. Can Luke (or someone else) please
> help me check whether we still cannot create the rule to self-triage
> issues that are created with a contributor already assigned? This would
> easily cut the triage load in half because, given JIRA's UI, it is
> really easy to forget to self-triage the issues one creates.
>
> Also, a kind reminder to contributors and in particular current
> committers: when you create JIRAs, they are not automatically
> 'self-triaged', but in many cases they should be. Please triage an issue
> yourself if it is assigned on creation, or if you judge it complete
> enough but prefer to leave it unassigned so someone else can work on it.
> That will certainly reduce the triage work until this becomes automatic.
>
> On Wed, Jun 12, 2019 at 6:16 PM Lukasz Cwik  wrote:
> >
> > I looked at automating the two in JIRA but got the unhelpful:
> >
> > "You are using Automation Lite for Jira. This is the free offering of
> Automation for Jira Pro and only contains a small subset of the many
> awesome features of the paid app. For example, project admins like yourself
> can only create and edit automation rules in the paid offering."
> >
> > On Wed, Jun 12, 2019 at 2:22 AM Ismaël Mejía  wrote:
> >>
> >> Kenn can you or someone else with more JIRA-fu than me automatize both
> cases (I
> >> just triaged most of the still untriaged issues and found multiple new
> >> instances of
> >> both cases).
> >>
> >> On Fri, Jun 7, 2019 at 10:27 PM Kenneth Knowles 
> wrote:
> >> >
> >> > Nice. I noticed the huge drop in untriaged issues. Both of those
> ideas for automation sound reasonable.
> >> >
> >> > I think the other things that are harder to optimize can probably be
> addressed by re-triaging stale bugs. We will probably find those that
> should have been closed and those that are just sitting on an inactive
> contributor.
> >> >
> >> > Kenn
> >> >
> >> > On Fri, Jun 7, 2019 at 12:53 AM Ismaël Mejía 
> wrote:
> >> >>
> >> >> I took a look and reduced the untriaged issues to around 100. I
> >> >> noticed however some patterns that are producing more untriaged
> issues
> >> >> that we should have. Those can be probably automated (if JIRA has
> ways
> >> >> to do it):
> >> >>
> >> >> 1. Issues created and assigned on creation can be marked as open.
> >> >> 2. Once an issue has an associated PR it could be marked as open if
> it
> >> >> was in Triaged state.
> >> >>
> >> >> Other common case that is probably harder to automate are issues that
> >> >> are in Triaged state because we forgot to resolve/close them. I don’t
> >> >> know how we can improve these, apart of reminding people to look that
> >> >> they do not have untriaged assigned issues.
> >> >>
> >> >> Another interesting triage to do are the issues that are Open and
> >> >> assigned to members of the community that are not active anymore in
> >> >> the project, but that’s probably worth of another discussion, as well
> >> >> as how can we more effectively track open unassigned issues (which
> are
> >> >> currently around 1600).
> >> >>
> >> >> On Wed, Jun 5, 2019 at 7:03 PM Tanay Tummalapalli <
> ttanay...@gmail.com> wrote:
> >> >> >
> >> >> > Hi Kenneth,
> >> >> >
> >> >> > I already follow the issues@ mailing list pretty much daily.
> >> >> > I'd like to help with triaging issues, especially ones related to
> the Python SDK since I'm most familiar with it.
> >> >> >
> >> >> > On Wed, Jun 5, 2019 at 10:26 PM Alex Van Boxel 
> wrote:
> >> >> >>
> >> >> >> Hey Kenneth, I help out. I'm planning to contribute more on Beam
> and it seems to be ideal to keep up-to-date with the project.
> >> >> >>
> >> >> >>  _/
> >> >> >> _/ Alex Van Boxel
> >> >> >>
>

Re: Write-through-cache in State logic

2019-08-26 Thread Lukasz Cwik
Your summary below makes sense to me. I can see that recovery from rolling
back doesn't need to be a priority and simplifies the solution for user
state caching down to one token.

Providing cache tokens upfront does require the Runner to know what
"version" of everything it may supply to the SDK upfront (instead of on
request) which would mean that the Runner may need to have a mapping from
cache token to internal version identifier for things like side inputs
which are typically broadcast. The Runner would also need to poll to see if
the side input has changed in the background to not block processing
bundles with "stale" side input data.

Ping me once you have the Runner PR updated and I'll take a look again.

On Mon, Aug 26, 2019 at 12:20 PM Maximilian Michels  wrote:

> Thank you for the summary Luke. I really appreciate the effort you put
> into this!
>
> > Based upon your discussion you seem to want option #1
>
> I'm actually for option #2. The option to cache/invalidate side inputs
> is important, and we should incorporate this in the design. That's why
> option #1 is not flexible enough. However, a first implementation could
> defer caching of side inputs.
>
> Option #3 was my initial thinking and the first version of the PR, but I
> think we agreed that there wouldn't be much gain from keeping a cache
> token per state id.
>
> Option #4 is what is specifically documented in the reference doc and
> already part of the Proto, where valid tokens are provided for each new
> bundle and also as part of the response of a get/put/clear. We mentioned
> that the reply does not have to be waited on synchronously (I mentioned
> it even), but it complicates the implementation. The idea Thomas and I
> expressed was that a response is not even necessary if we assume
> validity of the upfront provided cache tokens for the lifetime of a
> bundle and that cache tokens will be invalidated as soon as the Runner
> fails in any way. This is naturally the case for Flink because it will
> simply "forget" its current cache tokens.
>
> I currently envision the following schema:
>
> Runner
> ==
>
> - Runner generates a globally unique cache token, one for user state and
> one for each side input

> - The token is supplied to the SDK Harness for each bundle request
> - For the lifetime of a Runner<=>SDK Harness connection this cache token
> will not change
> - Runner will generate a new token if the connection/key space changes
> between Runner and SDK Harness
>
> SDK
> ===
>
> - For each bundle the SDK worker stores the list of valid cache tokens
> - The SDK Harness keep a global cache across all its (local) workers
> which is a LRU cache: state_key => (cache_token, value)
> - get: Lookup cache using the valid cache token for the state. If no
> match, then fetch from Runner and use the already available token for
> caching
> - put: Put value in cache with a valid cache token, put value to pending
> writes which will be flushed out latest when the bundle ends
> - clear: same as put but clear cache
>
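[Editor's note: the get/put/clear scheme Max describes above can be sketched as a token-checked LRU cache in front of the State API. This is an illustrative sketch only, not Beam's actual implementation; the `backend` object and its `fetch`/`write`/`clear` methods are invented stand-ins for the Runner-side State API.]

```python
from collections import OrderedDict

class StateCache:
    """Sketch of the described SDK-harness cache:
    an LRU map of state_key -> (cache_token, value)."""

    def __init__(self, backend, max_entries=1000):
        self._backend = backend      # hypothetical State API client
        self._cache = OrderedDict()  # state_key -> (token, value)
        self._max = max_entries

    def get(self, state_key, valid_token):
        entry = self._cache.get(state_key)
        if entry is not None and entry[0] == valid_token:
            self._cache.move_to_end(state_key)  # LRU touch on hit
            return entry[1]
        # Miss (or stale token): fetch from the Runner and cache it
        # under the already-available valid token.
        value = self._backend.fetch(state_key)
        self._store(state_key, valid_token, value)
        return value

    def put(self, state_key, valid_token, value):
        self._store(state_key, valid_token, value)
        # Pending write; flushed out at the latest when the bundle ends.
        self._backend.write(state_key, value)

    def clear(self, state_key, valid_token):
        # Same as put, but the cached value is the cleared (empty) state.
        self._store(state_key, valid_token, None)
        self._backend.clear(state_key)

    def _store(self, key, token, value):
        self._cache[key] = (token, value)
        self._cache.move_to_end(key)
        if len(self._cache) > self._max:
            self._cache.popitem(last=False)  # evict least recently used
```

A new token supplied for a later bundle automatically invalidates stale entries, since the token comparison in `get` fails and the value is re-fetched.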
> It does look like this is not too far off from what you were describing.
> The main difference is that we just work with a single cache token. In
> my opinion we do not need the second cache token for writes, as long as
> we ensure that we generate a new cache token if the bundle/checkpoint
> fails.
>
> I have a draft PR
>   for the Runner: https://github.com/apache/beam/pull/9374
>   for the SDK: https://github.com/apache/beam/pull/9418
>
> Note that the Runner PR needs to be updated to fully reflect the above
> scheme. The SDK implementation is WIP. I want to make sure that we
> clarify the design before this gets finalized.
>
> Thanks again for all your comments. Much appreciated!
>
> Cheers,
> Max
>
> On 26.08.19 19:58, Lukasz Cwik wrote:
> > There were originally a couple of ideas around how caching could work:
> > 1) One cache token for the entire bundle that is supplied up front. The
> > SDK caches everything using the given token. All reads/clear/append for
> > all types of state happen under this token. Anytime a side input
> > changes, key processing partition range changes or a bundle fails to
> > process, the runner chooses a new cache token effectively invalidating
> > everything in the past>
> > 2) One cache token per type of state that is supplied up front.
> > The SDK caches all requests for a given type using the given cache
> > token. The runner can selectively choose which type to keep and which to
> > invalidate. Bundle failure and key processing partition changes
> > invalidate all user state, side input change invalidates all side inputs.
> >
> > 3) One cache token per 

Re: Write-through-cache in State logic

2019-08-26 Thread Lukasz Cwik
e:
> >>
> >>
> >> On Wed, Aug 21, 2019 at 2:16 AM Maximilian Michels  >> <mailto:m...@apache.org>> wrote:
> >>
> >> Appreciate all your comments! Replying below.
> >>
> >>
> >> @Luke:
> >>
> >> > Having cache tokens per key would be very expensive indeed and I
> >> believe we should go with a single cache token "per" bundle.
> >>
> >> Thanks for your comments on the PR. I was thinking to propose
> something
> >> along this lines of having cache tokens valid for a particular
> >> checkpointing "epoch". That would require even less token renewal
> than
> >> the per-bundle approach.
> >>
> >>
> >> @Thomas, thanks for the input. Some remarks:
> >>
> >> > Wouldn't it be simpler to have the runner just track a unique ID
> >> for each worker and use that to communicate if the cache is valid or
> >> not?
> >>
> >> We do not need a unique id per worker. If a cache token is valid
> for a
> >> particular worker, it is also valid for another worker. That is
> with the
> >> assumption that key ranges are always disjoint between the workers.
> >>
> >> > * When the bundle is started, the runner tells the worker if the
> >> cache has become invalid (since it knows if another worker has
> >> mutated state)
> >>
> >> This is simply done by not transferring the particular cache token.
> No
> >> need to declare it invalid explicitly.
> >>
> >> > * When the worker sends mutation requests to the runner, it
> >> includes its own ID (or the runner already has it as contextual
> >> information). No need to wait for a response.
> >>
> >> Mutations of cached values can be freely done as long as the cache
> token
> >> associated with the state is valid for a particular bundle. Only the
> >> first time, the Runner needs to wait on the response to store the
> cache
> >> token. This can also be done asynchronously.
> >>
> >> > * When the bundle is finished, the runner records the last writer
> >> (only if a change occurred)
> >>
> >> I believe this is not necessary because there will only be one
> writer at
> >> a time for a particular bundle and key range, hence only one writer
> >> holds a valid cache token for a particular state and key range.
> >>
> >>
> >> @Reuven:
> >>
> >> >  Dataflow divides the keyspace up into lexicographic ranges, and
> >> creates a cache token per range.
> >>
> >> State is always processed partitioned by the Flink workers
> (hash-based,
> >> not lexicopgrahical). I don't think that matters though because the
> key
> >> ranges do not overlap between the workers. Flink does not support
> >> dynamically repartitioning the key ranges. Even in case of
> fine-grained
> >> recovery of workers and their key ranges, we would simply generate
> new
> >> cache tokens for a particular worker.
> >>
> >>
> >> Dataflow's ranges are also hash based. When I said lexicographical, I
> >> meant lexicographical based on the hexadecimal hash value.
> >>
> >> Indeed the fact that Dataflow can dynamically split and merge these
> >> ranges is what makes it trickier. If Flink does not repartition the
> >> ranges, then things are much easier.
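[Editor's note: the per-range token scheme Reuven describes can be sketched as follows. Assumptions not in the original: hex-encoded SHA-1 key hashes and a fixed number of disjoint ranges; this is an illustration, not Dataflow's actual code.]

```python
import hashlib
from bisect import bisect_right

class RangeTokens:
    """Sketch: split the hex hash space into N disjoint lexicographic
    ranges and keep one cache token per range."""

    def __init__(self, n_ranges=4):
        # Upper bounds (first hex digit) of each range except the last,
        # e.g. for 4 ranges: ["4", "8", "c"].
        self._bounds = ["%x" % (16 * (i + 1) // n_ranges)
                        for i in range(n_ranges - 1)]
        self._tokens = [b"token-%d" % i for i in range(n_ranges)]

    def _range_of(self, key):
        h = hashlib.sha1(key.encode()).hexdigest()
        return bisect_right(self._bounds, h[0])

    def token_for(self, key):
        return self._tokens[self._range_of(key)]

    def invalidate(self, key):
        """A state change under `key` rotates only that range's token;
        cached entries in other ranges stay valid."""
        i = self._range_of(key)
        self._tokens[i] = self._tokens[i] + b"'"
```

Dynamically splitting or merging ranges (as Dataflow does) is what makes this harder: the runner then needs a mapping from tokens to its internal range versions, which is why Flink's static partitioning is the easier case.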
> >>
> >>
> >>
> >> Thanks,
> >> Max
> >>
> >> On 21.08.19 09:33, Reuven Lax wrote:
> >> > Dataflow does something like this, however since work is
> >> > load balanced across workers a per-worker id doesn't work very
> well.
> >> > Dataflow divides the keyspace up into lexicographic ranges, and
> >> creates
> >> > a cache token per range.
> >> >
> >> > On Tue, Aug 20, 2019 at 8:35 PM Thomas Weise  >> <mailto:t...@apache.org>
> >> > <mailto:t...@apache.org <mailto:t...@apache.org>>> wrote:
> >> >
> >> > Commenting here vs. on the PR since related to the overall
> >> approach.
> >> >
> >> > Wouldn't it be simpler to have the runner just track a unique
> >> ID for
> >> >

Re: [Discuss] Propose Calcite Vendor Release

2019-08-22 Thread Lukasz Cwik
+1 for release

On Thu, Aug 22, 2019 at 8:20 AM Kenneth Knowles  wrote:

> +1 to doing this release. There is no risk since nothing will use the 0.1
> version and if it has problems we just make 0.2, etc, etc.
>
> And big thanks to Rui for volunteering.
>
> On Wed, Aug 21, 2019 at 11:11 PM Kai Jiang  wrote:
>
>> Thanks Rui! For sure, any objections should be resolved before releasing.
>>
>> On Wed, Aug 21, 2019 at 10:24 PM Rui Wang  wrote:
>>
>>> I can be the release manager to help release vendor calcite. Per [1],
>>> before we start a release, we have to reach consensus before starting a
>>> release.
>>>
>>>
>>> [1]: https://s.apache.org/beam-release-vendored-artifacts
>>>
>>> -Rui
>>>
>>> On Wed, Aug 21, 2019 at 5:00 PM Kai Jiang  wrote:
>>>
 Hi Community,

 As part of the effort to unblock vendoring Calcite in the SQL module, we
 broke it into pull/9333  so that the vendored dependencies can go
 through the release process separately.

 I want to propose Calcite vendor release and look for a release manager
 to help with the release process.

 Best,
 Kai

>>>


Re: [VOTE] Release 2.15.0, release candidate #2

2019-08-21 Thread Lukasz Cwik
+1 (binding)

I validated the signatures against the key dist/release/KEYS and hashes of
the source distributions and release artifacts.
I also ran some of the quickstarts for Java.

On Tue, Aug 20, 2019 at 3:59 PM Pablo Estrada  wrote:

> +1
>
> I've installed from the source in  apache/dist.
> I've run unit tests in Python 3.6, and wordcount in Python 3.6 in Direct
> and Dataflow runners.
>
> Thanks!
> -P.
>
> On Tue, Aug 20, 2019 at 11:41 AM Hannah Jiang 
> wrote:
>
>> Yes, I agree this is a separate topic and shouldn't block 2.15 release.
>> There is already a JIRA ticket, I will update it with more details.
>>
>> On Tue, Aug 20, 2019 at 11:32 AM Ahmet Altay  wrote:
>>
>>>
>>>
>>> On Tue, Aug 20, 2019 at 10:18 AM Yifan Zou  wrote:
>>>
 Hi all,

 This is a friendly reminder. Please help to review, verify and vote on
 the release candidate #2 for the version 2.15.0.
 [ ] +1, Approve the release
 [ ] -1, Do not approve the release (please provide specific comments)

 I've verified Java quickstart & mobile games, and Python (both tar and
 wheel) quickstart with Py27, 35, 36, 37. They worked well.

 https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit?pli=1#gid=1036192804

 Thanks.
 Yifan



 On Tue, Aug 20, 2019 at 9:33 AM Hannah Jiang 
 wrote:

> A side note about this test:
> Now we only have py2 and py35, so it only fails with py35. I am
> introducing minor versions, which will add py36 and py37, and all py3
> versions are flaky.
> It's really difficult to pass Portable Precommit with minor versions,
> the chance of passing the test is around 15%.
>

>>> Hannah, let's separate this from the release thread. Is there a JIRA for
>>> this, could you update it? And perhaps we need different pre commits for
>>> different versions so that flakes do not stack up. Even if a suite is >90%
>>> reliable, if we stack up 4 versions, the reliability will get much
>>> lower.
>>>
>>>

> On Mon, Aug 19, 2019 at 5:27 PM Ahmet Altay  wrote:
>
>> Thank you. Unless there are any objections, let's continue with
>> validating RC2.
>>
>> On Mon, Aug 19, 2019 at 5:21 PM Kyle Weaver 
>> wrote:
>>
>>> I'm not sure if it's worth blocking the release, since I can't
>>> reproduce the issue on my machine and a fix would be hard to verify.
>>>
>>> Kyle Weaver | Software Engineer | github.com/ibzib |
>>> kcwea...@google.com
>>>
>>>
>>> On Mon, Aug 19, 2019 at 4:56 PM Ahmet Altay 
>>> wrote:
>>>
 Kyle, are you currently working on this to decide whether it is the
 blocking case or not? Also is this affecting both release branch and 
 master
 branch?

 On Mon, Aug 19, 2019 at 4:49 PM Kyle Weaver 
 wrote:

> Re BEAM-7993: For some context, there are two possible causes
> here. The pessimistic take is that Dockerized SDK workers are taking
> forever to start. The optimistic take is that the Docker containers
> are just taking longer than normal (but not forever) to start on
> Jenkins, in
> which
> case this issue is nothing new to this release. (The halting 
> problem!) If
> it's the second, it's safe to wait to fix it in the next release.
>
> Kyle Weaver | Software Engineer | github.com/ibzib |
> kcwea...@google.com
>
>
> On Mon, Aug 19, 2019 at 4:42 PM Yifan Zou 
> wrote:
>
>> Mark and Kyle found a py35 portable test which is flaky:
>> https://issues.apache.org/jira/browse/BEAM-7993.
>> I plan to finalize the release this week. Would that be a
>> blocker? Could we include the fix in 2.16?
>>
>> Thanks.
>>
>>
>> On Mon, Aug 19, 2019 at 4:38 PM Yifan Zou 
>> wrote:
>>
>>> I've run most of validations and they're all good.
>>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit?pli=1#gid=1036192804
>>>
>>> On Mon, Aug 19, 2019 at 10:59 AM Hannah Jiang <
>>> hannahji...@google.com> wrote:
>>>
 (resending it to dev@)
 +1, I tested some test cases as well as customized test cases
 and all looks good. I updated validation sheet.

 On Mon, Aug 19, 2019 at 10:40 AM Hannah Jiang <
 hannahji...@google.com> wrote:

> +1, I tested some test cases as well as customized test cases
> and all looks good. I updated validation sheet.
>
> On Mon, Aug 19, 2019 at 10:28 AM Ahmet Altay 
> wrote:
>
>> Hi all,
>>
>> Please help with validation and voting on the RC2. Let's help
>> Yifan to 

Re: (mini-doc) Beam (Flink) portable job templates

2019-08-20 Thread Lukasz Cwik
On Mon, Aug 19, 2019 at 5:52 PM Ahmet Altay  wrote:

>
>
> On Sun, Aug 18, 2019 at 12:34 PM Thomas Weise  wrote:
>
>> There is a PR open for this: https://github.com/apache/beam/pull/9331
>>
>> (it wasn't tagged with the JIRA and therefore not linked)
>>
>> I think it is worthwhile to explore how we could further detangle the
>> client side Python and Java dependencies.
>>
>> The expansion service is one more dependency to consider in a build
>> environment. Is it really necessary to expand external transforms prior to
>> submission to the job service?
>>
>
> +1, this will make it easier to use external transforms from the already
> familiar client environments.
>
>

The intent is to make it so that you CAN (not MUST) run an expansion
service separate from a Runner. Creating a single endpoint that hosts both
the Job and Expansion service is something that gRPC does very easily since
you can host multiple service definitions on a single port.
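[Editor's note: the reason gRPC makes this easy is that each registered service contributes its own set of fully-qualified method paths, and the server dispatches on that path, so a Job service and an Expansion service can share one port. A toy model of that dispatch, with no gRPC dependency and invented service names for illustration:]

```python
class ToyServer:
    """Toy model of how a gRPC server multiplexes several service
    definitions on one port: a single handler table keyed by
    '/package.Service/Method'. Illustration only, not real gRPC."""

    def __init__(self):
        self._handlers = {}

    def add_service(self, service_name, methods):
        # Each service registers its methods under its own path prefix,
        # so multiple services coexist without colliding.
        for method_name, fn in methods.items():
            self._handlers["/%s/%s" % (service_name, method_name)] = fn

    def call(self, path, request):
        return self._handlers[path](request)

server = ToyServer()
server.add_service("beam.JobService",
                   {"Run": lambda req: "running %s" % req})
server.add_service("beam.ExpansionService",
                   {"Expand": lambda req: "expanded %s" % req})

assert server.call("/beam.JobService/Run", "p") == "running p"
assert server.call("/beam.ExpansionService/Expand", "t") == "expanded t"
```

In real gRPC this is just calling `add_..._servicer_to_server` once per service on the same `grpc.Server` before binding a single port.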


>
>> Can we come up with a partially constructed proto that can be produced by
>> just running the Python entry point? Note this would also require pushing
>> the pipeline options parsing into the job service.
>>
>
> Why would this require pushing the pipeline options parsing to the job
> service? Assuming Python has enough of an idea about what options the
> external transform will need, the necessary bits could be converted
> to arguments and be part of that partially constructed proto.
>
>
>>
>> On Sun, Aug 18, 2019 at 12:01 PM enrico canzonieri 
>> wrote:
>>
>>> I found the tracking ticket at BEAM-7966
>>> 
>>>
>>> On Sun, Aug 18, 2019 at 11:59 AM enrico canzonieri <
>>> ecanzoni...@gmail.com> wrote:
>>>
 Is this alternative still being considered? Creating a portable jar
 sounds like a good solution to re-use the existing runner specific
 deployment mechanism (e.g. Flink k8s operator) and in general simplify the
 deployment story.

 On Fri, Aug 9, 2019 at 12:46 AM Robert Bradshaw 
 wrote:

> The expansion service is a separate service. (The flink jar happens to
> bring both up.) However, there is negotiation to receive/validate the
> pipeline options.
>
> On Fri, Aug 9, 2019 at 1:54 AM Thomas Weise  wrote:
> >
> > We would also need to consider cross-language pipelines that
> (currently) assume the interaction with an expansion service at
> construction time.
> >
> > On Thu, Aug 8, 2019, 4:38 PM Kyle Weaver 
> wrote:
> >>
> >> > It might also be useful to have the option to just output the
> proto and artifacts, as alternative to the jar file.
> >>
> >> Sure, that wouldn't be too big a change if we were to decide to go
> the SDK route.
> >>
> >> > For the Flink entry point we would need to allow for the job
> server to be used as a library.
> >>
> >> We don't need the whole job server, we only need to add a main
> method to FlinkPipelineRunner [1] as the entry point, which would 
> basically
> just do the setup described in the doc then call FlinkPipelineRunner::run.
> >>
> >> [1]
> https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineRunner.java#L53
> >>
> >> Kyle Weaver | Software Engineer | github.com/ibzib |
> kcwea...@google.com
> >>
> >>
> >> On Thu, Aug 8, 2019 at 4:21 PM Thomas Weise  wrote:
> >>>
> >>> Hi Kyle,
> >>>
> >>> It might also be useful to have the option to just output the
> proto and artifacts, as alternative to the jar file.
> >>>
> >>> For the Flink entry point we would need to allow for the job
> server to be used as a library. It would probably not be too hard to have
> the Flink job constructed via the context execution environment, which
> would require no changes on the Flink side.
> >>>
> >>> Thanks,
> >>> Thomas
> >>>
> >>>
> >>> On Thu, Aug 8, 2019 at 9:52 AM Kyle Weaver 
> wrote:
> 
>  Re Javaless/serverless solution:
>  I take it this would probably mean that we would construct the
> jar directly from the SDK. There are advantages to this: full separation 
> of
> Python and Java environments, no need for a job server, and likely a
> simpler implementation, since we'd no longer have to work within the
> constraints of the existing job server infrastructure. The only downside I
> can think of is the additional cost of implementing/maintaining jar
> creation code in each SDK, but that cost may be acceptable if it's simple
> enough.
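The jar-from-the-SDK idea above can be sketched with nothing more than the zip layer a jar is built on. The layout below (the `BEAM-PIPELINE/` and `BEAM-ARTIFACTS/` paths, the manifest main class) is an assumption for illustration only, not an agreed format:

```python
import io
import zipfile

def build_portable_jar(pipeline_proto_bytes, artifacts, main_class):
    """Sketch of SDK-side jar creation: a jar is a zip with a manifest,
    so the SDK can embed the serialized pipeline proto and staged
    artifacts directly. Paths and manifest entries are hypothetical."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as jar:
        # Entry point that a runner-specific main (e.g. a Flink entry
        # point) would provide.
        jar.writestr("META-INF/MANIFEST.MF",
                     "Manifest-Version: 1.0\nMain-Class: %s\n" % main_class)
        jar.writestr("BEAM-PIPELINE/pipeline.pb", pipeline_proto_bytes)
        for name, payload in artifacts.items():
            jar.writestr("BEAM-ARTIFACTS/%s" % name, payload)
    return buf.getvalue()
```

The maintenance cost mentioned above would then be roughly this much code per SDK, plus keeping the layout in sync with whatever the entry point expects.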
> 
>  Kyle Weaver | Software Engineer | github.com/ibzib |
> kcwea...@google.com
> 
> 
>  On Thu, Aug 8, 2019 at 9:31 AM Thomas Weise 
> wrote:
> >
> >
> >
> > On Thu, Aug 8, 2019 at 8:29 AM Robert Bradshaw <
> 

Re: Java serialization for coders and compatibility

2019-08-13 Thread Lukasz Cwik
Coders such as AvroCoder are translated to an intermediate JSON form called
a CloudObject[1].
Dataflow only uses the serialized Java representation (embedded as bytes in
"base64" within the CloudObject) for coders which extend
SerializableCoder[2].
Dataflow only cares that these CloudObject representations didn't change.

For Beam runners in general today, they could rely on the proto version of
the coder (which is meant to replace the CloudObject that Dataflow uses
eventually). Relying on not breaking Java serialization makes changing
coders too strict.

Eventually we could be using schemas for most things and then the
representation will be completely definable and separate from the
implementation.

There might be more details in the snapshotting and update design doc[3]
but I believe it was pretty light on how to deal with these kinds of
changes beyond that we know it's a problem.

1:
https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/util/AvroCoderCloudObjectTranslator.java
2:
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/SerializableCoder.java
3:
https://docs.google.com/document/d/1UWhnYPgui0gUYOsuGcCjLuoOUlGA4QaY91n8p3wz9MY



On Tue, Aug 13, 2019 at 1:13 PM Gleb Kanterov  wrote:

> I'm looking into the code of AvroCoder, and I was wondering what happens
> when users upgrade Beam for streaming pipelines?
>
> As I understand it, we should be able to deserialize coder from previous
> Beam version. Looking into guava vendoring, it's going to break
> serialization when we switch the guava version, because the current
> version is part of the namespace:
>
> import
> org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Supplier;
> import
> org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Suppliers;
>
> We don't have tests for it, but probably we already broke compatibility
> when we vendored guava. Can anybody clarify what would be the approach for
> coders?
>


Re: [VOTE] Support ZetaSQL as another SQL dialect for BeamSQL in Beam repo

2019-08-13 Thread Lukasz Cwik
+1

On Tue, Aug 13, 2019 at 9:09 AM Andrew Pilloud  wrote:

> +1
> I also hope this can move to Calcite.
>
> On Tue, Aug 13, 2019 at 2:40 AM Gleb Kanterov  wrote:
>
>> +1
>>
>> On Tue, Aug 13, 2019 at 10:47 AM Ismaël Mejía  wrote:
>>
>>> +1
>>> Wishing that this goes to calcite too someday (hoping that it makes
>>> Beam side maintenance simpler)
>>>
>>> On Tue, Aug 13, 2019 at 6:18 AM Manu Zhang 
>>> wrote:
>>> >
>>> > +1
>>> >
>>> > On Tue, Aug 13, 2019 at 11:55 AM Mingmin Xu 
>>> wrote:
>>> >>
>>> >> +1
>>> >>
>>> >> On Mon, Aug 12, 2019 at 8:53 PM Ryan McDowell <
>>> ryanmcdow...@google.com> wrote:
>>> >>>
>>> >>> +1
>>> >>>
>>> >>> On Mon, Aug 12, 2019 at 8:30 PM Reza Rokni  wrote:
>>> 
>>>  +1
>>> 
>>>  On Tue, 13 Aug 2019 at 09:28, Ahmet Altay  wrote:
>>> >
>>> > +1
>>> >
>>> > On Mon, Aug 12, 2019 at 6:27 PM Kenneth Knowles 
>>> wrote:
>>> >>
>>> >> +1
>>> >>
>>> >> On Mon, Aug 12, 2019 at 4:43 PM Rui Wang 
>>> wrote:
>>> >>>
>>> >>> Hi Community,
>>> >>>
>>> >>> I am using this separate thread to collect votes on contributing
>>> Beam ZetaSQL (my way to say ZetaSQL as a dialect supported by BeamSQL) to
>>> Beam repo.
>>> >>>
>>> >>> There are discussions related to benefits, technical design and
>>> others on Beam ZetaSQL in [1]. The Beam ZetaSQL code lives in [2]. Note
>>> that this vote is not about merging the PR, which should be decided by code
>>> review. This vote is only to vote if Beam ZetaSQL should live in Beam repo.
>>> >>>
>>> >>> +1: Beam repo can host Beam ZetaSQL
>>> >>> -1: Beam repo should not host Beam ZetaSQL
>>> >>>
>>> >>> If there are more questions related to Beam ZetaSQL, please
>>> discuss it in [1].
>>> >>>
>>> >>> [1]:
>>> https://lists.apache.org/thread.html/eab70bb99218aaedfd506e979967379c0efa05ea56a792a1486f9c74@%3Cdev.beam.apache.org%3E
>>> >>> [2]: https://github.com/apache/beam/pull/9210
>>> >>>
>>> >>> -Rui
>>> 
>>> 
>>> 
>>>  --
>>> 
>>>  This email may be confidential and privileged. If you received this
>>> communication by mistake, please don't forward it to anyone else, please
>>> erase all copies and attachments, and please let me know that it has gone
>>> to the wrong person.
>>> 
>>>  The above terms reflect a potential business arrangement, are
>>> provided solely as a basis for further discussion, and are not intended to
>>> be and do not constitute a legally binding obligation. No legally binding
>>> obligations will be created, implied, or inferred until an agreement in
>>> final form is executed in writing by all parties involved.
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> 
>>> >> Mingmin
>>>
>>
>>
>> --
>> Cheers,
>> Gleb
>>
>


Re: Write-through-cache in State logic

2019-08-13 Thread Lukasz Cwik
On Tue, Aug 13, 2019 at 4:36 AM Maximilian Michels  wrote:

> Agree that we have to be able to flush before a checkpoint to avoid
> caching too many elements. Also good point about checkpoint costs
> increasing with flushing the cache on checkpoints. A LRU cache policy in
> the SDK seems desirable.
>
> What is the role of the cache token in the design document[1]? It looks
> to me that the token is used to give the Runner control over which and
> how many elements can be cached by the SDK. Why is that necessary?
> Shouldn't this be up to the SDK?
>

We want to be able to handle the case where the SDK completes the bundle
successfully but the runner fails to checkpoint the information.
We also want the runner to be able to pass in cache tokens for things like
side inputs which may change over time (and the SDK would not know that
this happened).
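The role the cache token plays here can be sketched as a token-scoped cache on the SDK side: because the runner hands out the token, changing it (e.g. when a side input is recomputed) naturally invalidates stale entries without the SDK needing to know why. All names below are illustrative, not the Beam Fn API:

```python
class TokenScopedCache:
    """Hypothetical SDK-side cache keyed by a runner-provided token."""

    def __init__(self):
        self._cache = {}  # (token, state_key) -> value

    def get(self, token, state_key, fetch_fn):
        # Cache hit only if the runner's token still matches; otherwise
        # fall back to a state-request round trip to the runner.
        key = (token, state_key)
        if key not in self._cache:
            self._cache[key] = fetch_fn()
        return self._cache[key]

    def retain_only(self, valid_tokens):
        # Drop entries whose token the runner no longer advertises, e.g.
        # after a failed checkpoint or a side-input update.
        self._cache = {k: v for k, v in self._cache.items()
                       if k[0] in valid_tokens}
```

This also covers the failed-checkpoint case above: the runner simply re-issues the old token only if the checkpoint succeeded.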


> -Max
>
> [1]
>
> https://docs.google.com/document/d/1BOozW0bzBuz4oHJEuZNDOHdzaV5Y56ix58Ozrqm2jFg/edit#heading=h.7ghoih5aig5m
>
> On 12.08.19 19:55, Lukasz Cwik wrote:
> >
> >
> > On Mon, Aug 12, 2019 at 10:09 AM Thomas Weise  wrote:
> >
> >
> > On Mon, Aug 12, 2019 at 8:53 AM Maximilian Michels  wrote:
> >
> > Thanks for starting this discussion Rakesh. An efficient cache
> > layer is
> > one of the missing pieces for good performance in stateful
> > pipelines.
> > The good news is that there is a level of caching already
> > present in
> > Python which batches append requests until the bundle is
> finished.
> >
> > Thomas, in your example indeed we would have to profile to see
> > why CPU
> > utilization is high on the Flink side but not in the Python SDK
> > harness.
> > For example, older versions of Flink (<=1.5) have a high cost of
> > deleting existing instances of a timer when setting a timer.
> > Nevertheless, cross-bundle caching would likely result in
> increased
> > performance.
> >
> >
> > CPU on the Flink side was unchanged, and that's important. The
> > throughput improvement comes from the extended bundle caching on the
> > SDK side. That's what tells me that cross-bundle caching is needed.
> > Of course, it will require a good solution for the write also and I
> > like your idea of using the checkpoint boundary for that, especially
> > since that already aligns with the bundle boundary and is under
> > runner control. Of course we also want to be careful to not cause
> > overly bursty writes.
> >
> > Profiling will be useful for the timer processing, that is also on
> > my list of suspects.
> >
> >
> > Luke, I think the idea to merge pending state requests could be
> > complementary to caching across bundles.
> >
> > Question: Couldn't we defer flushing back state from the SDK to
> the
> > Runner indefinitely, provided that we add a way to flush the
> > state in
> > case of a checkpoint?
> >
> >
> > Flushing is needed to prevent the SDK from running out of memory. Having
> > a fixed budget for state inside the SDK would have flushing happen under
> > certain state usage scenarios.
> > I could also see that only flushing at checkpoint may lead to slow
> > checkpoint performance so we may want to flush state that hasn't been
> > used in a while as well.
> >
> >
> > Another performance improvement would be caching read requests
> > because
> > these first go to the Runner regardless of already cached
> appends.
> >
> > -Max
> >
> > On 09.08.19 17:12, Lukasz Cwik wrote:
> > >
> > >
> > > On Fri, Aug 9, 2019 at 2:32 AM Robert Bradshaw  wrote:
> > >
> > > The question is whether the SDK needs to wait for the
> > StateResponse to
> > > come back before declaring the bundle done. The proposal
> > was to not
> > > send the cache token back as part of an append
> > StateResponse [1], but
> > > pre-provide it as part of the bundle request.
> > >
> > >
> > > Agree, the purpose of the I'm

Re: Write-through-cache in State logic

2019-08-12 Thread Lukasz Cwik
On Mon, Aug 12, 2019 at 10:09 AM Thomas Weise  wrote:

>
> On Mon, Aug 12, 2019 at 8:53 AM Maximilian Michels  wrote:
>
>> Thanks for starting this discussion Rakesh. An efficient cache layer is
>> one of the missing pieces for good performance in stateful pipelines.
>> The good news is that there is a level of caching already present in
>> Python which batches append requests until the bundle is finished.
>>
>> Thomas, in your example indeed we would have to profile to see why CPU
>> utilization is high on the Flink side but not in the Python SDK harness.
>> For example, older versions of Flink (<=1.5) have a high cost of
>> deleting existing instances of a timer when setting a timer.
>> Nevertheless, cross-bundle caching would likely result in increased
>> performance.
>>
>
> CPU on the Flink side was unchanged, and that's important. The throughput
> improvement comes from the extended bundle caching on the SDK side. That's
> what tells me that cross-bundle caching is needed. Of course, it will
> require a good solution for the write also and I like your idea of using
> the checkpoint boundary for that, especially since that already aligns with
> the bundle boundary and is under runner control. Of course we also want to
> be careful to not cause overly bursty writes.
>
> Profiling will be useful for the timer processing, that is also on my list
> of suspects.
>
>
>> Luke, I think the idea to merge pending state requests could be
>> complementary to caching across bundles.
>>
>> Question: Couldn't we defer flushing back state from the SDK to the
>> Runner indefinitely, provided that we add a way to flush the state in
>> case of a checkpoint?
>>
>
Flushing is needed to prevent the SDK from running out of memory. Having a
fixed budget for state inside the SDK would have flushing happen under
certain state usage scenarios.
I could also see that only flushing at checkpoint may lead to slow
checkpoint performance so we may want to flush state that hasn't been used
in a while as well.
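The flush-on-eviction and flush-on-checkpoint behavior described above could look roughly like this. This is a minimal sketch with illustrative names, assuming a fixed entry budget (a real implementation would likely budget by bytes):

```python
from collections import OrderedDict

class StateWriteCache:
    """Sketch of SDK-side state caching with a fixed budget: dirty
    entries are flushed to the runner when evicted (LRU order) or at a
    checkpoint barrier. Not Beam's actual API."""

    def __init__(self, budget, flush_fn):
        self._budget = budget
        self._flush_fn = flush_fn      # sends the pending write to the runner
        self._entries = OrderedDict()  # key -> (value, dirty)

    def put(self, key, value):
        self._entries[key] = (value, True)
        self._entries.move_to_end(key)
        # Evict least-recently-used entries once over budget, flushing
        # any pending writes so no state is lost.
        while len(self._entries) > self._budget:
            old_key, (old_val, dirty) = self._entries.popitem(last=False)
            if dirty:
                self._flush_fn(old_key, old_val)

    def checkpoint(self):
        # Flush everything dirty so the runner can snapshot consistently;
        # entries stay cached (now clean) for reads after the barrier.
        for key, (value, dirty) in self._entries.items():
            if dirty:
                self._flush_fn(key, value)
                self._entries[key] = (value, False)
```

Flushing in LRU order on eviction is what keeps checkpoints from becoming one large bursty write, per the concern above.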


> Another performance improvement would be caching read requests because
>> these first go to the Runner regardless of already cached appends.
>>
>> -Max
>>
>> On 09.08.19 17:12, Lukasz Cwik wrote:
>> >
>> >
>> > On Fri, Aug 9, 2019 at 2:32 AM Robert Bradshaw > > <mailto:rober...@google.com>> wrote:
>> >
>> > The question is whether the SDK needs to wait for the StateResponse
>> to
>> > come back before declaring the bundle done. The proposal was to not
>> > send the cache token back as part of an append StateResponse [1],
>> but
>> > pre-provide it as part of the bundle request.
>> >
>> >
>> > Agree, the purpose of the I'm Blocked message is to occur during bundle
>> > processing.
>> >
>> >
>> > Thinking about this some more, if we assume the state response was
>> > successfully applied, there's no reason for the SDK to block the
>> > bundle until it has its hands on the cache token--we can update the
>> > cache once the StateResponse comes back whether or not the bundle is
>> > still active. On the other hand, the runner needs a way to assert it
>> > has received and processed all StateRequests from the SDK associated
>> > with a bundle before it can declare the bundle complete (regardless
>> of
>> > the cache tokens), so this might not be safe without some extra
>> > coordination (e.g. the ProcessBundleResponse indicating the number
>> of
>> > state requests associated with a bundle).
>> >
>> >
>> > Since the state request stream is ordered, we can add the id of the last
>> > state request as part of the ProcessBundleResponse.
>> >
>> >
>> > [1]
>> >
>> https://github.com/apache/beam/blob/release-2.14.0/model/fn-execution/src/main/proto/beam_fn_api.proto#L627
>> >
>> > On Thu, Aug 8, 2019 at 6:57 PM Lukasz Cwik  wrote:
>> > >
>> > > The purpose of the new state API call in BEAM-7000 is to tell the
>> > runner that the SDK is now blocked waiting for the result of a
>> > specific state request and it should be used for fetches (not
>> > updates) and is there to allow for SDKs to differentiate readLater
>> > (I will need this data at some point in time in the future) from
>> > read (I need this data now). This comes up commonly where the user
>> > prefetch

Re: Docker Run Options in SDK Container

2019-08-09 Thread Lukasz Cwik
On Fri, Aug 2, 2019 at 11:00 AM Chad Dombrova  wrote:

> Hi all,
> I’m a bit confused about the desire to use json for the environment_config.
>
Note that complex PipelineOptions are already expected to be in JSON
format[1, 2]. This has solved many string parsing and ambiguity issues.

> It’s harder to use json on the command line, such that now we’re talking
> about the value being *either* a docker image name *or* a path to a json
> file (OR maybe yaml too!), which is not only less convenient than just
> typing the docker ags you want, it's also IMHO a dirty/inconsistent design.
>
> The idea of having each key of the json config map to a docker flag seems
> like a maintenance quagmire with little benefit.  In that case, Beam devs
> would have to maintain parity with docker options and be the arbiters of
> what's "safe" and what's not, and users would have to read additional beam
> documentation (which may or may not exist) to discover what keys are valid,
> rather than simply doing what they know, and passing the docker args.  As
> Sam points out, if security is a concern there are plenty of ways to abuse
> the system already. Security should be handled at the infrastructure
> deployment level, where it’s actually meaningful.
>
Wouldn't supporting every possible docker option also be a backwards
compatibility and portability quagmire?
Users could say that option X worked with Beam Y but no longer with Y+1 or
with runner A since it used docker Z+1 but not with runner B because it
uses docker Z.

Both options have tradeoffs and the important part is whether the
convenience of specifying all options available to users via docker run
outweigh the drawbacks.

> It also seems like there’s already a precedent for encoding environment
> configuration as command line args. Consider the SUBPROCESS_SDK environment:
>
> options = PipelineOptions()
> options.view_as(PortableOptions).environment_type = \
> python_urns.SUBPROCESS_SDK
> options.view_as(PortableOptions).environment_config = \
> b'/usr/bin/python -m apache_beam.runners.worker.sdk_worker_main'
>
> This could be encoded as json to avoid someone passing something nasty,
> but luckily that was *not* the choice that was made, because I think this
> is a fine design.
>
> As a result, I think the original proposal was the most elegant and
> consistent with other environment types:
>
> --environment_type DOCKER --environment_config "-v 
> /Volumes/mnt/foo:/Volumes/mnt/foo --user sambvfx MY_CONTAINER_NAME"
>
>
Note that the help docs for --environment_config heavily suggest that the
intent for this option was always JSON:
Set environment configuration for running the user code. For DOCKER: Url
for the docker image.\n For PROCESS: json of the form {"os": "",
"arch": "", "command": "",
"env":{"": ""} }. All fields in the json
are optional except command.

1:
https://github.com/apache/beam/blob/24e9cedcc768d901de795477fa78c7f357635671/sdks/java/core/src/main/java/org/apache/beam/sdk/options/PipelineOptionsFactory.java#L163
2:
https://github.com/apache/beam/blob/24e9cedcc768d901de795477fa78c7f357635671/sdks/python/apache_beam/options/pipeline_options.py#L822
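As a sketch of what consuming that JSON form looks like, here is a hypothetical validator following the help text above (all fields optional except "command"). This is illustrative, not Beam's actual parser:

```python
import json

def parse_process_environment_config(raw):
    """Parse a PROCESS environment_config string per the help text:
    all fields are optional except "command"."""
    config = json.loads(raw)
    if "command" not in config:
        raise ValueError('environment_config requires a "command" field')
    return {
        "os": config.get("os", ""),
        "arch": config.get("arch", ""),
        "command": config["command"],
        "env": config.get("env", {}),
    }
```

One upside of the JSON form, as noted, is that quoting and parsing ambiguities are pushed onto a well-defined format instead of ad-hoc string splitting.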


Re: Late data handling in Python SDK

2019-08-09 Thread Lukasz Cwik
+dev 

Related JIRA's I found are BEAM-3759 and BEAM-7825. This has been a
priority thing as the community has been trying to get streaming Python
execution working on multiple Beam runners.

On Wed, Aug 7, 2019 at 2:31 AM Sam Stephens  wrote:

> Hi all,
>
> I’ve been reading into, and experimenting with, the Python SDK recently. I
> see that late data handling is not supported currently, but I can’t find
> where the progress of this feature is tracked: either in the portability
> support matrix or in Beam’s Jira. Can anyone help me there?
>
> If the answer is that there isn’t a single place to see this, my
> question is then: what is the current status of this? Is there some
> underlying difficult problem that has to be solved first or is it just a
> priority thing?
>
> Thanks
> Sam
>


Re: Write-through-cache in State logic

2019-08-09 Thread Lukasz Cwik
On Fri, Aug 9, 2019 at 2:32 AM Robert Bradshaw  wrote:

> The question is whether the SDK needs to wait for the StateResponse to
> come back before declaring the bundle done. The proposal was to not
> send the cache token back as part of an append StateResponse [1], but
> pre-provide it as part of the bundle request.
>

Agree, the purpose of the I'm Blocked message is to occur during bundle
processing.


> Thinking about this some more, if we assume the state response was
> successfully applied, there's no reason for the SDK to block the
> bundle until it has its hands on the cache token--we can update the
> cache once the StateResponse comes back whether or not the bundle is
> still active. On the other hand, the runner needs a way to assert it
> has received and processed all StateRequests from the SDK associated
> with a bundle before it can declare the bundle complete (regardless of
> the cache tokens), so this might not be safe without some extra
> coordination (e.g. the ProcessBundleResponse indicating the number of
> state requests associated with a bundle).
>

Since the state request stream is ordered, we can add the id of the last
state request as part of the ProcessBundleResponse.
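Because the stream is ordered, the runner-side completion check reduces to comparing request ids. A minimal sketch (illustrative names, not the Beam protos):

```python
class BundleCompletionTracker:
    """Sketch: the SDK reports the id of its last state request in
    ProcessBundleResponse; the runner declares the bundle done only
    after it has processed requests up to that id."""

    def __init__(self):
        self._last_processed_id = 0

    def on_state_request(self, request_id):
        # State requests arrive in order on the stream.
        self._last_processed_id = request_id

    def bundle_complete(self, last_request_id_from_response):
        return self._last_processed_id >= last_request_id_from_response
```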


> [1]
> https://github.com/apache/beam/blob/release-2.14.0/model/fn-execution/src/main/proto/beam_fn_api.proto#L627
>
> On Thu, Aug 8, 2019 at 6:57 PM Lukasz Cwik  wrote:
> >
> > The purpose of the new state API call in BEAM-7000 is to tell the runner
> that the SDK is now blocked waiting for the result of a specific state
> request and it should be used for fetches (not updates) and is there to
> allow for SDKs to differentiate readLater (I will need this data at some
> point in time in the future) from read (I need this data now). This comes
> up commonly where the user prefetches multiple state cells and then looks
> at their content allowing the runner to batch up those calls on its end.
> >
> > The way it can be used for clear+append is that the runner can store
> requests in memory up until some time/memory limit or until it gets its
> first "blocked" call and then issue all the requests together.
> >
> >
> > On Thu, Aug 8, 2019 at 9:42 AM Robert Bradshaw 
> wrote:
> >>
> >> On Tue, Aug 6, 2019 at 12:07 AM Thomas Weise  wrote:
> >> >
> >> > That would add a synchronization point that forces extra latency
> especially in streaming mode.
> >> >
> >> > Wouldn't it be possible for the runner to assign the token when
> starting the bundle and for the SDK to pass it along the state requests?
> That way, there would be no need to batch and wait for a flush.
> >>
> >> I think it makes sense to let the runner pre-assign these state update
> >> tokens rather than forcing a synchronization point.
> >>
> >> Here's some pointers for the Python implementation:
> >>
> >> Currently, when a DoFn needs UserState, a StateContext object is used
> >> that converts from a StateSpec to the actual value. When running
> >> portably, this is FnApiUserStateContext [1]. The state handles
> >> themselves are cached at [2] but this context only lives for the
> >> lifetime of a single bundle. Logic could be added here to use the
> >> token to share these across bundles.
> >>
> >> Each of these handles in turn invokes state_handler.get* methods when
> >> its read is called. (Here state_handler is a thin wrapper around the
> >> service itself) and constructs the appropriate result from the
> >> StateResponse. We would need to implement caching at this level as
> >> well, including the deserialization. This will probably require some
> >> restructoring of how _StateBackedIterable is implemented (or,
> >> possibly, making that class itself cache aware). Hopefully that's
> >> enough to get started.
> >>
> >> [1]
> https://github.com/apache/beam/blob/release-2.14.0/sdks/python/apache_beam/runners/worker/bundle_processor.py#L402
> >> [2]
> https://github.com/apache/beam/blob/release-2.14.0/sdks/python/apache_beam/runners/worker/bundle_processor.py#L436
> >> .
> >>
> >> > On Mon, Aug 5, 2019 at 2:49 PM Lukasz Cwik  wrote:
> >> >>
> >> >> I believe the intent is to add a new state API call telling the
> runner that it is blocked waiting for a response (BEAM-7000).
> >> >>
> >> >> This should allow the runner to wait till it sees one of these I'm
> blocked requests and then merge + batch any state calls it may have at that
> point in time allowing it to convert clear + appends into set calls and do
> any other optimizati

Re: Inconsistent Results with GroupIntoBatches PTransform

2019-08-08 Thread Lukasz Cwik
Have you tried running this on more than one runner (e.g. Dataflow, Flink,
Direct)?

Are you setting --streaming when executing?

On Thu, Aug 8, 2019 at 10:23 AM rahul patwari 
wrote:

> Hi,
>
> I am getting inconsistent results when using GroupIntoBatches PTransform.
> I am using Create.of() PTransform to create a PCollection from in-memory.
> When a coder is given with Create.of() PTransform, I am facing the issue.
> If the coder is not provided, the results are consistent and correct(Maybe
> this is just a coincidence and the problem is at some other place).
> If Batch Size is 1, results are always consistent.
>
> Not sure if this is an issue with Serialization/Deserialization (or)
> GroupIntoBatches (or) Create.of() PTransform.
>
> The Java code, expected correct results, and inconsistent results are
> available at https://github.com/rahul8383/beam-examples
>
> Thanks,
> Rahul
>


Re: Proposal for SDFs in the Go SDK

2019-08-08 Thread Lukasz Cwik
Thanks for the informative doc. Added a bunch of questions/feedback.

On Thu, Aug 8, 2019 at 9:15 AM Robert Burke  wrote:

> Thanks for the spending the time writing this up! I'm looking forward to
> seeing how the prototype implementation plays out. In particular with the
> extensive section on how users will actually use the presented API to get
> their DoFns to scale.
>
>  (Disclosure: I helped pre-review the document, which is why I don't have
> any further commentary at this time.)
>
> On Wed, Aug 7, 2019, 11:57 AM Daniel Oliveira 
> wrote:
>
>> Hello Beam devs,
>>
>> I've been working on a proposal for implementing SDFs in the Go SDK. For
>> those who were unaware, the Go SDK hasn't supported SDFs in any capacity
>> yet, so my proposal covers the user-facing API and a basic look into how it
>> will work under the hood.
>>
>> I'd appreciate it if anyone interested in the Go SDK or anyone who's been
>> working with portable SDFs could give it a look and provide some feedback.
>> There's still a few open questions mentioned in the doc that I'd like to
>> get feedback on before deciding on anything.
>>
>>
>> https://docs.google.com/document/d/14IwJYEUpar5FmiPNBFvERADiShZjsrsMpgtlntPVCX0/edit?usp=sharing
>>
>> Thanks,
>> Daniel Oliveira
>>
>


Re: Allowing firewalled/offline builds of Beam

2019-08-08 Thread Lukasz Cwik
Udi beat me by a couple of mins.

We build a good portion of the Beam Java codebase internally within Google
by bypassing the gradle wrapper (gradlew) and executing the gradle command
from a full gradle installation at the root of a copy of the Beam codebase.

It does require your internal build system to use a version of gradle that
is compatible with the version[1] that gradlew uses and you could create a
wrapper that figures out which version of gradle to use and select the
appropriate one from many local gradle installations. This should allow you
to bypass the gradlew script entirely and any downloading it does.

Note that gradle does support a --offline flag which we also use to ensure
that it doesn't pull stuff from the internet. Not sure if all the plugins
honor it but it works well enough for us to build most of the Beam Java
codebase with it.

1:
https://github.com/apache/beam/blob/497bc77c0d53098887156a014a659184097ef021/gradle/wrapper/gradle-wrapper.properties#L20
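The "wrapper that figures out which version of gradle to use" could be sketched like this: parse the distributionUrl in gradle/wrapper/gradle-wrapper.properties and pick a matching pre-downloaded installation, so the build never touches the network. A hypothetical helper; the version-to-path map is an assumption of this sketch:

```python
import re

def pick_local_gradle(wrapper_properties_text, installations):
    """Select a local gradle installation matching the version the
    Beam gradlew wrapper expects. `installations` maps version string
    to a local install path."""
    for line in wrapper_properties_text.splitlines():
        if line.startswith("distributionUrl"):
            match = re.search(r"gradle-(\d+(?:\.\d+)+)-", line)
            if match:
                version = match.group(1)
                if version in installations:
                    return installations[version]
                raise LookupError("no local gradle %s installed" % version)
    raise LookupError("distributionUrl not found")
```

Combined with --offline, this avoids both the wrapper download and dependency fetches.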

On Thu, Aug 8, 2019 at 11:15 AM Udi Meiri  wrote:

> You can download it here: https://gradle.org/releases/
> and run it instead of using the wrapper.
>
> Example:
> $ cd
> $ unzip Downloads/gradle-5.5.1-bin.zip
> $ cd ~/src/beam
> $ ~/gradle-5.5.1/bin/gradle lint
>
>
> On Thu, Aug 8, 2019 at 10:52 AM Chad Dombrova  wrote:
>
>> This topic came up in another thread, so I wanted to highlight a few
>> things that we've discovered in our endeavors to build Beam behind a
>> firewall.
>>
>> Conceptually, in order to allow this, a user needs to provide alternate
>> mirrors for each "artifact" service required during build, and luckily I
>> think most of the toolchains used by Beam support this. For example, the
>> default PyPI mirror used by pip can be overridden via env var to an
>> internal mirror, and likewise for docker and its registry service.  I'm
>> currently looking into gogradle to see if we can provide an alternate
>> vendor directory as a shared resource behind our firewall. (I have a bigger
>> question here, which is why was it necessary to add a third language into
>> the python Beam ecosystem, just for the bootstrap process?  Couldn't the
>> boot code use python, or java?)
>>
>> But I'm getting ahead of myself.  We're actually stuck at the very
>> beginning, with gradlew.  The gradlew wrapper seems to unconditionally
>> download gradle, so you can't get past the first few hundred lines of code
>> in the build process without requiring internet access.  I made a ticket
>> here: https://issues.apache.org/jira/browse/BEAM-7931.  I'd love some
>> pointers on how to fix this, because the offending code lives inside
>> gradle-wrapper.jar, so I can't change it without access to the source.
>>
>> thanks,
>> -chad
>>
>>


Re: Java 11 compatibility question

2019-08-07 Thread Lukasz Cwik
Since java8 -> java11 is similar to python2 -> python3 migration, what was
the acceptance criteria there?

On Wed, Aug 7, 2019 at 1:54 PM Elliotte Rusty Harold 
wrote:

>
>
> On Wed, Aug 7, 2019 at 9:41 AM Michał Walenia 
> wrote:
>
>>
>> Are these tests sufficient to say that we’re java 11 compatible? What
>> other aspects do we need to test to be able to say that?
>>
>>
> Are any packages split across multiple jar files, including packages beam
> dependns on? That's the one that's bitten some other projects, including
> google-cloud-java and gRPC. If so, beam is not going to work with the
> module system.
>
> Work is ongoing to fix split packages in both gRPC and google-cloud-java,
> but we're not very far down that path and I think it's going to be an API
> breaking change.
>
> Romain pointed this out earlier and I fixed the last case of packages
being split across multiple jars within Apache Beam but as you point out
our transitive dependencies are not ready.

>
> --
> Elliotte Rusty Harold
> elh...@ibiblio.org
>


Re: [DISCUSS] Turn `WindowedValue` into `T` in the FnDataService and BeamFnDataClient interface definition

2019-08-07 Thread Lukasz Cwik
I wanted to add some more details about the state discussion.

BEAM-7000 is about adding support for a gRPC message saying that the SDK is
now blocked on one of its requests. This would allow for an easy
optimization on the runner side where it gathers requests and is able to
batch them knowing that the SDK is only blocked once it sees one of the
blocked gRPC messages. This would make it easy for the runner to gather up
clear + append calls and convert them to sets internally.
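The clear+append coalescing described here can be sketched as follows. The (op, key, value) request tuples are illustrative stand-ins for the actual gRPC messages:

```python
def coalesce_state_requests(requests):
    """Runner-side sketch: buffer state requests until the SDK signals
    it is blocked, then collapse a clear followed by appends on the
    same key into a single set."""
    pending = {}  # key -> ("set" | "append", [values])
    order = []    # preserve first-touch order of keys
    for op, key, value in requests:
        if op == "clear":
            # A clear means the final contents are exactly the appends
            # that follow, i.e. a set.
            pending[key] = ("set", [])
            if key not in order:
                order.append(key)
        elif op == "append":
            kind, values = pending.get(key, ("append", []))
            values.append(value)
            pending[key] = (kind, values)
            if key not in order:
                order.append(key)
    return [(pending[k][0], k, pending[k][1]) for k in order]
```

The "blocked" message is what makes this safe: the runner knows no further requests for the bundle can race with the batch it issues.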

Also, the main reason map state does not exist yet is that we haven't
discussed the changes to the gRPC APIs that we need. (Things like:
can you lookup/clear/append to ranges?, map or multimap?, should we really
just get rid of bag state in favor of a multimap state?, can you enumerate
keys?, know how many keys there are?, ...)

On Wed, Aug 7, 2019 at 9:52 AM Robert Bradshaw  wrote:

> The list looks good to me. Thanks for summarizing. Feel free to dive
> into any of these issues yourself :).
>
> On Fri, Aug 2, 2019 at 6:24 PM jincheng sun 
> wrote:
> >
> > Hi all,
> >
> >
> > Thanks a lot for sharing your thoughts!
> >
> >
> > It seems that we have already reached consensus on the following items.
> Could you please read through them again and double-check that you all agree
> with these? If yes, then I will start creating JIRA issues for those that
> don’t yet have a JIRA issue.
> >
> >
> > 1. Items that require improvements of Beam:
> >
> >
> > 1) The "semi_persist_dir" directory should be configurable. (TODO)
> >
> >
> https://github.com/apache/beam/blob/master/sdks/python/container/boot.go#L48
> >
> >
> > 2) Time-based cache threshold should be supported. (TODO)
> >
> >
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/data_plane.py#L259
> >
> >
> https://github.com/apache/beam/blob/master/sdks/java/fn-execution/src/main/java/org/apache/beam/sdk/fn/data/BeamFnDataBufferingOutboundObserver.java
> >
> >
> > 3) Cross-bundle cache should be supported. (
> https://issues.apache.org/jira/browse/BEAM-5428)
> >
> >
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/bundle_processor.py#L349
> >
> >
> > 4) Allow configuring the log level. (TODO)
> >
> > https://issues.apache.org/jira/browse/BEAM-5468
> >
> >
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/sdk_worker_main.py#L102
> >
> >
> > 5) Improve the interfaces of classes such as FnDataService,
> BundleProcessor, ActiveBundle, etc. to change the parameter type from
> WindowedValue<T> to T. (TODO)
> >
> >
> > 6) Python 3 is already supported in Beam. The warning should be removed.
> (TODO)
> >
> > https://github.com/apache/beam/blob/master/sdks/python/setup.py#L179
> >
> >
> > 7) The coder of WindowedValue<T> should be configurable, which makes it
> possible to use a custom coder such as ValueOnlyWindowedValueCoder.
> (TODO)
> >
> >
> https://github.com/apache/beam/blob/master/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/wire/WireCoders.java#L91
> >
> >
> > 8) The schema work can be used to solve the performance issue of the
> extra length prefix in the encoding. However, it should also be supported
> in Python. (https://github.com/apache/beam/pull/9188)
> >
> >
> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/schema.proto
> >
> >
> > 9) MapState should be supported in the gRPC protocol. (TODO)
> >
> >
> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/proto/beam_fn_api.proto#L662
> >
> >
> >
> >
> > 2. Items where we don’t need to do anything for now:
> >
> >
> > 1) The default buffer size is enough for most cases and there is no need
> to make it configurable for now.
> >
> > 2) Do not support ValueState in the gRPC protocol for now unless we have
> evidence it matters.
> >
> >
> >
> > If there is any incorrect understanding, please feel free to correct
> me :)
> >
> >
> > 
> >
> >
> > There are also some items that I didn’t bring up earlier which require
> further discussion:
> >
> > 1) The input queue of the input buffer in the Python SDK Harness is not
> size-limited. We should give it a reasonable default size.
> >
> >
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/data_plane.py#L175
> >
> >
> > 2) Refactor the code to avoid pulling in unnecessary dependencies. For
> example, beam-sdks-java-core (11MB) is a package for Java SDK users and it
> is pulled in because a few classes from beam-sdks-java-core are used
> in beam-runners-java-fn-execution, such as:
> >
> > PipelineOptions used in DefaultJobBundleFactory, FileSystems used in
> BeamFileSystemArtifactRetrievalService.
> >
> > This means maybe we can add a new module such as beam-sdks-java-common to
> hold the classes used by both the runner and the SDK.
> >
> >
> > 3) Allow starting up the StatusServer according to configuration in the
> Python SDK Harness. Currently the StatusServer is 

Re: What would be the best place for performance tests documentation?

2019-08-07 Thread Lukasz Cwik
I also think confluence makes the most sense.

On Wed, Aug 7, 2019 at 11:57 AM Alexey Romanenko 
wrote:

> I agree with Cyrus that a Confluence page should be a good place for that
> since, it seems, it will be very dev-oriented documentation.
>
>
> On 7 Aug 2019, at 16:31, Cyrus Maden  wrote:
>
> Hi Łukasz,
>
> This sounds perfect for the confluence, since we already have the testing
> guide
> 
> in there. There's already a link to the testing guide in the
> beam.apache.org/contribute section, which might help direct folks to the
> new doc as well.
>
> Best,
> Cyrus
>
> On Wed, Aug 7, 2019 at 7:54 AM Łukasz Gajowy  wrote:
>
>> Hi all,
>>
>> I'm currently working on documenting the load tests of Core Apache Beam
>> operations
>> 
>> as we have some Jenkins jobs running and several dashboards for that.
>>
>> This is what I've got so far (work in progress but comments are welcome):
>> LINK
>> 
>>
>> I've got the following doubts on where I should put it:
>>
>>- the tests are rather dev-facing. Should the documentation for them
>>be on the beam.apache.org page (where the user-facing documentation
>>is located) or maybe only on Beam's confluence?
>>- if that belongs to confluence only, should I simply place it under
>>"technical/design docs" section?
>>- if this goes on the website, what is the best place to put it?
>>
>> What are your thoughts?
>>
>> Thanks,
>> Łukasz
>>
>
>


Re: Write-through-cache in State logic

2019-08-05 Thread Lukasz Cwik
I believe the intent is to add a new state API call telling the runner that
the SDK is blocked waiting for a response (BEAM-7000).

This should allow the runner to wait till it sees one of these I'm-blocked
requests and then merge + batch any state calls it may have at that point
in time, allowing it to convert clear + appends into set calls and do any
other optimizations as well. By default, the runner would have a time- and
space-based limit on how many outstanding state calls there are before
choosing to resolve them.
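A minimal sketch of such a batching buffer (class and method names here are
assumptions, not Beam's API): the runner holds state calls until the SDK
signals it is blocked, or until a time or space limit trips.

```python
import time

class StateCallBuffer:
    """Holds outstanding state calls; flushes on an "SDK blocked" signal or
    when a count or age limit is exceeded. Hypothetical runner-side helper."""

    def __init__(self, max_calls=100, max_age_secs=1.0, clock=time.monotonic):
        self._calls = []
        self._oldest = None  # timestamp of the oldest buffered call
        self._max_calls = max_calls
        self._max_age = max_age_secs
        self._clock = clock

    def add(self, call):
        if self._oldest is None:
            self._oldest = self._clock()
        self._calls.append(call)
        if (len(self._calls) >= self._max_calls
                or self._clock() - self._oldest >= self._max_age):
            return self.flush()  # space or time limit hit
        return []

    def on_sdk_blocked(self):
        # The SDK cannot make progress until these resolve: flush now.
        return self.flush()

    def flush(self):
        batch, self._calls, self._oldest = self._calls, [], None
        return batch
```

The flushed batch is where the clear + append -> set merging described above
would happen before the calls hit the state backend.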

On Mon, Aug 5, 2019 at 5:43 PM Lukasz Cwik  wrote:

> Now I see what you mean.
>
> On Mon, Aug 5, 2019 at 5:42 PM Thomas Weise  wrote:
>
>> Hi Luke,
>>
>> I guess the answer is that it depends on the state backend. If a set
>> operation in the state backend is available that is more efficient than
>> clear+append, then it would be beneficial to have a dedicated fn api
>> operation to allow for such optimization. That's something that needs to be
>> determined with a profiler :)
>>
>> But the low hanging fruit is cross-bundle caching.
>>
>> Thomas
>>
>> On Mon, Aug 5, 2019 at 2:06 PM Lukasz Cwik  wrote:
>>
>>> Thomas, why do you think a single round trip is needed?
>>>
>>> clear + append can be done blindly from the SDK side and it has total
>>> knowledge of the state at that point in time till the end of the bundle at
>>> which point you want to wait to get the cache token back from the runner
>>> for the append call so that for the next bundle you can reuse the state if
>>> the key wasn't processed elsewhere.
>>>
>>> Also, all state calls are "streamed" over gRPC so you don't need to wait
>>> for clear to complete before being able to send append.
>>>
>>> On Tue, Jul 30, 2019 at 12:58 AM jincheng sun 
>>> wrote:
>>>
>>>> Hi Rakesh,
>>>>
>>>> Glad to see you pointed this problem out!
>>>> +1 for adding this implementation. Managing state with a write-through
>>>> cache is pretty important for streaming jobs!
>>>>
>>>> Best, Jincheng
>>>>
>>>> On Mon, Jul 29, 2019 at 8:54 PM, Thomas Weise  wrote:
>>>>
>>>>> FYI a basic test appears to confirm the importance of the cross-bundle
>>>>> caching: I found that the throughput can be increased by playing with the
>>>>> bundle size in the Flink runner. Default caps at 1000 elements (or 1
>>>>> second). So on a high throughput stream the bundles would be capped by the
>>>>> count limit. Bumping the count limit increases the throughput by reducing
>>>>> the chatter over the state plane (more cache hits due to larger bundle).
>>>>>
>>>>> The next level of investigation would involve profiling. But just by
>>>>> looking at metrics, the CPU utilization on the Python worker side dropped
>>>>> significantly while on the Flink side it remains nearly the same. There are
>>>>> no metrics for state operations on either side; I think it would be very
>>>>> helpful to get these in place as well.
>>>>>
>>>>> Below the stateful processing code for reference.
>>>>>
>>>>> Thomas
>>>>>
>>>>>
>>>>> class StatefulFn(beam.DoFn):
>>>>>   count_state_spec = userstate.CombiningValueStateSpec(
>>>>>       'count', beam.coders.IterableCoder(beam.coders.VarIntCoder()), sum)
>>>>>   timer_spec = userstate.TimerSpec('timer', userstate.TimeDomain.WATERMARK)
>>>>>
>>>>>   def process(self, kv, count=beam.DoFn.StateParam(count_state_spec),
>>>>>               timer=beam.DoFn.TimerParam(timer_spec),
>>>>>               window=beam.DoFn.WindowParam):
>>>>>     count.add(1)
>>>>>     timer_seconds = (window.end.micros // 1000000) - 1
>>>>>     timer.set(timer_seconds)
>>>>>
>>>>>   @userstate.on_timer(timer_spec)
>>>>>   def process_timer(self, count=beam.DoFn.StateParam(count_state_spec),
>>>>>                     window=beam.DoFn.WindowParam):
>>>>>     if count.read() == 0:
>>>>>       logging.warning("###timer fired with count %d, window %s"
>>>>>                       % (count.read(), window))
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jul 25, 2019 at 5:09 AM Robert Bradshaw 
>>>>> wrote:
>>>>>
>>>>>> On 

Re: Write-through-cache in State logic

2019-08-05 Thread Lukasz Cwik
Now I see what you mean.

On Mon, Aug 5, 2019 at 5:42 PM Thomas Weise  wrote:

> Hi Luke,
>
> I guess the answer is that it depends on the state backend. If a set
> operation in the state backend is available that is more efficient than
> clear+append, then it would be beneficial to have a dedicated fn api
> operation to allow for such optimization. That's something that needs to be
> determined with a profiler :)
>
> But the low hanging fruit is cross-bundle caching.
>
> Thomas
>
> On Mon, Aug 5, 2019 at 2:06 PM Lukasz Cwik  wrote:
>
>> Thomas, why do you think a single round trip is needed?
>>
>> clear + append can be done blindly from the SDK side and it has total
>> knowledge of the state at that point in time till the end of the bundle at
>> which point you want to wait to get the cache token back from the runner
>> for the append call so that for the next bundle you can reuse the state if
>> the key wasn't processed elsewhere.
>>
>> Also, all state calls are "streamed" over gRPC so you don't need to wait
>> for clear to complete before being able to send append.
>>
>> On Tue, Jul 30, 2019 at 12:58 AM jincheng sun 
>> wrote:
>>
>>> Hi Rakesh,
>>>
>>> Glad to see you pointed this problem out!
>>> +1 for adding this implementation. Managing state with a write-through
>>> cache is pretty important for streaming jobs!
>>>
>>> Best, Jincheng
>>>
>>> On Mon, Jul 29, 2019 at 8:54 PM, Thomas Weise  wrote:
>>>
>>>> FYI a basic test appears to confirm the importance of the cross-bundle
>>>> caching: I found that the throughput can be increased by playing with the
>>>> bundle size in the Flink runner. Default caps at 1000 elements (or 1
>>>> second). So on a high throughput stream the bundles would be capped by the
>>>> count limit. Bumping the count limit increases the throughput by reducing
>>>> the chatter over the state plane (more cache hits due to larger bundle).
>>>>
>>>> The next level of investigation would involve profiling. But just by
>>>> looking at metrics, the CPU utilization on the Python worker side dropped
>>>> significantly while on the Flink side it remains nearly the same. There are
>>>> no metrics for state operations on either side; I think it would be very
>>>> helpful to get these in place as well.
>>>>
>>>> Below the stateful processing code for reference.
>>>>
>>>> Thomas
>>>>
>>>>
>>>> class StatefulFn(beam.DoFn):
>>>>   count_state_spec = userstate.CombiningValueStateSpec(
>>>>       'count', beam.coders.IterableCoder(beam.coders.VarIntCoder()), sum)
>>>>   timer_spec = userstate.TimerSpec('timer', userstate.TimeDomain.WATERMARK)
>>>>
>>>>   def process(self, kv, count=beam.DoFn.StateParam(count_state_spec),
>>>>               timer=beam.DoFn.TimerParam(timer_spec),
>>>>               window=beam.DoFn.WindowParam):
>>>>     count.add(1)
>>>>     timer_seconds = (window.end.micros // 1000000) - 1
>>>>     timer.set(timer_seconds)
>>>>
>>>>   @userstate.on_timer(timer_spec)
>>>>   def process_timer(self, count=beam.DoFn.StateParam(count_state_spec),
>>>>                     window=beam.DoFn.WindowParam):
>>>>     if count.read() == 0:
>>>>       logging.warning("###timer fired with count %d, window %s"
>>>>                       % (count.read(), window))
>>>>
>>>>
>>>>
>>>> On Thu, Jul 25, 2019 at 5:09 AM Robert Bradshaw 
>>>> wrote:
>>>>
>>>>> On Wed, Jul 24, 2019 at 6:21 AM Rakesh Kumar 
>>>>> wrote:
>>>>> >
>>>>> > Thanks Robert,
>>>>> >
>>>>> >  I stumble on the jira that you have created some time ago
>>>>> > https://jira.apache.org/jira/browse/BEAM-5428
>>>>> >
>>>>> > You also marked code where code changes are required:
>>>>> >
>>>>> https://github.com/apache/beam/blob/7688bcfc8ebb4bedf26c5c3b3fe0e13c0ec2aa6d/sdks/python/apache_beam/runners/worker/bundle_processor.py#L291
>>>>> >
>>>>> https://github.com/apache/beam/blob/7688bcfc8ebb4bedf26c5c3b3fe0e13c0ec2aa6d/sdks/python/apache_beam/runners/worker/bundle_processor.py#L349
>>>>> >
>>>>> https://github.com/apache/beam/blob/7688bcfc8ebb4bedf26c5c3b3fe0e13c0ec2aa6d/sdks/python/apache_b

Re: Latency of Google Dataflow with Pubsub

2019-08-05 Thread Lukasz Cwik
+dev 

On Mon, Aug 5, 2019 at 12:49 PM Dmitry Minaev  wrote:

> Hi there,
>
> I'm building streaming pipelines in Beam (using Google Dataflow runner)
> and using Google Pubsub as a message broker. I've made a couple of
> experiments with a very simple pipeline: consume events from Pubsub
> subscription, add a timestamp to the message body, emit the new event to
> another Pubsub topic. I'm using all the default parameters when producing
> and consuming messages.
>
> I've noticed a pretty high latency while consuming messages in Dataflow
> from Pubsub. My observations show that average duration between the event
> create timestamp (simple producer that publishes events to Pubsub) and
> event consume timestamp (Google Dataflow using PubsubIO) is more than 2
> seconds. I've been publishing messages at different rates, e.g. 10 msg/sec,
> 1000 msg/sec, 10,000 msg/sec. And the latency never went lower than 2
> seconds. Such latency looks really high. I've tried with direct runner and
> it has high latency too.
>
> I've made a few other experiments with Kafka (very small Kafka cluster)
> and the same kind of pipeline: consume from Kafka, add timestamp, publish
> to another Kafka topic. I saw the latency is much lower, on average it's
> about 150 milliseconds.
>
> I suspect there is some batching in PubsubIO that makes the latency so
> high.
>
> My questions are: what should be expected latency in this kind of
> scenarios? Is there any recommendations to achieve lower latency?
>
> I appreciate any help on this!
>
> Thank you,
> Dmitry.
>


Re: Write-through-cache in State logic

2019-08-05 Thread Lukasz Cwik
PM Rakesh Kumar 
>>> wrote:
>>> >> >
>>> >> > I checked the python sdk[1] and it has a similar implementation to
>>> the Java SDK.
>>> >> >
>>> >> > I would agree with Thomas. In the case of a high-volume event stream
>>> and a bigger cluster size, network calls can potentially cause a bottleneck.
>>> >> >
>>> >> > @Robert
>>> >> > I am interested to see the proposal. Can you provide me the link of
>>> the proposal?
>>> >> >
>>> >> > [1]:
>>> https://github.com/apache/beam/blob/db59a3df665e094f0af17fe4d9df05fe420f3c16/sdks/python/apache_beam/transforms/userstate.py#L295
>>> >> >
>>> >> >
>>> >> > On Tue, Jul 16, 2019 at 9:43 AM Thomas Weise 
>>> wrote:
>>> >> >>
>>> >> >> Thanks for the pointer. For streaming, it will be important to
>>> support caching across bundles. It appears that even the Java SDK doesn't
>>> support that yet?
>>> >> >>
>>> >> >>
>>> https://github.com/apache/beam/blob/77b295b1c2b0a206099b8f50c4d3180c248e252c/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/FnApiDoFnRunner.java#L221
>>> >> >>
>>> >> >> Regarding clear/append: It would be nice if both could occur
>>> within a single Fn Api roundtrip when the state is persisted.
>>> >> >>
>>> >> >> Thanks,
>>> >> >> Thomas
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> On Tue, Jul 16, 2019 at 6:58 AM Lukasz Cwik 
>>> wrote:
>>> >> >>>
>>> >> >>> User state is built on top of read, append and clear and not off
>>> a read and write paradigm to allow for blind appends.
>>> >> >>>
>>> >> >>> The optimization you speak of can be done completely inside the
>>> SDK without any additional protocol being required as long as you clear the
>>> state first and then append all your new data. The Beam Java SDK does this
>>> for all runners when executed portably[1]. You could port the same logic to
>>> the Beam Python SDK as well.
>>> >> >>>
>>> >> >>> 1:
>>> https://github.com/apache/beam/blob/41478d00d34598e56471d99d0845ac16efa5b8ef/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/state/BagUserState.java#L84
>>> >> >>>
>>> >> >>> On Tue, Jul 16, 2019 at 5:54 AM Robert Bradshaw <
>>> rober...@google.com> wrote:
>>> >> >>>>
>>> >> >>>> Python workers also have a per-bundle SDK-side cache. A protocol
>>> has
>>> >> >>>> been proposed, but hasn't yet been implemented in any SDKs or
>>> runners.
>>> >> >>>>
>>> >> >>>> On Tue, Jul 16, 2019 at 6:02 AM Reuven Lax 
>>> wrote:
>>> >> >>>> >
>>> >> >>>> > It's runner dependent. Some runners (e.g. the Dataflow runner)
>>> do have such a cache, though I think it currently has a cap for large
>>> bags.
>>> >> >>>> >
>>> >> >>>> > Reuven
>>> >> >>>> >
>>> >> >>>> > On Mon, Jul 15, 2019 at 8:48 PM Rakesh Kumar <
>>> rakeshku...@lyft.com> wrote:
>>> >> >>>> >>
>>> >> >>>> >> Hi,
>>> >> >>>> >>
>>> >> >>>> >> I have been using python sdk for the application and also
>>> using BagState in production. I was wondering whether state logic has any
>>> write-through-cache implemented or not. If we are sending every read and
>>> write request through the network then it comes with a performance cost. We
>>> can avoid the network call for a read operation if we have a write-through
>>> cache.
>>> >> >>>> >> I have superficially looked into the implementation and I
>>> didn't see any cache implementation.
>>> >> >>>> >>
>>> >> >>>> >> is it possible to have this cache? would it cause any issue
>>> if we have the caching layer?
>>> >> >>>> >>
>>>
>>


[RESULT] [VOTE] Vendored Dependencies Release

2019-07-16 Thread Lukasz Cwik
I'm happy to announce that we have unanimously approved this release.

There are 4 approving votes, 3 of which are binding:
* Ismaël Mejía
* Lukasz Cwik
* Pablo Estrada

There are no disapproving votes.

Thanks everyone!

On Tue, Jul 16, 2019 at 4:30 AM Ismaël Mejía  wrote:

> +1
>
> Run build and used diffoscope [1] to compare the generated vs staged
> files. We should probably in the future make the full gradle build
> reproducible to make this comparison trivial (a simple diff).
>
> [1] https://diffoscope.org/
>
> On Tue, Jul 16, 2019 at 2:18 AM Lukasz Cwik  wrote:
> >
> > +1
> >
> > On Mon, Jul 15, 2019 at 8:14 PM Pablo Estrada 
> wrote:
> >>
> >> +1
> >> verified hashes and signatures
> >>
> >> On Fri, Jul 12, 2019 at 9:40 AM Kai Jiang  wrote:
> >>>
> >>> +1 (non-binding)
> >>>
> >>> On Thu, Jul 11, 2019 at 8:27 PM Lukasz Cwik  wrote:
> >>>>
> >>>> Please review the release of the following artifacts that we vendor:
> >>>>  * beam-vendor-grpc_1_21_0
> >>>>  * beam-vendor-guava-26_0-jre
> >>>>  * beam-vendor-bytebuddy-1_9_3
> >>>>
> >>>> Hi everyone,
> >>>> Please review and vote on the release candidate #3 for the
> org.apache.beam:beam-vendor-grpc_1_21_0:0.1,
> org.apache.beam:beam-vendor-guava-26_0-jre:0.1, and
> org.apache.beam:beam-vendor-bytebuddy-1_9_3:0.1 as follows:
> >>>> [ ] +1, Approve the release
> >>>> [ ] -1, Do not approve the release (please provide specific comments)
> >>>>
> >>>>
> >>>> The complete staging area is available for your review, which
> includes:
> >>>> * the official Apache source release to be deployed to
> dist.apache.org [1], which is signed with the key with fingerprint
> EAD5DE293F4A03DD2E77565589E68A56E371CCA2 [2],
> >>>> * all artifacts to be deployed to the Maven Central Repository [3],
> >>>> * commit hash "0fce2b88660f52dae638697e1472aa108c982ae6" [4],
> >>>>
> >>>> The vote will be open for at least 72 hours. It is adopted by
> majority approval, with at least 3 PMC affirmative votes.
> >>>>
> >>>> Thanks,
> >>>> Luke
> >>>>
> >>>> [1] https://dist.apache.org/repos/dist/dev/beam/vendor/
> >>>> [2] https://dist.apache.org/repos/dist/release/beam/KEYS
> >>>> [3]
> https://repository.apache.org/content/repositories/orgapachebeam-1078/
> >>>> [4]
> https://github.com/apache/beam/commit/0fce2b88660f52dae638697e1472aa108c982ae6
>


Re: Write-through-cache in State logic

2019-07-16 Thread Lukasz Cwik
User state is built on top of read, append and clear, and not on a
read-and-write paradigm, to allow for blind appends.

The optimization you speak of can be done completely inside the SDK without
any additional protocol being required as long as you clear the state first
and then append all your new data. The Beam Java SDK does this for all
runners when executed portably[1]. You could port the same logic to the
Beam Python SDK as well.

1:
https://github.com/apache/beam/blob/41478d00d34598e56471d99d0845ac16efa5b8ef/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/state/BagUserState.java#L84
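As a rough Python analog of the pattern in [1] (names here are hypothetical;
this is not the actual Beam SDK class): clears and appends are recorded
blindly against a local copy and only sent to the runner when the bundle
finishes, so a read never needs a network call once the contents are cached.

```python
class CachingBagState:
    """Sketch of the SDK-side optimization described above: reads go through
    a local copy, clear/append are buffered blindly, and the net effect is
    committed to the runner once at bundle end."""

    def __init__(self, fetch_fn, commit_fn):
        self._fetch = fetch_fn      # reads initial contents from the runner
        self._commit = commit_fn    # sends (cleared, appends) at bundle end
        self._contents = None       # lazily fetched runner-side contents
        self._cleared = False
        self._appends = []

    def read(self):
        if self._cleared:
            # After a clear, the full contents are known locally: no fetch.
            return list(self._appends)
        if self._contents is None:
            self._contents = list(self._fetch())
        return self._contents + self._appends

    def clear(self):
        self._cleared = True
        self._appends = []

    def append(self, value):
        self._appends.append(value)

    def finish_bundle(self):
        # One round trip carrying the net effect of all buffered mutations.
        self._commit(self._cleared, list(self._appends))
```

Note that a DoFn which always clears before appending never triggers a fetch
at all, which is exactly the blind-append case the message above describes.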

On Tue, Jul 16, 2019 at 5:54 AM Robert Bradshaw  wrote:

> Python workers also have a per-bundle SDK-side cache. A protocol has
> been proposed, but hasn't yet been implemented in any SDKs or runners.
>
> On Tue, Jul 16, 2019 at 6:02 AM Reuven Lax  wrote:
> >
> > It's runner dependent. Some runners (e.g. the Dataflow runner) do have
> such a cache, though I think it currently has a cap for large bags.
> >
> > Reuven
> >
> > On Mon, Jul 15, 2019 at 8:48 PM Rakesh Kumar 
> wrote:
> >>
> >> Hi,
> >>
> >> I have been using python sdk for the application and also using
> BagState in production. I was wondering whether state logic has any
> write-through cache implemented or not. If we are sending every read and
> write request through the network then it comes with a performance cost. We
> can avoid the network call for a read operation if we have a write-through
> cache.
> >> I have superficially looked into the implementation and I didn't see
> any cache implementation.
> >>
> >> is it possible to have this cache? would it cause any issue if we have
> the caching layer?
> >>
>


Re: [VOTE] Vendored Dependencies Release

2019-07-15 Thread Lukasz Cwik
+1

On Mon, Jul 15, 2019 at 8:14 PM Pablo Estrada  wrote:

> +1
> verified hashes and signatures
>
> On Fri, Jul 12, 2019 at 9:40 AM Kai Jiang  wrote:
>
>> +1 (non-binding)
>>
>> On Thu, Jul 11, 2019 at 8:27 PM Lukasz Cwik  wrote:
>>
>>> Please review the release of the following artifacts that we vendor:
>>>  * beam-vendor-grpc_1_21_0
>>>  * beam-vendor-guava-26_0-jre
>>>  * beam-vendor-bytebuddy-1_9_3
>>>
>>> Hi everyone,
>>> Please review and vote on the release candidate #3 for the
>>> org.apache.beam:beam-vendor-grpc_1_21_0:0.1,
>>> org.apache.beam:beam-vendor-guava-26_0-jre:0.1, and
>>> org.apache.beam:beam-vendor-bytebuddy-1_9_3:0.1 as follows:
>>> [ ] +1, Approve the release
>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>
>>>
>>> The complete staging area is available for your review, which includes:
>>> * the official Apache source release to be deployed to dist.apache.org
>>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__dist.apache.org=DwMFaQ=r2dcLCtU9q6n0vrtnDw9vg=w9NQdMas-PbABIe2Vnjx7A=H-vJr42nQS3r2dZ0sg68kCEsLFh63F3K0LtWfglh0GU=WACf2VjyAok8vYovWTUvflYamsd4EQXbSWi8gsGgfUU=>
>>> [1], which is signed with the key with fingerprint
>>> EAD5DE293F4A03DD2E77565589E68A56E371CCA2 [2],
>>> * all artifacts to be deployed to the Maven Central Repository [3],
>>> * commit hash "0fce2b88660f52dae638697e1472aa108c982ae6" [4],
>>>
>>> The vote will be open for at least 72 hours. It is adopted by majority
>>> approval, with at least 3 PMC affirmative votes.
>>>
>>> Thanks,
>>> Luke
>>>
>>> [1] https://dist.apache.org/repos/dist/dev/beam/vendor/
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__dist.apache.org_repos_dist_dev_beam_vendor_=DwMFaQ=r2dcLCtU9q6n0vrtnDw9vg=w9NQdMas-PbABIe2Vnjx7A=H-vJr42nQS3r2dZ0sg68kCEsLFh63F3K0LtWfglh0GU=fMaLceYFRUJZG3uVye19m__BBD526cWSp0fU7Pf9G3c=>
>>> [2] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__dist.apache.org_repos_dist_release_beam_KEYS=DwMFaQ=r2dcLCtU9q6n0vrtnDw9vg=w9NQdMas-PbABIe2Vnjx7A=H-vJr42nQS3r2dZ0sg68kCEsLFh63F3K0LtWfglh0GU=2yg-9826fgvQqKsFU0PzeQM3WRKH7W0a2KLMGYc4-vY=>
>>> [3]
>>> https://repository.apache.org/content/repositories/orgapachebeam-1078/
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__repository.apache.org_content_repositories_orgapachebeam-2D1078_=DwMFaQ=r2dcLCtU9q6n0vrtnDw9vg=w9NQdMas-PbABIe2Vnjx7A=H-vJr42nQS3r2dZ0sg68kCEsLFh63F3K0LtWfglh0GU=eGqZIf0gJOSVv8-SnBDwOQ4ntQHur9AvlxGMEu9kp1k=>
>>> [4]
>>> https://github.com/apache/beam/commit/0fce2b88660f52dae638697e1472aa108c982ae6
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_beam_commit_0fce2b88660f52dae638697e1472aa108c982ae6=DwMFaQ=r2dcLCtU9q6n0vrtnDw9vg=w9NQdMas-PbABIe2Vnjx7A=H-vJr42nQS3r2dZ0sg68kCEsLFh63F3K0LtWfglh0GU=rCQPOJjP75jRgJ6iq4ZKH8yu5HG9CU6aUKWMjadOKXA=>
>>>
>>


Re: Return types of Write transforms (aka best way to signal)

2019-07-15 Thread Lukasz Cwik
In the POutput case (4), does that mean we will have to compute all those
outputs in the transform even if they aren't used?

If yes, I prefer (6) because it allows for the transform structure to be
modified to either produce these additional outputs only if they will be
consumed instead of having them produced all the time.
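A toy illustration of the concern (plain Python, not Beam code; the names
are made up): the question is whether an extra output is computed eagerly
when the result object is built, or deferred until something consumes it.

```python
class LazyWriteResult:
    """Hypothetical write-result object whose extra output (failed writes)
    is only computed when someone actually asks for it."""

    def __init__(self, compute_failed_fn):
        self._compute_failed = compute_failed_fn
        self._failed = None
        self.computed = False  # exposed only to make the laziness visible

    @property
    def failed_writes(self):
        if not self.computed:
            self._failed = self._compute_failed()
            self.computed = True
        return self._failed
```

In Beam terms, option (6) lets the transform's structure itself decide
whether the extra output exists, while a result-object approach would need
this kind of deferral to avoid paying for unused outputs.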

On Mon, Jul 15, 2019 at 11:20 AM Ismaël Mejía  wrote:

> Just wanted to bring back the conversation on this subject. A quick
> abstract of the discussion so far:
>
> We are trying to agree in the best approach for return types in Write
> transforms towards some sort of ‘homogenization’ in IOs.
> At the moment we mostly agree that the best approach for return types
> in Writes are:
>
> 4. Write returns ‘a class that implements POutput’
> 6. Write returns `PCollection`
>
> There are still some details to discuss:
>
> 1. Is (4) somehow less composable than (6) ?
> 2. Could it make sense (in the API sense) that `SourceSpecificWriteResult`
> contains a PCollection too as an attribute to cover the ‘tuple’ return
> type issue?
>
> In its last email Reuven mentioned some extra points that could change
> the direction towards one of the options. Anyone else has comments /
> more ideas we may be missing?
>
> On Thu, Jun 27, 2019 at 6:32 PM Reuven Lax  wrote:
> >
> > This is a good question, because many sinks are logically _not_
> windowed. They aren't producing aggregations, so logically they are often
> treated as if they are in the global window (and many internally window
> into the global window first thing).
> >
> > Wait is a nice transform that reuses existing windowing, but I wonder if
> there's another way to model this without relying on windowing. Essentially
> you want way to track element provenance - when all results from a single
> element are flushed through another transform, then trigger a second
> transform. Element provenance is interesting in many other use cases as
> well (e.g. debugging: given an output element, what input elements caused
> it?). Maybe there's a more direct way to model this problem without trying
> to use windowing to track causality?
> >
> > Reuven
> >
> > On Thu, Jun 27, 2019 at 4:38 AM Reza Rokni  wrote:
> >>
> >> The use case of a transform  waiting for a SInk or Sinks to complete is
> very interesting indeed!
> >>
> >> Curious, if a sink internally makes use of a Global Window with
> processing time triggers to push its writes, what mechanism could be used
> to release a transform waiting for a signal from the Sink(s) that all
> processing is done and it can move forward?
> >>
> >> On Thu, 27 Jun 2019 at 03:58, Robert Bradshaw 
> wrote:
> >>>
> >>> Regarding Python, yes and no. Python doesn't distinguish at compile
> >>> time between (1), (2), and (6), but that doesn't mean it isn't part of
> >>> the public API and people might start counting on it, so it's in some
> >>> sense worse. We can also do (3) (which is less cumbersome in Python,
> >>> either returning a tuple or a dict) or (4).
> >>>
> >>> Good point about providing a simple solution (something that can be
> >>> waited on at least) and allowing for with* modifiers to return more.
> >>>
> >>> On Wed, Jun 26, 2019 at 7:08 PM Chamikara Jayalath <
> chamik...@google.com> wrote:
> >>> >
> >>> > BTW regarding Python SDK, I think the answer to this question is
> simpler for Python SDK due to the lack of types. Most examples I know just
> return a PCollection from the Write transform which may or may not be
> ignored by users. If the PCollection is used, the user should be aware of
> the element type of the returned PCollection and should use it accordingly
> in subsequent transforms.
> >>> >
> >>> > Thanks,
> >>> > Cham
> >>> >
> >>> > On Wed, Jun 26, 2019 at 9:57 AM Chamikara Jayalath <
> chamik...@google.com> wrote:
> >>> >>
> >>> >>
> >>> >>
> >>> >> On Wed, Jun 26, 2019 at 5:46 AM Robert Bradshaw <
> rober...@google.com> wrote:
> >>> >>>
> >>> >>> Good question.
> >>> >>>
> >>> >>> I'm not sure what could be done with (5) if it contains no deferred
> >>> >>> objects (e.g there's nothing to wait on).
> >>> >>>
> >>> >>> There is also (6) return PCollection.
> The
> >>> >>> advantage of (2) is that one can migrate to (1) or (6) without
> >>> >>> changing the public API, while giving something to wait on without
> >>> >>> promising anything about its contents.
> >>> >>>
> >>> >>>
> >>> >>> I would probably lean towards (4) for anything that would want to
> >>> >>> return multiple signals/outputs (e.g. successful vs. failed writes)
> >>> >>> and view (3) as being a "cheap" but more cumbersome for the user
> way
> >>> >>> of writing (4). In both cases, more information can be added in a
> >>> >>> forward-compatible way. Technically (4) could extend (3) if one
> wants
> >>> >>> to migrate from (3) to (4) to provide a nicer API in the future.
> (As
> >>> >>> an aside, it would be interesting if any of the schema work that
> lets
> >>> >>> us get rid of tuple tags for elements (e.g. join operations) 

Re: Circular dependencies between DataflowRunner and google cloud IO

2019-07-15 Thread Lukasz Cwik
What do you mean by separate package?

Most of our perf tests are integrated with PerfkitBenchmarker, such as our
JDBC benchmark[1] and Python BigQuery benchmark[2], and point to a specific
IT class/method that contains the benchmarks.

Note that there was a recent discussion about moving away from
PerfkitBenchmarker though and it would be good to follow up with Lukasz
Gajowy about it since he has done a bunch of the existing benchmarks[3].

Finally, there were some docs about benchmarking on our design docs[4] page
that may provide additional details[5, 6].

1:
https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_PerformanceTests_JDBC.groovy
2:
https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_PerformanceTests_BigQueryIO_Python.groovy
3:
https://lists.apache.org/thread.html/dab1c093799248787e8b75e63b66d7389b594b649a4d9a4a5db1cfbb@%3Cdev.beam.apache.org%3E
4: https://beam.apache.org/contribute/design-documents/
5:
https://docs.google.com/document/d/1dA-5s6OHiP_cz-NRAbwapoKF5MEC1wKps4A5tFbIPKE
6:
https://docs.google.com/document/d/1Cb7XVmqe__nA_WCrriAifL-3WCzbZzV4Am5W_SkQLeA

On Mon, Jul 15, 2019 at 8:03 AM Michał Walenia 
wrote:

> Thanks for the information. I don't think that the task you mentioned fits
> my case - I want to run a performance test on a real service and running it
> with a battery of other tests doesn't make much sense to me.
> It's similar to the case of other integration tests that are excluded from
> the task you linked.
> Is keeping the test in a separate package viable in your opinion?
>
> Thanks!
> Michal
>
> On Fri, Jul 12, 2019 at 3:45 PM Lukasz Cwik  wrote:
>
>> Yes, there is a dependency between Dataflow -> GCP IOs and this is
>> expected since Dataflow depends on parts of those implementations for its
>> own execution purposes. We definitely don't want GCP IOs depending on
>> Dataflow since we would like users of other runners to still be able to use
>> GCP IOs without bringing in Dataflow specific dependencies.
>>
>> There is already a test definition inside of the Dataflow runner package
>> that is meant to run integration tests defined in the GCP IO package named
>> googleCloudPlatformLegacyWorkerIntegrationTest[1] task. Does this fit your
>> needs?
>>
>> 1:
>> https://github.com/apache/beam/blob/0fce2b88660f52dae638697e1472aa108c982ae6/runners/google-cloud-dataflow-java/build.gradle#L318
>>
>> On Fri, Jul 12, 2019 at 5:17 AM Michał Walenia <
>> michal.wale...@polidea.com> wrote:
>>
>>> Hi all,
>>> recently when I was trying to implement a performance test of
>>> BigQueryIO, I ran into an issue when trying to run the test on Dataflow.
>>> The problem was that I encountered a circular dependency when compiling
>>> the tests. I added the test in org.apache.beam.sdk.io.gcp.bigquery package,
>>> so I also needed to add DataflowRunner as a dependency in order to launch
>>> the test. The error was that DataflowRunner package depends on
>>> org.apache.beam.sdk.io.gcp.bigquery package (for example in [1]).
>>> Should it be like that?
>>> For now, in order to solve the problem, I intend to move the performance
> test to its own package in my PR [2]. I am wondering about the right
>>> approach to this - shouldn’t we decouple the DataflowRunner code from IOs?
>>> If not, what’s the reason behind the way the modules are organized?
>>> I noticed that 5 tests are excluded from the integrationTest task in
>>> io.google-cloud-platform.bigquery build.gradle file [3]. Are they
>>> launched on Dataflow anywhere? I couldn’t find their usage except for the
>>> exclusions.
>>>
>>> [1] PubSubIO translations section in DataflowRunner.java
>>> <https://github.com/apache/beam/blob/0fce2b88660f52dae638697e1472aa108c982ae6/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L1104>
>>> [2] My PR <https://github.com/apache/beam/pull/9041>
>>> [3] DefaultCoderCloudObjectTranslatorRegistrar
>>> <https://github.com/apache/beam/blob/0fce2b88660f52dae638697e1472aa108c982ae6/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/util/DefaultCoderCloudObjectTranslatorRegistrar.java#L45>
>>>
>>> Best regards
>>> Michal
>>>
>>> --
>>>
>>> Michał Walenia
>>> Polidea <https://www.polidea.com/> | Software Engineer
>>>
>>> M: +48 791 432 002 <+48791432002>
>>> E: michal.wale...@polidea.com
>>>
>>> Unique Tech
>>> Check out our projects! <https://www.polidea.com/our-work>
>>>
>>
>
> --
>
> Michał Walenia
> Polidea <https://www.polidea.com/> | Software Engineer
>
> M: +48 791 432 002 <+48791432002>
> E: michal.wale...@polidea.com
>
> Unique Tech
> Check out our projects! <https://www.polidea.com/our-work>
>


Re: Beam/Samza Ensuring At Least Once semantics

2019-07-12 Thread Lukasz Cwik
That seems to be an issue with how the commit is being restarted in Samza
and not with the Kafka source.

On Thu, Jul 11, 2019 at 4:44 PM Deshpande, Omkar 
wrote:

> Yes, we are resuming from samza’s last commit. But the problem is that the
> last commit was done for data in the window that is not completely
> processed.
>
>
>
> *From: *Lukasz Cwik 
> *Date: *Wednesday, July 10, 2019 at 11:07 AM
> *To: *dev 
> *Cc: *"LeVeck, Matt" , "Deshpande, Omkar" <
> omkar_deshpa...@intuit.com>, Xinyu Liu , Xinyu Liu
> , Samarth Shetty , "Audo,
> Nicholas" , "Cesar, Scott" <
> scott_ce...@intuit.com>, "Ho, Tom" , "
> d...@samza.apache.org" 
> *Subject: *Re: Beam/Samza Ensuring At Least Once semantics
>
>
>
> This email is from an external sender.
>
>
>
> When you restart the application, are you resuming it from Samza's last
> commit?
>
>
>
> Since the exception is thrown after the GBK, all the data could be read
> from Kafka and forwarded to the GBK operator inside of Samza and
> checkpointed in Kafka before the exception is ever thrown.
>
>
>
> On Tue, Jul 9, 2019 at 8:34 PM Benenson, Mikhail <
> mikhail_benen...@intuit.com> wrote:
>
> Hi
>
>
>
> I have run a few experiments to verify if 'at least once' processing is
> guaranteed on Beam 2.13.0 with Samza Runner 1.1.0
>
>
>
> Beam application is a slightly modified Stream Word Count from Beam
> examples:
>
>- read strings from input Kafka topic, print (topic, partition,
>offset, value)
>- convert values to pairs (value, 1)
>- grouping in Fixed Windows with duration 30 sec
>- sum per key
>- throw exception, if key starts with 'm'
>- write (key, sum) to output Kafka topic
>
>
>
> Tried KafkaIO.read() with and without commitOffsetsInFinalize(); there is
> no difference in results.
>
>
>
> Please, see src code attached.
>
>
>
> Environment:
>
>- Run with local zk & kafka, pre-create input & output topics with 1
>partition.
>- samza.properties contains "task.commit.ms=2000". According to samza
>doc "this property determines how often a checkpoint is written. The value
>is the time between checkpoints, in milliseconds". See
>
> http://samza.apache.org/learn/documentation/latest/jobs/samza-configurations.html#checkpointing.
>Please, see samza config file and run script attached.
>
>
>
>
>
> *Scenario 1: Exception in transformation*
>
>
>
> Run
>
>- Write 'a', 'b', 'c', 'm', 'd', 'e' into input topic
>- start Beam app
>- verify that the app log contains "read from topic=XXX, part=0,
>offset=100, val: e". Because input topic has only one partition, this means
>all data have been read from Kafka.
>- wait until the app terminates because of the exception while
>processing 'm'
>
>
>
> Expectation
>
> The order of processing after grouping is not specified, so some data
> could be written to output topic before application terminates, but I
> expect that value=m with offset 98 and all later records must NOT be marked
> as processed, so if I restart Beam app, I expect it again throws the
> exception when processing value=m.
>
> Comment: throwing exception in transformation is not a good idea, but such
> exception could be the result of application error. So, expectation is that
> after fixing the error, and restarting Beam app, it should process the
> record that caused the error.
>
>
>
> Results
>
> After I restarted the app, it does NOT re-process value m and does not
> throw an exception. If I add a new value 'f' into the input topic, I see "read
> from topic=XXX, part=0, offset=101, val: f", and after some time I see 'm'
> in the output topic. So, the record with value 'm' is NOT processed.
>
>
>
>
>
> *Scenario 2: App termination*
>
>
>
> Run
>
>- Write 'g', 'h', 'i', 'j' into input topic
>- start Beam app
>- verify that the app log contains "read from topic=XXX, part=0,
>offset=105, val: j". Because input topic has only one partition, this means
>that all data has been read from Kafka.
>- wait about 10 sec, then terminate Beam app. The idea is to terminate
>app when 'g', 'h', 'i', 'j' are waiting in the 30 sec Fixed Windows, but
>after task.commit.ms=2000 has passed, so offsets are committed.
>
>
>
> Expectation
>
> As records 'g', 'h', 'i', 'j' are NOT processed, I expect that after the app
> is restarted, it again reads 'g', 'h', 'i', 'j' from the input topic and
> processes these records.
>
>
>
> Results

Re: [Java] Using a complex datastructure as Key for KV

2019-07-12 Thread Lukasz Cwik
Additional coders would be useful. Note that we usually don't have coders
for specific collection types like ArrayList but prefer to have Coders for
their general counterparts like List, Map, Iterable, etc.

There has been discussion in the past to make the MapCoder a deterministic
coder when a coder is required to be deterministic. There are a few people
working on schema support within Apache Beam that might be able to provide
guidance (+Reuven Lax  +Brian Hulette
).
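
[Editor's note: a minimal illustration of the determinism property discussed above, in plain JDK Java only — this is not the Beam Coder API, and the class and method names here are made up. The reason a sorted map like TreeMap can back a deterministic coder is that its iteration order is fixed by key order, so structurally equal maps always encode to identical bytes, which is exactly what a GroupByKey key coder requires:]

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

public class DeterministicMapEncoding {
    // Encodes a TreeMap<String, Integer> to bytes. TreeMap iterates in
    // sorted key order, so two equal maps produce identical byte arrays
    // regardless of insertion order.
    static byte[] encode(TreeMap<String, Integer> map) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(map.size());
            for (Map.Entry<String, Integer> e : map.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeInt(e.getValue());
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> a = new TreeMap<>();
        a.put("b", 2);
        a.put("a", 1);
        TreeMap<String, Integer> b = new TreeMap<>();
        b.put("a", 1);  // same entries, different insertion order
        b.put("b", 2);
        System.out.println(java.util.Arrays.equals(encode(a), encode(b)));  // true
    }
}
```

A HashMap gives no such iteration-order guarantee, which is why a general MapCoder cannot be deterministic without sorting.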

On Fri, Jul 12, 2019 at 11:05 AM Shannon Duncan 
wrote:

> I have a working TreeMapCoder now. Got it all setup and done, and the
> GroupByKey is accepting it.
>
> Thanks for all the help. I need to read up more on contributing guidelines
> then I'll PR the coder into the SDK. Also willing to write coders for
> things such as ArrayList etc if people want them.
>
> On Fri, Jul 12, 2019 at 9:31 AM Shannon Duncan 
> wrote:
>
>> Aha, makes sense. Thanks!
>>
>> On Fri, Jul 12, 2019 at 9:26 AM Lukasz Cwik  wrote:
>>
>>> TreeMapCoder.of(StringUtf8Coder.of(), ListCoder.of(VarIntCoder.of()));
>>>
>>> On Fri, Jul 12, 2019 at 10:22 AM Shannon Duncan <
>>> joseph.dun...@liveramp.com> wrote:
>>>
>>>> So I have my custom coder created for TreeMap and I'm ready to set it...
>>>>
>>>> So my Type is "TreeMap<String, List<Integer>>"
>>>>
>>>> What do I put for ".setCoder(TreeMapCoder.of(???, ???))"
>>>>
>>>> On Thu, Jul 11, 2019 at 8:21 PM Rui Wang  wrote:
>>>>
>>>>> Hi Shannon,  [1] will be a good start on coder in Java SDK.
>>>>>
>>>>>
>>>>> [1]
>>>>> https://beam.apache.org/documentation/programming-guide/#data-encoding-and-type-safety
>>>>>
>>>>> Rui
>>>>>
>>>>> On Thu, Jul 11, 2019 at 3:08 PM Shannon Duncan <
>>>>> joseph.dun...@liveramp.com> wrote:
>>>>>
>>>>>> Was able to get it to use ArrayList by doing List<List<Integer>>
>>>>>> result = new ArrayList<List<Integer>>();
>>>>>>
>>>>>> Then storing my keys in a separate array that I'll pass in as a side
>>>>>> input to key for the list of lists.
>>>>>>
>>>>>> Thanks for the help, lemme know more in the future about how coders
>>>>>> work and instantiate and I'd love to help contribute by adding some new
>>>>>> coders.
>>>>>>
>>>>>> - Shannon
>>>>>>
>>>>>> On Thu, Jul 11, 2019 at 4:59 PM Shannon Duncan <
>>>>>> joseph.dun...@liveramp.com> wrote:
>>>>>>
>>>>>>> Will do. Thanks. A new coder for deterministic Maps would be great
>>>>>>> in the future. Thank you!
>>>>>>>
>>>>>>> On Thu, Jul 11, 2019 at 4:58 PM Rui Wang  wrote:
>>>>>>>
>>>>>>>> I think Mike refers to ListCoder
>>>>>>>> <https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/ListCoder.java>
>>>>>>>>  which
>>>>>>>> is deterministic if its element is the same. Maybe you can search the 
>>>>>>>> repo
>>>>>>>> for examples of ListCoder?
>>>>>>>>
>>>>>>>>
>>>>>>>> -Rui
>>>>>>>>
>>>>>>>> On Thu, Jul 11, 2019 at 2:55 PM Shannon Duncan <
>>>>>>>> joseph.dun...@liveramp.com> wrote:
>>>>>>>>
>>>>>>>>> So ArrayList doesn't work either, so just a standard List?
>>>>>>>>>
>>>>>>>>> On Thu, Jul 11, 2019 at 4:53 PM Rui Wang 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Shannon, I agree with Mike that List is a good workaround if your
>>>>>>>>>> element within the list is deterministic and you are eager to get
>>>>>>>>>> your new pipeline working.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Let me send back some pointers to adding new coder later.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -Rui
>>>>>>>>>>

Re: Circular dependencies between DataflowRunner and google cloud IO

2019-07-12 Thread Lukasz Cwik
Yes, there is a dependency between Dataflow -> GCP IOs and this is expected
since Dataflow depends on parts of those implementations for its own
execution purposes. We definitely don't want GCP IOs depending on Dataflow
since we would like users of other runners to still be able to use GCP IOs
without bringing in Dataflow specific dependencies.

There is already a test definition inside of the Dataflow runner package
that is meant to run integration tests defined in the GCP IO package named
googleCloudPlatformLegacyWorkerIntegrationTest[1] task. Does this fit your
needs?

1:
https://github.com/apache/beam/blob/0fce2b88660f52dae638697e1472aa108c982ae6/runners/google-cloud-dataflow-java/build.gradle#L318

On Fri, Jul 12, 2019 at 5:17 AM Michał Walenia 
wrote:

> Hi all,
> recently when I was trying to implement a performance test of BigQueryIO,
> I ran into an issue when trying to run the test on Dataflow.
> The problem was that I encountered a circular dependency when compiling
> the tests. I added the test in org.apache.beam.sdk.io.gcp.bigquery package,
> so I also needed to add DataflowRunner as a dependency in order to launch
> the test. The error was that DataflowRunner package depends on
> org.apache.beam.sdk.io.gcp.bigquery package (for example in [1]). Should
> it be like that?
> For now, in order to solve the problem, I intend to move the performance
> test to its own package in my PR [2]. I am wondering about the right
> approach to this - shouldn’t we decouple the DataflowRunner code from IOs?
> If not, what’s the reason behind the way the modules are organized?
> I noticed that 5 tests are excluded from the integrationTest task in
> io.google-cloud-platform.bigquery build.gradle file [3]. Are they
> launched on Dataflow anywhere? I couldn’t find their usage except for the
> exclusions.
>
> [1] PubSubIO translations section in DataflowRunner.java
> 
> [2] My PR 
> [3] DefaultCoderCloudObjectTranslatorRegistrar
> 
>
> Best regards
> Michal
>
> --
>
> Michał Walenia
> Polidea  | Software Engineer
>
> M: +48 791 432 002 <+48791432002>
> E: michal.wale...@polidea.com
>
> Unique Tech
> Check out our projects! 
>


[VOTE] Vendored Dependencies Release

2019-07-11 Thread Lukasz Cwik
Please review the release of the following artifacts that we vendor:
 * beam-vendor-grpc_1_21_0
 * beam-vendor-guava-26_0-jre
 * beam-vendor-bytebuddy-1_9_3

Hi everyone,
Please review and vote on the release candidate #3 for the
org.apache.beam:beam-vendor-grpc_1_21_0:0.1,
org.apache.beam:beam-vendor-guava-26_0-jre:0.1, and
org.apache.beam:beam-vendor-bytebuddy-1_9_3:0.1 as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)


The complete staging area is available for your review, which includes:
* the official Apache source release to be deployed to dist.apache.org [1],
which is signed with the key with fingerprint
EAD5DE293F4A03DD2E77565589E68A56E371CCA2 [2],
* all artifacts to be deployed to the Maven Central Repository [3],
* commit hash "0fce2b88660f52dae638697e1472aa108c982ae6" [4],

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

Thanks,
Luke

[1] https://dist.apache.org/repos/dist/dev/beam/vendor/
[2] https://dist.apache.org/repos/dist/release/beam/KEYS
[3] https://repository.apache.org/content/repositories/orgapachebeam-1078/
[4]
https://github.com/apache/beam/commit/0fce2b88660f52dae638697e1472aa108c982ae6


Re: [VOTE] Vendored Dependencies Release

2019-07-10 Thread Lukasz Cwik
No, the classes weren't expected inside of Guava. Cancelling this release
candidate.

I don't believe the protos/certs/keys matter and were part of our prior
1.13.1 release as well[1].

I found out that we stopped validating the contents of the vendored jar as
part of the release process and opened up pr/9036[2] to fix our validation.

Kai, I will try to add the vendored bytebuddy to the next release, it would
be useful if you could help review the PRs related to vendoring.

1:
https://repo1.maven.org/maven2/org/apache/beam/beam-vendor-grpc-1_13_1/0.2/beam-vendor-grpc-1_13_1-0.2.jar
2: https://github.com/apache/beam/pull/9036

On Wed, Jul 10, 2019 at 1:34 PM Kai Jiang  wrote:

> pull/8357 <https://github.com/apache/beam/pull/8357> proposes to vendor
> bytebuddy artifact.
> Is it possible to release "beam-vendor-bytebuddy-1_9_3" in next release
> candidate?
>
> Best,
> Kai
>
> On Wed, Jul 10, 2019 at 11:31 AM Kenneth Knowles  wrote:
>
>> grpc: jar contains certs, keys, protos at the top level; intended?
>>
>> guava: jar contains classes not in vendored prefix, with prefixes such as
>> com/google/j2objc, org/codehaus/mojo, com/google/errorprone,
>> org/checkerframework, javax/annotation
>>
>> On Tue, Jul 9, 2019 at 3:34 PM Lukasz Cwik  wrote:
>>
>>> Please review the release of the following artifacts that we vendor:
>>>  * beam-vendor-grpc_1_21_0
>>>  * beam-vendor-guava-26_0-jre
>>>
>>> Hi everyone,
>>> Please review and vote on the release candidate #2 for the
>>> org.apache.beam:beam-vendor-grpc_1_21_0:0.1 and
>>> org.apache.beam:beam-vendor-guava-26_0-jre:0.1, as follows:
>>> [ ] +1, Approve the release
>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>
>>>
>>> The complete staging area is available for your review, which includes:
>>> * the official Apache source release to be deployed to dist.apache.org
>>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__dist.apache.org=DwMFaQ=r2dcLCtU9q6n0vrtnDw9vg=w9NQdMas-PbABIe2Vnjx7A=XQ3qyG5SAoY6jY4Pv3Bn3fVfQFbjZaIMJovq2d1Y-2c=csOeuLf-hGb4WMuN0WlDgcN7dYHjh1S0CGpw5i12sQ4=>
>>> [1], which is signed with the key with fingerprint
>>> EAD5DE293F4A03DD2E77565589E68A56E371CCA2 [2],
>>> * all artifacts to be deployed to the Maven Central Repository [3],
>>> * commit hash "b4efbb23cc5dec80b8bbd8745c62efecdadfa236" [4],
>>>
>>> The vote will be open for at least 72 hours. It is adopted by majority
>>> approval, with at least 3 PMC affirmative votes.
>>>
>>> Thanks,
>>> Luke
>>>
>>> [1] https://dist.apache.org/repos/dist/dev/beam/vendor/
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__dist.apache.org_repos_dist_dev_beam_vendor_=DwMFaQ=r2dcLCtU9q6n0vrtnDw9vg=w9NQdMas-PbABIe2Vnjx7A=XQ3qyG5SAoY6jY4Pv3Bn3fVfQFbjZaIMJovq2d1Y-2c=vDAP44KBSLriK6epP13aKgiALwJLZ9SKau_naQkkJHg=>
>>> [2] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__dist.apache.org_repos_dist_release_beam_KEYS=DwMFaQ=r2dcLCtU9q6n0vrtnDw9vg=w9NQdMas-PbABIe2Vnjx7A=XQ3qyG5SAoY6jY4Pv3Bn3fVfQFbjZaIMJovq2d1Y-2c=xpYP-T_xTF78zcT6jG8fBp4_SBEE1y7TwHvi8n2HbhE=>
>>> [3]
>>> https://repository.apache.org/content/repositories/orgapachebeam-1076/
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__repository.apache.org_content_repositories_orgapachebeam-2D1076_=DwMFaQ=r2dcLCtU9q6n0vrtnDw9vg=w9NQdMas-PbABIe2Vnjx7A=XQ3qyG5SAoY6jY4Pv3Bn3fVfQFbjZaIMJovq2d1Y-2c=sg01Cg5xZfqLbyvPHiggexIjz8xR8wxMVi4d-thIXBo=>
>>> [4]
>>> https://github.com/apache/beam/commit/b4efbb23cc5dec80b8bbd8745c62efecdadfa236
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_beam_commit_b4efbb23cc5dec80b8bbd8745c62efecdadfa236=DwMFaQ=r2dcLCtU9q6n0vrtnDw9vg=w9NQdMas-PbABIe2Vnjx7A=XQ3qyG5SAoY6jY4Pv3Bn3fVfQFbjZaIMJovq2d1Y-2c=IEe6OXABE34s9YIOPvwpwh7CbnWl8YOZTrj6476XWpI=>
>>>
>>


Re: Beam/Samza Ensuring At Least Once semantics

2019-07-10 Thread Lukasz Cwik
When you restart the application, are you resuming it from Samza's last
commit?

Since the exception is thrown after the GBK, all the data could be read
from Kafka and forwarded to the GBK operator inside of Samza and
checkpointed in Kafka before the exception is ever thrown.

On Tue, Jul 9, 2019 at 8:34 PM Benenson, Mikhail <
mikhail_benen...@intuit.com> wrote:

> Hi
>
>
>
> I have run a few experiments to verify if 'at least once' processing is
> guaranteed on Beam 2.13.0 with Samza Runner 1.1.0
>
>
>
> Beam application is a slightly modified Stream Word Count from Beam
> examples:
>
>- read strings from input Kafka topic, print (topic, partition,
>offset, value)
>- convert values to pairs (value, 1)
>- grouping in Fixed Windows with duration 30 sec
>- sum per key
>- throw exception, if key starts with 'm'
>- write (key, sum) to output Kafka topic
>
>
>
> Tried KafkaIO.read() with and without commitOffsetsInFinalize(); there is
> no difference in results.
>
>
>
> Please, see src code attached.
>
>
>
> Environment:
>
>- Run with local zk & kafka, pre-create input & output topics with 1
>partition.
>- samza.properties contains "task.commit.ms=2000". According to samza
>doc "this property determines how often a checkpoint is written. The value
>is the time between checkpoints, in milliseconds". See
>
> http://samza.apache.org/learn/documentation/latest/jobs/samza-configurations.html#checkpointing.
>Please, see samza config file and run script attached.
>
>
>
>
>
> *Scenario 1: Exception in transformation*
>
>
>
> Run
>
>- Write 'a', 'b', 'c', 'm', 'd', 'e' into input topic
>- start Beam app
>- verify that the app log contains "read from topic=XXX, part=0,
>offset=100, val: e". Because input topic has only one partition, this means
>all data have been read from Kafka.
>- wait until the app terminates because of the exception while
>processing 'm'
>
>
>
> Expectation
>
> The order of processing after grouping is not specified, so some data
> could be written to output topic before application terminates, but I
> expect that value=m with offset 98 and all later records must NOT be marked
> as processed, so if I restart Beam app, I expect it again throws the
> exception when processing value=m.
>
> Comment: throwing exception in transformation is not a good idea, but such
> exception could be the result of application error. So, expectation is that
> after fixing the error, and restarting Beam app, it should process the
> record that caused the error.
>
>
>
> Results
>
> After I restarted the app, it does NOT re-process value m and does not
> throw an exception. If I add a new value 'f' into the input topic, I see "read
> from topic=XXX, part=0, offset=101, val: f", and after some time I see 'm'
> in the output topic. So, the record with value 'm' is NOT processed.
>
>
>
>
>
> *Scenario 2: App termination*
>
>
>
> Run
>
>- Write 'g', 'h', 'i', 'j' into input topic
>- start Beam app
>- verify that the app log contains "read from topic=XXX, part=0,
>offset=105, val: j". Because input topic has only one partition, this means
>that all data has been read from Kafka.
>- wait about 10 sec, then terminate Beam app. The idea is to terminate
>app when 'g', 'h', 'i', 'j' are waiting in the 30 sec Fixed Windows, but
>after task.commit.ms=2000 has passed, so offsets are committed.
>
>
>
> Expectation
>
> As records 'g', 'h', 'i', 'j' are NOT processed, I expect that after the app
> is restarted, it again reads 'g', 'h', 'i', 'j' from the input topic and
> processes these records.
>
>
>
> Results
>
> After I restarted the app, it does NOT re-process the 'g', 'h', 'i', 'j'
> values. If I add a new value 'k' into the input topic, I see "read from
> topic=XXX, part=0, offset=106, val: k", and after some time I see 'k' in the
> output topic. So, the records with values 'g', 'h', 'i', 'j' are NOT processed.
>
>
>
>
>
> Based on these results I'm inclined to conclude that Beam with the Samza
> runner does NOT provide an 'at least once' guarantee for processing.
>
>
>
> Have I missed something?
>
>
>
> --
>
> Michael Benenson
>
>
>
>
>
> *From: *"LeVeck, Matt" 
> *Date: *Monday, July 1, 2019 at 5:28 PM
> *To: *"Deshpande, Omkar" , "Benenson,
> Mikhail" , Xinyu Liu ,
> Xinyu Liu , Samarth Shetty ,
> "Audo, Nicholas" 
> *Subject: *Beam/Samza Ensuring At Least Once semantics
>
>
>
> We’re seeing some behavior when using Beam’s KafkaIO and Samza as the
> runner that suggests checkpoints are getting committed even when an error
> gets thrown in the Beam pipeline while processing a batch. Do you all
> have a recommended set of settings/patterns for using Beam with Samza to
> ensure that checkpoints are only updated after successful processing (i.e.
> the transforms succeed and the message is sent to the Beam pipeline’s final
> output sink)?
>
>
>
> Our current settings for Samza are:
>
> 

Re: Phrase triggering jobs problem

2019-07-10 Thread Lukasz Cwik
This has happened in the past. Usually there is some issue where Jenkins
isn't notified of new PRs by Github or doesn't see the PR phrases and hence
Jenkins sits around idle. This is usually fixed after a few hours without
any action on our part.

On Wed, Jul 10, 2019 at 10:28 AM Katarzyna Kucharczyk <
ka.kucharc...@gmail.com> wrote:

> Hi all,
>
> Hope it's not duplicate but I can't find if any issue with phrase
> triggering in Jenkins was already here.
> Currently, I started third PR and no test were triggered there. I tried to
> trigger some tests manually, but with no effect.
>
> Am I missing something?
>
> Here are links to my problematic PRs:
> https://github.com/apache/beam/pull/9033
> https://github.com/apache/beam/pull/9034
> https://github.com/apache/beam/pull/9035
>
> Thanks,
> Kasia
>


[VOTE] Vendored Dependencies Release

2019-07-09 Thread Lukasz Cwik
Please review the release of the following artifacts that we vendor:
 * beam-vendor-grpc_1_21_0
 * beam-vendor-guava-26_0-jre

Hi everyone,
Please review and vote on the release candidate #2 for the
org.apache.beam:beam-vendor-grpc_1_21_0:0.1 and
org.apache.beam:beam-vendor-guava-26_0-jre:0.1, as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)


The complete staging area is available for your review, which includes:
* the official Apache source release to be deployed to dist.apache.org [1],
which is signed with the key with fingerprint
EAD5DE293F4A03DD2E77565589E68A56E371CCA2 [2],
* all artifacts to be deployed to the Maven Central Repository [3],
* commit hash "b4efbb23cc5dec80b8bbd8745c62efecdadfa236" [4],

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

Thanks,
Luke

[1] https://dist.apache.org/repos/dist/dev/beam/vendor/
[2] https://dist.apache.org/repos/dist/release/beam/KEYS
[3] https://repository.apache.org/content/repositories/orgapachebeam-1076/
[4]
https://github.com/apache/beam/commit/b4efbb23cc5dec80b8bbd8745c62efecdadfa236


Re: Apache Beam issue | Reading Avro files and pushing to Bigquery

2019-07-09 Thread Lukasz Cwik
+user  (please use user@ for questions about using
the product and restrict to dev@ for questions related to developing the
product).

Can you provide the rest of the failure reason (and any stack traces from
the workers related to the failures)?

On Tue, Jul 9, 2019 at 11:04 AM Dhiraj Sardana 
wrote:

> Hello,
>
>
>
> We’re using Apache Beam with Google dataflow. We’ve a pipeline which reads
> data from google storage (Avro files), transforms the same and pushes to
> Bigquery.
>
>
>
> Error we're getting: the pipeline processes the data for some time, then gets
> stuck and, in the end, fails (and data does not reach BigQuery):
>
>
>
> Here is error snippet:
>
> Workflow failed. Causes: S92:AvroIO.ReadAll/Read all via
> FileBasedSource/Reshuffle/Reshuffle/GroupBy...
>
> Workflow failed. Causes: S92:AvroIO.ReadAll/Read all via
> FileBasedSource/Reshuffle/Reshuffle/GroupByKey/Read+AvroIO.ReadAll/Read all
> via
> FileBasedSource/Reshuffle/Reshuffle/GroupByKey/GroupByWindow+AvroIO.ReadAll/Read
> all via
> FileBasedSource/Reshuffle/Reshuffle/ExpandIterable+AvroIO.ReadAll/Read all
> via FileBasedSource/Reshuffle/Values/Values/Map+AvroIO.ReadAll/Read all via
> FileBasedSource/Read ….
>
>
>
> Code snippet:
>
> PCollection<String> records =
>     pipeline.apply(Create.of(fileList)).setCoder(StringUtf8Coder.of());
> PCollection<GenericRecord> events =
>     records.apply(AvroIO.readAllGenericRecords(userDefinedSchema));
>
>
>
> Apache Beam version: We tried with 2.12.0 & 2.8.0 both, but same error.
>
>
>
> It would be really helpful if we can get some hints/solution to this
> problem and let us know for more info.
>
>
>
>
>
> Regards,
>
> Dhiraj
>
>
> Dhiraj Sardana
> JEE Developer
> 
> *NetEnt* | Better Gaming™
> T: +46 760 024 812 <+46%20760%20024%20812>, M: +46 760 024 812
> <+46%20760%20024%20812>
> dhiraj.sard...@netent.com, www.netent.com
> Address: NetEnt AB (publ), Vasagatan 16, 111 20, Stockholm, SE
> This email and any attachments are confidential and may be legally
> privileged and protected by copyright. If you are not the intended
> recipient of this email you should not copy it or disclose its contents to
> anyone. If you have received this email in error, please notify the sender
> immediately and delete the email. Views or opinions in this email are
> solely those of the author. Unencrypted Internet communications are not
> secure and the sender does not accept responsibility for interception of
> this message by third parties. This communication is not intended to form a
> binding contract unless expressly indicated to the contrary and properly
> authorized. The recipient should scan this email and any attachments for
> the presence of viruses. The sender accepts no liability for any viruses
> transmitted in this email.
>


Re: Unable to start BEAM sql shell

2019-07-09 Thread Lukasz Cwik
Thanks for the fixes. I have reviewed both and merged them.

On Tue, Jul 9, 2019 at 10:59 AM Kyle Weaver  wrote:

> I would also make sure that you are running the command from the root of
> the repo.
>
> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
> | +1650203
>
>
> On Tue, Jul 9, 2019 at 10:29 AM Rui Wang  wrote:
>
>> By applying the change above on a fresh clone from master, the command to
>> bundle SQL shell can pass on my mac. My Gradle is also 5.4.1 and it's Java
>> 8. A PR  was cut to apply this
>> change to Beam repo.
>>
>> I also found that the command to run SQL shell binary is also broken so
>> sending another PR to update SQL shell doc:
>> https://github.com/apache/beam/pull/9021
>>
>>
>> -Rui
>> 
>>
>>
>>
>> On Mon, Jul 8, 2019 at 8:15 PM Thomas K.  wrote:
>>
>>> Hi,
>>>
>>> Thanks for the advice. I made a fresh clone from master and also removed
>>> the configuration: shadow line from the build.gradle file. However. I'm
>>> still getting the same error message.
>>>
>>> A bit more about my env : I'm running Gradle 5.4.1 and Java 8 on Windows.
>>>
>>> Any thoughts?
>>>
>>> On Tue, Jul 9, 2019 at 1:44 AM Rui Wang  wrote:
>>>
 Indeed it's broken on shadow configuration.  I found that if removing
 configuration: "shadow" from [1], this command will pass.


 [1]:
 https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/shell/build.gradle#L31

 On Mon, Jul 8, 2019 at 1:10 PM Kyle Weaver  wrote:

> Hi Thomas,
>
> You probably need to make sure your clone of the Beam repo is up to
> date. Also, it looks like there's another bug so I filed a Jira:
> https://issues.apache.org/jira/browse/BEAM-7708
>
> Thanks,
> Kyle
>
> Kyle Weaver | Software Engineer | github.com/ibzib |
> kcwea...@google.com | +1650203
>
>
> On Mon, Jul 8, 2019 at 11:59 AM Thomas K. 
> wrote:
>
>> Hi,
>>
>> I'm following the instructions on this page -
>>
>> https://beam.apache.org/documentation/dsls/sql/shell/
>>
>> and running the following command
>>
>> gradlew -p sdks/java/extensions/sql/shell 
>> -Pbeam.sql.shell.bundled=':runners:flink:1.5,:sdks:java:io:kafka' 
>> installDist
>>
>>
>>
>>
>>
>> However, it fails with the error:
>> A problem occurred evaluating project
>> ':sdks:java:extensions:sql:shell'.
>> > Project with path '':runners:flink:1.5' could not be found in
>> project ':sdks:java:extensions:sql:shell'.
>>
>>
>>
>> How do I get it to  recognize all the extensions so that I can run
>> the SQL shell?
>>
>> Thanks.
>>
>>


Re: [Discuss] Create stackoverflow tags for python, java and go SDKs?

2019-07-09 Thread Lukasz Cwik
That sounds like a good idea to me.

On Wed, Jul 3, 2019 at 10:45 AM Rui Wang  wrote:

> Hi Community,
>
> When reading apache-beam related questions on Stack Overflow, it happens
> that some questions only mention a version number (e.g. 2.8.0) but do not
> mention which SDK they relate to. Sometimes I can tell which SDK it is from
> code snippets; sometimes I cannot, as there is no code snippet. So in order
> to answer those questions I need to first comment and ask which SDK.
>
> I noticed that there is no tag for a specific SDK for Apache Beam. Adding
> such tags would be helpful because:
> 1. Questions with such a tag indicate which SDK they are about.
> 2. If questions do not mention the SDK and lack such a tag, I can (or anyone
> else) help tag them.
>
> Note that creating tags is a privilege in SO that requires >1500
> reputation[1]. If people generally are ok with this idea, we will need to
> ask for help in the community to see who could be able to create tags.
>
>
> [1]: https://stackoverflow.com/help/privileges/create-tags
>
> Rui
>


Re: [VOTE] Vendored dependencies release process

2019-07-08 Thread Lukasz Cwik
Thanks for taking a look. I followed up on your questions.

On Mon, Jul 8, 2019 at 3:58 PM Udi Meiri  wrote:

> I left some comments. As I'm new to the Beam release process, my questions
> might be trivial to someone actually performing the release.
>
> On Tue, Jul 2, 2019 at 4:49 PM Lukasz Cwik  wrote:
>
>> Please vote based on the vendored dependencies release process as
>> discussed[1] and documented[2].
>>
>> Please vote as follows:
>> +1: Adopt the vendored dependency release process
>> -1: The vendored release process needs to change because ...
>>
>> Since many people in the US may be out due to the holiday schedule, I'll
>> try to close the vote and tally the results on July 9th so please vote
>> before then.
>>
>> 1:
>> https://lists.apache.org/thread.html/e2c49a5efaee2ad416b083fbf3b9b6db60fdb04750208bfc34cecaf0@%3Cdev.beam.apache.org%3E
>> 2: https://s.apache.org/beam-release-vendored-artifacts
>>
>


Re: [VOTE] Vendored dependencies release process

2019-07-06 Thread Lukasz Cwik
+1

On Wed, Jul 3, 2019 at 10:24 AM Jens Nyman  wrote:

> +1
>
> On 2019/07/02 23:49:10, Lukasz Cwik  wrote:
> > Please vote based on the vendored dependencies release process as>
> > discussed[1] and documented[2].>
> >
> > Please vote as follows:>
> > +1: Adopt the vendored dependency release process>
> > -1: The vendored release process needs to change because ...>
> >
> > Since many people in the US may be out due to the holiday schedule,
> I'll>
> > try to close the vote and tally the results on July 9th so please vote>
> > before then.>
> >
> > 1:>
> >
> https://lists.apache.org/thread.html/e2c49a5efaee2ad416b083fbf3b9b6db60fdb04750208bfc34cecaf0@%3Cdev.beam.apache.org%3E>
>
> > 2: https://s.apache.org/beam-release-vendored-artifacts>
> >
>


Re: Stop using Perfkit Benchmarker tool in all tests?

2019-07-03 Thread Lukasz Cwik
Makes sense to me to move forward with your suggestion.

On Wed, Jul 3, 2019 at 3:57 AM Łukasz Gajowy 
wrote:

> Are there features in Perfkit that we would like to be using that we
>> aren't?
>>
>
> Besides the Kubernetes-related code I mentioned above (which, I believe,
> can easily be replaced), I don't see any added value in having Perfkit. The
> Kubernetes parts could be replaced with a set of fine-grained Gradle tasks
> invoked by other high-level tasks and Jenkins job steps. There also seem
> to be some Gradle + Kubernetes plugins out there that might prove useful
> here (though I haven't done solid research in that area).
>
>
>> Can we make the integration with Perfkit less brittle?
>>
>
> There was an idea to move all Beam benchmark code from Perfkit (
> beam_benchmark_helper.py
> <https://github.com/GoogleCloudPlatform/PerfKitBenchmarker/blob/5680e174ad1799056b4b6d4a6600ef9f93fe39ad/perfkitbenchmarker/beam_benchmark_helper.py>
> , beam_integration_benchmark.py
> <https://github.com/GoogleCloudPlatform/PerfKitBenchmarker/blob/7cdcea2561c66baa838e3ce4d776236a248e6700/perfkitbenchmarker/linux_benchmarks/beam_integration_benchmark.py>)
> to the Beam repository and inject it into Perfkit every time we use it.
> However, that would require investing time and effort, and it would still
> not solve the problems I listed above. It would also still require Beam
> developers to know how Perfkit works, while we can avoid that and use the
> existing tools (Gradle, Jenkins).
>
> Thanks!
>
> On Fri, Jun 28, 2019 at 17:31 Lukasz Cwik  wrote:
>
>> +1 for removing tests that are not maintained.
>>
>> Are there features in Perfkit that we would like to be using that we
>> aren't?
>> Can we make the integration with Perfkit less brittle?
>>
>> If we aren't getting much and don't plan to get much value in the short
>> term, removal makes sense to me.
>>
>> On Thu, Jun 27, 2019 at 3:16 AM Łukasz Gajowy  wrote:
>>
>>> Hi all,
>>>
>>> moving the discussion to the dev list:
>>> https://github.com/apache/beam/pull/8919. I think that Perfkit
>>> Benchmarker should be removed from all our tests.
>>>
>>> Problems that we face currently:
>>>
>>>1. Changes to Gradle tasks/build configuration in the Beam codebase
>>>have to be reflected in Perfkit code. This requires PRs to Perfkit, which
>>>can take a while, and the tests sometimes break because of this (no change
>>>in Perfkit + a change already in Beam = incompatibility). This is what
>>>happened in PR 8919 (above),
>>>2. It can't run on Python 3 (it depends on Python 2-only libraries such
>>>as functools32),
>>>3. It is black-box testing, which makes it hard to collect
>>>pipeline-related metrics,
>>>4. Measurement of run time is inaccurate,
>>>5. It offers relatively little flexibility in comparison with, e.g.,
>>>Jenkins tasks in terms of setting up the testing infrastructure (runners,
>>>databases). For example, if we'd like to set up a Flink runner and reuse it
>>>in subsequent tests in one go, that would be impossible. We can easily do
>>>this in Jenkins.
>>>
>>> Tests that use Perfkit:
>>>
>>>1.  IO integration tests,
>>>2.  Python performance tests,
>>>3.  beam_PerformanceTests_Dataflow (disabled),
>>>4.  beam_PerformanceTests_Spark (failing constantly - looks not
>>>maintained).
>>>
>>> From the IOIT perspective (1), only the code that sets up/tears down
>>> Kubernetes resources is useful right now, but these parts can easily be
>>> implemented in Jenkins/Gradle code. That would make Perfkit obsolete in
>>> IOIT because we already collect metrics using Metrics API and store them in
>>> BigQuery directly.
>>>
>>> As for point 2: I have no knowledge of how complex the task would be
>>> (help needed).
>>>
>>> Regarding 3, 4: Those tests seem to be not maintained - should we remove
>>> them?
>>>
>>> Opinions?
>>>
>>> Thank you,
>>> Łukasz
>>>
>>>
>>>
>>>
>>>
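[Editorial aside] The message above notes that IOIT already collects metrics via the Metrics API and stores them in BigQuery directly. The general shape of that flow can be sketched as follows; this is a toy illustration only — the class and function names (MetricsCollector, to_bq_rows) are hypothetical and are not Beam's actual Metrics API or IOIT publishing code.

```python
import json
import time

class MetricsCollector:
    """Toy sketch: accumulate named counters during a test run, then
    format them as rows for a direct BigQuery-style insert."""

    def __init__(self, test_name):
        self.test_name = test_name
        self.counters = {}

    def inc(self, name, value=1):
        # Increment a named counter (created on first use).
        self.counters[name] = self.counters.get(name, 0) + value

    def to_bq_rows(self, timestamp=None):
        # One row per metric, tagged with the test name and a timestamp.
        ts = timestamp if timestamp is not None else time.time()
        return [{"test": self.test_name, "metric": k, "value": v, "timestamp": ts}
                for k, v in sorted(self.counters.items())]

collector = MetricsCollector("jdbc_io_it")
collector.inc("records_written", 1000)
collector.inc("records_read", 1000)
print(json.dumps(collector.to_bq_rows(timestamp=0)))
```

The point of the sketch is that nothing in this flow depends on Perfkit: the test itself owns the counters and the row formatting, so the harness around it can be plain Gradle/Jenkins.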


Re: Wiki access?

2019-07-03 Thread Lukasz Cwik
I have added you. Thanks for helping out with the docs.

On Wed, Jul 3, 2019 at 8:22 AM Ryan Skraba  wrote:

> Oof, sorry: ryanskraba
>
> Thanks in advance!  There's a lot of great info in there.
>
> On Wed, Jul 3, 2019 at 5:03 PM Lukasz Cwik  wrote:
>
>> Can you share your login id for cwiki.apache.org?
>>
>> On Wed, Jul 3, 2019 at 7:21 AM Ryan Skraba  wrote:
>>
>>> Hello -- I've been reading through a lot of Beam documentation recently,
>>> and noting minor typos here and there... Is it possible to get Wiki access
>>> to make fixes on the spot?
>>>
>>> Best regards, Ryan
>>>
>>


Re: Wiki access?

2019-07-03 Thread Lukasz Cwik
Can you share your login id for cwiki.apache.org?

On Wed, Jul 3, 2019 at 7:21 AM Ryan Skraba  wrote:

> Hello -- I've been reading through a lot of Beam documentation recently,
> and noting minor typos here and there... Is it possible to get Wiki access
> to make fixes on the spot?
>
> Best regards, Ryan
>


[VOTE] Vendored dependencies release process

2019-07-02 Thread Lukasz Cwik
Please vote based on the vendored dependencies release process as
discussed[1] and documented[2].

Please vote as follows:
+1: Adopt the vendored dependency release process
-1: The vendored release process needs to change because ...

Since many people in the US may be out due to the holiday schedule, I'll
try to close the vote and tally the results on July 9th so please vote
before then.

1:
https://lists.apache.org/thread.html/e2c49a5efaee2ad416b083fbf3b9b6db60fdb04750208bfc34cecaf0@%3Cdev.beam.apache.org%3E
2: https://s.apache.org/beam-release-vendored-artifacts


Re: Change of Behavior - JDBC Set Command

2019-07-02 Thread Lukasz Cwik
I see; in the current PR it seems like we are trying to apply the
command-line value parsing logic of PipelineOptions to all SQL use cases,
since we are exposing the parseOption method to be used in
PipelineOptionsReflectionSetter#setOption.

I should have asked in my earlier e-mail whether we want string-to-value
parsing to match what we do inside of PipelineOptionsFactory. If not,
then PipelineOptionsReflectionSetter#setOption should take an Object type
for the value instead of a String.

On Tue, Jul 2, 2019 at 9:39 AM Anton Kedin  wrote:

> The proposed API assumes you already have a property name and a value
> parsed somehow, and now want to update a field on a pre-existing options
> object with that value, so there is no assumption about parsing being the
> same or not. E.g. if you set a property called `runner` to a string value
> `DirectRunner`, it should behave the same way whether it came from command
> line args, SQL SET command, JDBC connection args, or anywhere else.
>
> That said, we parse SET command differently from command line args [1]. We
> also parse the pipeline options from the connection args [2] that has a
> different syntax as well. I don't know whether we can easily deal with this
> aspect at this point (and whether we should), but if a value can get
> parsed, the idea is that it should work the same way after that.
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/codegen/includes/parserImpls.ftl#L307
> [2]
> https://github.com/apache/beam/blob/b2fd4e392ede19f03a48997252970b8bba8535f1/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/JdbcConnection.java#L82
>
> On Fri, Jun 28, 2019 at 7:57 AM Lukasz Cwik  wrote:
>
>> Do we want SQL argument parsing to always be 1-1 with how command line
>> parsing is being done?
>> Note that this is different from the JSON <-> PipelineOptions conversion.
>>
>> I can see why the wrapper makes sense, just want to make sure that the
>> JDBC SET command aligns with what we are trying to expose.
>>
>> On Thu, Jun 27, 2019 at 9:17 PM Anton Kedin  wrote:
>>
>>> I think we thought about this approach but decided to get rid of the map
>>> representation wherever we can while still supporting setting of the
>>> options by name.
>>>
>>> One of the less important downsides of keeping the map around is that
>>> we will need to do `fromArgs` at least twice.
>>>
>>> Another downside is that we will probably have to keep and maintain two
>>> representations of the pipeline options at the same time and have extra
>>> validations and probably reconciliation logic.
>>>
>>> We need the map representation in the JDBC/command-line use case where
>>> it's the only way for a user to specify the options. A user runs a special
>>> SQL command which goes through normal parsing and execution logic. On top
>>> of that we have a case of mixed Java/SQL pipelines, where we already have
>>> an instance of PipelineOptions and don't need a user to set the options
>>> from within a query. Right now this is impossible for other reasons as
>>> well. But to support both JDBC and Java+SQL use cases we currently pass
>>> both a map and a PipelineOptions instance around. Which makes things
>>> confusing. We can probably reduce passing things around but I think we will
>>> still need to keep both representations.
>>>
>>> Ideally, I think, mixed Java+SQL pipelines should be backed by that same
>>> JDBC logic as much as possible. So potentially we should allow users to set
>>> the pipeline options from within a complicated query even in SqlTransform
>>> in a Java pipeline. However setting an option from within SQL persists it
>>> in the map, but in mixed case we already have the PipelineOption instance
>>> that we got from the SqlTransform. So now we will need to maintain the
>>> logic to reconcile the two representations. That will probably involve
>>> either something similar to the proposed reflection approach, or
>>> serializing both representations to a map or JSON and then reconciling and
>>> then reconstructing it from there. This sounds unnecessary and we can avoid
>>> this if we are able to just set the pipeline options by name in the first
>>> place. In that case we can just use whatever PipelineOptions instance we
>>> have at the moment without extra validation / reconciliation.
>>>
>>> Hope this makes sense.
>>>
>>> Regards,
>>> Anton
>>>
>>> On Thu, Jun 27, 2019 at 4:38 PM Lukasz Cwik  wrote:
>>>

Re: BQ IO GC thrashing when specifying .withMethod(STREAMING_INSERTS)

2019-07-01 Thread Lukasz Cwik
I think the BQ streaming write buffers data into batches and sends them
when used with STREAMING_INSERTS.

Have you been able to ask the user to get a heap dump to see what was using
the majority of memory?

On Mon, Jul 1, 2019 at 2:34 PM Lukasz Cwik  wrote:

> I think the Bq
>
> On Mon, Jul 1, 2019 at 10:54 AM Mikhail Gryzykhin 
> wrote:
>
>> Hello everybody,
>>
>> This question is regarding user post on StackOverflow
>> <https://stackoverflow.com/questions/56823629/gcp-dataflow-running-streaming-inserts-into-bigquery-gc-thrashing>
>> .
>>
>> My understanding of the problem is that
>> setting .withMethod(STREAMING_INSERTS) on the BigQueryIO sink causes GC
>> thrashing on a large number of entries.
>>
>> Is there a known issue or information how to start triaging this?
>>
>> A search on Jira showed me this ticket, but it is not directly connected
>> to the issue: https://issues.apache.org/jira/browse/BEAM-7666
>>
>> Thank you,
>> Mikhail.
>>
>
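[Editorial aside] The buffering behavior mentioned at the top of this thread — streaming inserts being collected into batches before sending — can be sketched roughly as below. This is a toy model under assumed thresholds, not BigQueryIO's actual implementation; the class name and limits are made up.

```python
class BatchingSink:
    """Toy model of streaming-insert batching: buffer rows and flush
    whenever a row-count or byte threshold is reached. Illustrative
    only -- not BigQueryIO's actual implementation."""

    def __init__(self, max_rows=500, max_bytes=1024):
        self.max_rows = max_rows
        self.max_bytes = max_bytes
        self._rows = []
        self._bytes = 0
        self.flushed_batches = []  # stands in for the actual insertAll calls

    def write(self, row):
        self._rows.append(row)
        self._bytes += len(row)
        if len(self._rows) >= self.max_rows or self._bytes >= self.max_bytes:
            self.flush()

    def flush(self):
        # Send the buffered rows as one batch and reset the buffer.
        if self._rows:
            self.flushed_batches.append(list(self._rows))
            self._rows = []
            self._bytes = 0

sink = BatchingSink(max_rows=2)
for row in ['{"a":1}', '{"a":2}', '{"a":3}']:
    sink.write(row)
sink.flush()
print(sink.flushed_batches)  # two batches: the first of 2 rows, then 1
```

In a model like this the buffered rows (and any per-batch bookkeeping) are what show up on the heap, which is why a heap dump, as suggested above, is the natural first triage step.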


Re: BQ IO GC thrashing when specifying .withMethod(STREAMING_INSERTS)

2019-07-01 Thread Lukasz Cwik
I think the Bq

On Mon, Jul 1, 2019 at 10:54 AM Mikhail Gryzykhin  wrote:

> Hello everybody,
>
> This question is regarding user post on StackOverflow
> 
> .
>
> My understanding of the problem is that setting .withMethod(STREAMING_INSERTS)
> on the BigQueryIO sink causes GC thrashing on a large number of entries.
>
> Is there a known issue or information how to start triaging this?
>
> A search on Jira showed me this ticket, but it is not directly connected
> to the issue: https://issues.apache.org/jira/browse/BEAM-7666
>
> Thank you,
> Mikhail.
>


Re: [DISCUSS] Solving timer ordering on immutable bundles

2019-06-28 Thread Lukasz Cwik
Thanks for the explanation.

On Fri, Jun 28, 2019 at 6:49 AM Reuven Lax  wrote:

> This happens when the watermark hops forward. In practice whenever there
> is any backlog, this is the normal mode of operation.
>
> On Fri, Jun 28, 2019, 12:42 AM Lukasz Cwik  wrote:
>
>> Earlier it was said that performance was poor if we moved to a model
>> where we prevented multiple timer firings. Since timer firings are per key,
>> can you provide details of what use case has multiple user timer firings
>> per key?
>>
>> On Thu, Jun 27, 2019 at 4:34 PM Reuven Lax  wrote:
>>
>>> The watermark holds (which is how the timer holds up the watermark
>>> today, as there is no timer watermark) are per key. Usually the input
>>> watermark making a "hop" is not a problem; in fact, it's the normal state of
>>> affairs.
>>>
>>> On Fri, Jun 28, 2019 at 1:08 AM Lukasz Cwik  wrote:
>>>
>>>> Thanks Reuven and Jan.
>>>>
>>>> Since timers are per key, wouldn't it be that the timer watermark
>>>> should also be per key for a StatefulDoFn and hence we would still be able
>>>> to fire multiple timers (at most one per key) and still have good
>>>> performance even when the input watermark makes a "hop"?
>>>>
>>>>
>>>> On Thu, Jun 27, 2019 at 3:43 PM Jan Lukavský  wrote:
>>>>
>>>>> It would be possible to have a "timer watermark" between the input and
>>>>> output watermark, so that input watermark >= timer watermark >= output
>>>>> watermark, but it turns out that doing so implies that we fire timers
>>>>> only for a single instant (because until the timer is fired and
>>>>> processed, the "timer watermark" is on hold).
>>>>> On 6/28/19 12:40 AM, Jan Lukavský wrote:
>>>>>
>>>>> At least the implementation in DirectRunner fires timers according to
>>>>> input watermark. Holding the timer up to the output watermark causes deadlocks,
>>>>> because timers fired at time T might clear watermark hold for the same 
>>>>> time.
>>>>> On 6/27/19 11:55 PM, Reuven Lax wrote:
>>>>>
>>>>> I believe that timers correspond to watermark holds, which hold up the
>>>>> output watermark, not the input watermark.
>>>>>
>>>>> On Thu, Jun 27, 2019 at 11:21 PM Lukasz Cwik  wrote:
>>>>>
>>>>>> I'm confused as to why it is valid to advance the watermark to T3 in
>>>>>> the original scenario.
>>>>>>
>>>>>> T1 and T2 should be treated as inputs to the function and hold the
>>>>>> input watermark hence T1 should fire and if it doesn't produce any new
>>>>>> timers before T2, then T2 should fire since the watermark will now 
>>>>>> advance
>>>>>> to T2. The only time you would have multiple watermark timers fire as 
>>>>>> part
>>>>>> of the same bundle is if they were distinct timers both set to the same
>>>>>> time.
>>>>>>
>>>>>> I have some examples[1] documented in the modelling, scheduling, and
>>>>>> executing timers doc.
>>>>>>
>>>>>> 1:
>>>>>> https://docs.google.com/document/d/1GRL88rKLHbMR0zJnBHYwM4xtj66VYlB112EWVUFcGB0/edit#heading=h.fzptl5h0vi9k
>>>>>>
>>>>>>
>>>>>> On Wed, Jun 26, 2019 at 6:40 AM Reuven Lax  wrote:
>>>>>>
>>>>>>> Earlier than the input watermark only applies to event time timers,
>>>>>>> but the above problem holds for processing time timers as well.
>>>>>>>
>>>>>>> On Wed, Jun 26, 2019, 1:50 PM Robert Bradshaw 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Yeah, it wouldn't be optimal performance-wise, but I think it's good
>>>>>>>> to keep the bar for a correct SDK low. Might still be better than
>>>>>>>> sending one timer per bundle, and you only pay the performance if
>>>>>>>> timers are set earlier than the input watermark (and there was a
>>>>>>>> timer
>>>>>>>> firing in this range). (How often this happens probably varies a lot
>>>>>>>> in practice.)
>>>>>>>>
>>>>>>>> On Wed, Jun 26, 2

Re: Stop using Perfkit Benchmarker tool in all tests?

2019-06-28 Thread Lukasz Cwik
+1 for removing tests that are not maintained.

Are there features in Perfkit that we would like to be using that we aren't?
Can we make the integration with Perfkit less brittle?

If we aren't getting much and don't plan to get much value in the short
term, removal makes sense to me.

On Thu, Jun 27, 2019 at 3:16 AM Łukasz Gajowy  wrote:

> Hi all,
>
> moving the discussion to the dev list:
> https://github.com/apache/beam/pull/8919. I think that Perfkit
> Benchmarker should be removed from all our tests.
>
> Problems that we face currently:
>
>1. Changes to Gradle tasks/build configuration in the Beam codebase
>have to be reflected in Perfkit code. This requires PRs to Perfkit, which
>can take a while, and the tests sometimes break because of this (no change
>in Perfkit + a change already in Beam = incompatibility). This is what
>happened in PR 8919 (above),
>2. It can't run on Python 3 (it depends on Python 2-only libraries such
>as functools32),
>3. It is black-box testing, which makes it hard to collect
>pipeline-related metrics,
>4. Measurement of run time is inaccurate,
>5. It offers relatively little flexibility in comparison with, e.g.,
>Jenkins tasks in terms of setting up the testing infrastructure (runners,
>databases). For example, if we'd like to set up a Flink runner and reuse it
>in subsequent tests in one go, that would be impossible. We can easily do
>this in Jenkins.
>
> Tests that use Perfkit:
>
>1.  IO integration tests,
>2.  Python performance tests,
>3.  beam_PerformanceTests_Dataflow (disabled),
>4.  beam_PerformanceTests_Spark (failing constantly - looks not
>maintained).
>
> From the IOIT perspective (1), only the code that sets up/tears down
> Kubernetes resources is useful right now, but these parts can easily be
> implemented in Jenkins/Gradle code. That would make Perfkit obsolete in
> IOIT because we already collect metrics using Metrics API and store them in
> BigQuery directly.
>
> As for point 2: I have no knowledge of how complex the task would be (help
> needed).
>
> Regarding 3, 4: Those tests seem to be not maintained - should we remove
> them?
>
> Opinions?
>
> Thank you,
> Łukasz
>
>
>
>
>


Re: gRPC method to get a pipeline definition?

2019-06-28 Thread Lukasz Cwik
+dev 

On Fri, Jun 28, 2019 at 8:20 AM Chad Dombrova  wrote:

>
> I think the simplest solution would be to have some kind of override/hook
>> that allows Flink/Spark/... to provide storage. They already have a concept
>> of a job and know how to store them so can we piggyback the Beam pipeline
>> there.
>>
>
> That makes sense to me, since it avoids adding a dependency on a database
> like Mongo, which adds complexity to the deployment.  That said, Beam's
> definition of a job is different from Flink/Spark/etc.  To support this, a
> runner would need to support storing arbitrary metadata, so that the Beam
> Job Service could store a copy of each Beam job there (pipeline, pipeline
> options, etc), either directly as serialized protobuf messages, or by
> converting those to json.  Do you know offhand if Flink and Spark support
> that kind of arbitrary storage?
>
> -chad
>
>
>
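[Editorial aside] The "piggyback" idea discussed above — storing the Beam job definition as opaque metadata alongside the runner's own job record — could look roughly like the following sketch. Here a plain dict stands in for the serialized Pipeline proto and PipelineOptions; all names are hypothetical, and a real implementation would serialize the actual protobuf messages.

```python
import json

def job_to_metadata(job_name, pipeline, options):
    """Pack a Beam job definition into one opaque key/value record that a
    runner could store next to its own job state (illustrative sketch)."""
    payload = json.dumps({"pipeline": pipeline, "options": options},
                         sort_keys=True)
    return {"key": "beam-job:" + job_name, "value": payload}

def metadata_to_job(record):
    """Recover the Beam job definition from the stored record."""
    return json.loads(record["value"])

record = job_to_metadata("wordcount",
                         pipeline={"transforms": ["Read", "Count", "Write"]},
                         options={"runner": "FlinkRunner"})
restored = metadata_to_job(record)
print(restored["options"]["runner"])  # FlinkRunner
```

The design point is that the runner never needs to understand the payload: as long as it can store and return one opaque string per job, the Beam Job Service can round-trip the full definition.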


Re: Change of Behavior - JDBC Set Command

2019-06-28 Thread Lukasz Cwik
Do we want SQL argument parsing to always be 1-1 with how command line
parsing is being done?
Note that this is different from the JSON <-> PipelineOptions conversion.

I can see why the wrapper makes sense, just want to make sure that the JDBC
SET command aligns with what we are trying to expose.

On Thu, Jun 27, 2019 at 9:17 PM Anton Kedin  wrote:

> I think we thought about this approach but decided to get rid of the map
> representation wherever we can while still supporting setting of the
> options by name.
>
> One of the less important downsides of keeping the map around is that we
> will need to do `fromArgs` at least twice.
>
> Another downside is that we will probably have to keep and maintain two
> representations of the pipeline options at the same time and have extra
> validations and probably reconciliation logic.
>
> We need the map representation in the JDBC/command-line use case where
> it's the only way for a user to specify the options. A user runs a special
> SQL command which goes through normal parsing and execution logic. On top
> of that we have a case of mixed Java/SQL pipelines, where we already have
> an instance of PipelineOptions and don't need a user to set the options
> from within a query. Right now this is impossible for other reasons as
> well. But to support both JDBC and Java+SQL use cases we currently pass
> both a map and a PipelineOptions instance around. Which makes things
> confusing. We can probably reduce passing things around but I think we will
> still need to keep both representations.
>
> Ideally, I think, mixed Java+SQL pipelines should be backed by that same
> JDBC logic as much as possible. So potentially we should allow users to set
> the pipeline options from within a complicated query even in SqlTransform
> in a Java pipeline. However setting an option from within SQL persists it
> in the map, but in mixed case we already have the PipelineOption instance
> that we got from the SqlTransform. So now we will need to maintain the
> logic to reconcile the two representations. That will probably involve
> either something similar to the proposed reflection approach, or
> serializing both representations to a map or JSON and then reconciling and
> then reconstructing it from there. This sounds unnecessary and we can avoid
> this if we are able to just set the pipeline options by name in the first
> place. In that case we can just use whatever PipelineOptions instance we
> have at the moment without extra validation / reconciliation.
>
> Hope this makes sense.
>
> Regards,
> Anton
>
> On Thu, Jun 27, 2019 at 4:38 PM Lukasz Cwik  wrote:
>
>> Not sure; based upon the JIRA description it seems like you want early
>> validation of PipelineOptions. Couldn't you maintain the map of pipeline
>> options and, every time one is added, call PipelineOptionsFactory.fromArgs,
>> discarding the result just for the error checking?
>>
>> On Tue, Jun 25, 2019 at 10:12 AM Alireza Samadian 
>> wrote:
>>
>>> Not sure. One solution might be moving the
>>> PipelineOptionsReflectionSetter class to the SQL package and making it
>>> package-private. This would prevent the exposure, but the downside would be
>>> that I need to make PipelineOptionsFactory.parseObjects() public or
>>> duplicate its code. Do you think this approach might be better? I would
>>> also appreciate it if you have another suggestion to solve this.
>>>
>>> Best,
>>> Alireza
>>>
>>> On Tue, Jun 25, 2019 at 8:40 AM Lukasz Cwik  wrote:
>>>
>>>> That makes sense. I took a look at your PR, is there a way to do it
>>>> without exposing the reflection capabilities to pipeline authors?
>>>>
>>>> On Mon, Jun 24, 2019 at 2:20 PM Alireza Samadian 
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am writing to ask if it is OK to slightly change the behaviour of the
>>>>> SET command in the JDBC connection of Beam SQL. Currently, if we use the
>>>>> SET command for an option that does not exist, or set an option to an
>>>>> illegal value, no error is shown until we run a query. This means one can
>>>>> potentially set an option incorrectly, then reset it correctly, and run a
>>>>> query without getting any error. However, I want to make some changes in
>>>>> the JDBC driver that change this behavior. After this change, someone who
>>>>> uses the SET command for a wrong pipeline option (in the JDBC path) will
>>>>> immediately see an error message.
>>>>>
>>>>> T
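[Editorial aside] The eager-validation approach discussed earlier in this thread — re-parsing the accumulated option map on every SET and discarding the parsed result — can be sketched as below. This is a language-agnostic Python illustration: the option registry is entirely made up, and Beam's actual Java code would validate via PipelineOptionsFactory instead.

```python
# Sketch of eager SET validation: the raw option map stays the source of
# truth; every SET re-parses the whole map and discards the parsed result,
# keeping only the error checking. KNOWN_OPTIONS is a made-up registry.

KNOWN_OPTIONS = {"runner": str, "streaming": bool}

def parse_options(options):
    parsed = {}
    for name, value in options.items():
        if name not in KNOWN_OPTIONS:
            raise ValueError("Unknown pipeline option: " + name)
        typ = KNOWN_OPTIONS[name]
        if typ is bool:
            if value not in ("true", "false"):
                raise ValueError("Illegal value for %s: %s" % (name, value))
            parsed[name] = (value == "true")
        else:
            parsed[name] = typ(value)
    return parsed

class SqlSession:
    def __init__(self):
        self._options = {}

    def set_option(self, name, value):
        # Validate eagerly: parse the candidate map, discard the result.
        candidate = dict(self._options)
        candidate[name] = value
        parse_options(candidate)
        self._options[name] = value
```

With this shape, `SET bogus = 'x'` fails immediately at SET time rather than at query time, which is the behavior change proposed above.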

Re: [DISCUSS] Solving timer ordering on immutable bundles

2019-06-27 Thread Lukasz Cwik
Earlier it was said that performance was poor if we moved to a model where
we prevented multiple timer firings. Since timer firings are per key, can
you provide details of what use case has multiple user timer firings per
key?

On Thu, Jun 27, 2019 at 4:34 PM Reuven Lax  wrote:

> The watermark holds (which is how the timer holds up the watermark today,
> as there is no timer watermark) are per key. Usually the input watermark
> making a "hop" is not a problem; in fact, it's the normal state of affairs.
>
> On Fri, Jun 28, 2019 at 1:08 AM Lukasz Cwik  wrote:
>
>> Thanks Reuven and Jan.
>>
>> Since timers are per key, wouldn't it be that the timer watermark should
>> also be per key for a StatefulDoFn and hence we would still be able to fire
>> multiple timers (at most one per key) and still have good performance even
>> when the input watermark makes a "hop"?
>>
>>
>> On Thu, Jun 27, 2019 at 3:43 PM Jan Lukavský  wrote:
>>
>>> It would be possible to have a "timer watermark" between the input and output
>>> watermark, so that input watermark >= timer watermark >= output watermark,
>>> but it turns out that doing so implies that we fire timers only for a single
>>> instant (because until the timer is fired and processed, the "timer
>>> watermark" is on hold).
>>> On 6/28/19 12:40 AM, Jan Lukavský wrote:
>>>
>>> At least the implementation in DirectRunner fires timers according to
>>> input watermark. Holding the timer up to the output watermark causes deadlocks,
>>> because timers fired at time T might clear watermark hold for the same time.
>>> On 6/27/19 11:55 PM, Reuven Lax wrote:
>>>
>>> I believe that timers correspond to watermark holds, which hold up the
>>> output watermark, not the input watermark.
>>>
>>> On Thu, Jun 27, 2019 at 11:21 PM Lukasz Cwik  wrote:
>>>
>>>> I'm confused as to why it is valid to advance the watermark to T3 in
>>>> the original scenario.
>>>>
>>>> T1 and T2 should be treated as inputs to the function and hold the
>>>> input watermark hence T1 should fire and if it doesn't produce any new
>>>> timers before T2, then T2 should fire since the watermark will now advance
>>>> to T2. The only time you would have multiple watermark timers fire as part
>>>> of the same bundle is if they were distinct timers both set to the same
>>>> time.
>>>>
>>>> I have some examples[1] documented in the modelling, scheduling, and
>>>> executing timers doc.
>>>>
>>>> 1:
>>>> https://docs.google.com/document/d/1GRL88rKLHbMR0zJnBHYwM4xtj66VYlB112EWVUFcGB0/edit#heading=h.fzptl5h0vi9k
>>>>
>>>>
>>>> On Wed, Jun 26, 2019 at 6:40 AM Reuven Lax  wrote:
>>>>
>>>>> Earlier than the input watermark only applies to event time timers,
>>>>> but the above problem holds for processing time timers as well.
>>>>>
>>>>> On Wed, Jun 26, 2019, 1:50 PM Robert Bradshaw 
>>>>> wrote:
>>>>>
>>>>>> Yeah, it wouldn't be optimal performance-wise, but I think it's good
>>>>>> to keep the bar for a correct SDK low. Might still be better than
>>>>>> sending one timer per bundle, and you only pay the performance if
>>>>>> timers are set earlier than the input watermark (and there was a timer
>>>>>> firing in this range). (How often this happens probably varies a lot
>>>>>> in practice.)
>>>>>>
>>>>>> On Wed, Jun 26, 2019 at 2:33 PM Reuven Lax  wrote:
>>>>>> >
>>>>>> > This would have a lot of performance problems (especially since
>>>>>> there is user code that caches within a bundle, and invalidates the cache
>>>>>> at the end of every bundle). However this would be a valid "lazy"
>>>>>> implementation.
>>>>>> >
>>>>>> > On Wed, Jun 26, 2019 at 2:29 PM Robert Bradshaw <
>>>>>> rober...@google.com> wrote:
>>>>>> >>
>>>>>> >> Note also that a "lazy" SDK implementation would be to simply
>>>>>> return
>>>>>> >> all the timers (as if they were new timers) to runner once a timer
>>>>>> set
>>>>>> >> (before or at the last requested timer in the bundle) is
>>>>>> encountered.
>>>

Re: [DISCUSS] Releasing Vendored Artifacts

2019-06-27 Thread Lukasz Cwik
Thanks Ismael for the feedback on the doc. If there isn't any additional
feedback, I will start a process vote on the release procedure of vendored
artifacts on Tuesday.

On Tue, Jun 25, 2019 at 10:24 AM Lukasz Cwik  wrote:

> Ismael mentioned[1] that there is confusion about how to release and
> validate vendored artifacts.
>
> I have created this doc[2] and could use guidance from the community to
> validate its contents. Feel free to comment on the doc or this thread. Note
> that I used our release guide[3] as a basis for drafting it and removed all
> the sections that didn't seem relevant.
>
> I'll put the doc to a vote after a few days of feedback to accept it as a
> process change and have it incorporated into the website.
>
> 1:
> https://lists.apache.org/thread.html/bed6971afb751b2facf9c48bad76ce025281b1d3554813e35c0b4c52@%3Cdev.beam.apache.org%3E
> 2: https://s.apache.org/beam-release-vendored-artifacts
> 3: https://beam.apache.org/contribute/release-guide/
>


Re: DirectRunner timers are not strictly time ordered

2019-06-27 Thread Lukasz Cwik
Jan, I do believe that BEAM-2535 is related since the input time holds the
input watermark and will allow people to set timers which will fire in the
order that they want. This would allow users to say fire at X but I will
only create a new timer at X+Y which would allow the input watermark to
advance to X+Y allowing multiple timers to fire.

On Thu, Jun 20, 2019 at 1:45 PM Jan Lukavský  wrote:

> I'm not sure if I understand it in detail, but I'd say that it is related.
> Both issues probably have a common cause - that is, in the scenario where
> timers are fired in a bundle, the issue is that when the watermark moves from
> time T0 to T3 (from the last example), there is actually _no time between
> T0 and T3_, because all times in between are effectively "collapsed" into a
> single instant. The "timer watermark" specified in [BEAM-2535] is something
> I also experimented with at first (in the DirectRunner), but it then turned
> out that it is equivalent to firing only the timers for the lowest timestamp.
> On 6/20/19 9:52 PM, Reuven Lax wrote:
>
> I think BEAM-2535 is independent.
>
> On Thu, Jun 20, 2019 at 9:47 PM Lukasz Cwik  wrote:
>
>> Does BEAM-2535 provide more context?
>>
>> On Thu, Jun 20, 2019 at 12:44 PM Reuven Lax  wrote:
>>
>>>
>>>
>>> On Thu, Jun 20, 2019 at 9:35 PM Jan Lukavský  wrote:
>>>
>>>>
>>>> On 6/20/19 9:30 PM, Reuven Lax wrote:
>>>>
>>>>
>>>>
>>>> On Thu, Jun 20, 2019 at 8:54 PM Jan Lukavský  wrote:
>>>>
>>>>> > But that is exactly how time advances. Watermarks often don't move
>>>>> smoothly, as a single old element can hold up the watermark. When that
>>>>> element is finished, the watermark can jump forward in time, triggering
>>>>> many timers.
>>>>>
>>>>> Sure. Absolutely agree. But the move from time T1 to T2 can be viewed
>>>>> as a discrete jump or a smooth move, so that when you fire a timer, any
>>>>> internal timings are set to the actual timestamp of the timer. I believe
>>>>> that is how Flink works. And this might be related to the fact that
>>>>> Flink lacks the concept of bundles.
>>>>>
>>>>> > I'm not sure how this breaks that invariant. The input watermark has
>>>>> only moved forward, as should be true of the output watermark. The output
>>>>> watermark is held up by watermark holds in the step, which usually means
>>>>> that the output watermark is already being held to the earliest pending
>>>>> timer.
>>>>>
>>>>> The problem was stated at the beginning of this thread. I can restate
>>>>> it:
>>>>>
>>>>> - let's have four times - T0 < T1 < T2 < T3
>>>>>
>>>>> - let's have a two timers A and B, set for time T1 and T3, respectively
>>>>>
>>>>> - watermark moves time from T0 to T3
>>>>>
>>>>> - that move fires both timers A and B (in this order), *but* timer A
>>>>> is free to set more timers, let's suppose it sets timer for T2
>>>>>
>>>>> - the second instance of timer A (set for T2) will fire *after* timer
>>>>> B (set for T3), breaking time invariant
>>>>>
>>>> Ah, by time invariant you mean the in-order firing of timers?
>>>>
>>>> Yes, sorry, I meant "time monotonicity invariant with relation to
>>>> timers". Basically that timers should be fired in timestamp order, because
>>>> otherwise it might cause unpredictable results.
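Jan's four-time scenario can be reproduced with a small simulation (hypothetical code, not Beam's actual implementation) of a runner that snapshots all due timers into one bundle and defers any newly set timers to the next bundle:

```python
# Hypothetical simulation (not Beam code) of bundle-based timer delivery:
# all timers due at the new watermark are snapshotted into one bundle, and
# timers set while processing that bundle only run in the *next* bundle --
# which lets them fire out of timestamp order.

import heapq

def run(timers, watermark, on_fire):
    """timers: heap of (timestamp, name); returns the firing order."""
    fired = []
    while True:
        # Snapshot everything due at the watermark into a bundle.
        bundle = []
        while timers and timers[0][0] <= watermark:
            bundle.append(heapq.heappop(timers))
        if not bundle:
            return fired
        for ts, name in bundle:  # in timestamp order *within* the bundle
            fired.append((ts, name))
            for new_timer in on_fire(name):
                heapq.heappush(timers, new_timer)  # lands in the next bundle

T0, T1, T2, T3 = 0, 1, 2, 3
timers = [(T1, "A"), (T3, "B")]
heapq.heapify(timers)

# Timer A sets a new timer for T2 when it fires; the watermark jumps to T3.
order = run(timers, watermark=T3,
            on_fire=lambda name: [(T2, "A2")] if name == "A" else [])
print(order)  # [(1, 'A'), (3, 'B'), (2, 'A2')] -- the T2 timer fires after B
```

The out-of-order tail of the result is exactly the broken invariant being discussed: the timer set for T2 fires after the timer set for T3.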
>>>>
>>>
>>> I think there were similar issues with resetting timers: if you reset a
>>> timer to a different timestamp while a firing of that timer is already in
>>> the bundle at the old timestamp, I believe that either choice (modify the
>>> bundle or allow the timer to fire) can lead to consistency problems. Kenn
>>> might remember the details here.
>>>
>>>
>>>
>>>>
>>>>
>>>>> Jan
>>>>> On 6/20/19 8:43 PM, Reuven Lax wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jun 20, 2019 at 8:03 PM Jan Lukavský  wrote:
>>>>>
>>>>>> Hi Reuven,
>>>>>>
>>>>>> > I would be cautious changing this. Being able to put multiple
>>>>>> timers in the same bundle saves a lot, a

Re: [DISCUSS] Solving timer ordering on immutable bundles

2019-06-27 Thread Lukasz Cwik
I'm confused as to why it is valid to advance the watermark to T3 in the
original scenario.

T1 and T2 should be treated as inputs to the function and hold the input
watermark; hence T1 should fire, and if it doesn't produce any new timers
before T2, then T2 should fire, since the watermark will now advance to T2.
The only time you would have multiple watermark timers fire as part of the
same bundle is if they were distinct timers both set to the same time.

I have some examples[1] documented in the modelling, scheduling, and
executing timers doc.

1:
https://docs.google.com/document/d/1GRL88rKLHbMR0zJnBHYwM4xtj66VYlB112EWVUFcGB0/edit#heading=h.fzptl5h0vi9k


On Wed, Jun 26, 2019 at 6:40 AM Reuven Lax  wrote:

> "Earlier than the input watermark" only applies to event-time timers, but
> the above problem holds for processing-time timers as well.
>
> On Wed, Jun 26, 2019, 1:50 PM Robert Bradshaw  wrote:
>
>> Yeah, it wouldn't be optimal performance-wise, but I think it's good
>> to keep the bar for a correct SDK low. Might still be better than
>> sending one timer per bundle, and you only pay the performance cost if
>> timers are set earlier than the input watermark (and there was a timer
>> firing in this range). (How often this happens probably varies a lot
>> in practice.)
>>
>> On Wed, Jun 26, 2019 at 2:33 PM Reuven Lax  wrote:
>> >
>> > This would have a lot of performance problems (especially since there
>> is user code that caches within a bundle, and invalidates the cache at the
>> end of every bundle). However this would be a valid "lazy" implementation.
>> >
>> > On Wed, Jun 26, 2019 at 2:29 PM Robert Bradshaw 
>> wrote:
>> >>
>> >> Note also that a "lazy" SDK implementation would be to simply return
>> >> all the timers (as if they were new timers) to runner once a timer set
>> >> (before or at the last requested timer in the bundle) is encountered.
>> >> E.g. Suppose we had timers T1, T3, T5 in the bundle. On firing T1, we
>> >> set T2 and delete T3. The SDK could then claim that timers were
>> >> (again) set at T3, T5, then set one at T2 and deleted at T3, and
>> >> then be done with the bundle (not actually process T3 and T5). (One
>> >> way to think about this is that timers are actually bundle splits into
>> >> a bundle of "done" and "future" work.) A more intelligent SDK could,
>> >> of course, process the whole bundle by tracking modifications to the
>> >> to-be-fired timers itself rather than requiring a trip through the
>> >> runner.
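This "lazy" strategy can be sketched as follows (structure and names are illustrative, not the Beam SDK API): the SDK stops at the first timer set at or before the last requested timer in the bundle and returns the unprocessed remainder to the runner as if those timers were newly set.

```python
# A sketch (illustrative names, not the Beam SDK API) of the "lazy"
# strategy: bail out as soon as user code sets a timer at or before the
# last timer in the bundle, and hand the unprocessed remainder back to
# the runner as newly set timers.

def process_bundle_lazy(bundle, on_fire):
    """bundle: sorted list of (timestamp, name). Returns (fired, returned)."""
    fired, returned = [], []
    last_ts = bundle[-1][0]
    for i, (ts, name) in enumerate(bundle):
        fired.append((ts, name))
        sets, deletes = on_fire(name)  # user timer callback
        if any(t <= last_ts for t, _ in sets) or deletes:
            # Bail out: return the remaining timers (minus deletions, plus
            # new sets) so the runner redelivers them in a later bundle.
            rest = [t for t in bundle[i + 1:] if t[1] not in deletes]
            returned = sorted(rest + sets)
            break
    return fired, returned

bundle = [(1, "T1"), (3, "T3"), (5, "T5")]
# Firing T1 sets T2 and deletes T3, per the example in the thread.
fired, returned = process_bundle_lazy(
    bundle, lambda n: ([(2, "T2")], {"T3"}) if n == "T1" else ([], set()))
print(fired)     # [(1, 'T1')]
print(returned)  # [(2, 'T2'), (5, 'T5')] -- redelivered later by the runner
```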
>> >>
>> >> On Wed, Jun 26, 2019 at 1:51 PM Reuven Lax  wrote:
>> >> >
>> >> > I like this option the best. It might be trickier to implement, but
>> seems like it would be the most consistent solution.
>> >> >
>> >> > Another problem it would solve is the following: let's say a bundle
>> arrives containing timers T1 and T2, and while processing T1 the user code
>> deletes T2 (or resets it to a time in the far future). I'm actually not
>> sure what we do today, but I'm a bit afraid that we will go ahead and fire
>> T2 since it's already in the bundle, which is clearly incorrect. The SDK
>> needs to keep track of this and skip T2 in order to solve this, which is
>> the same sort of work needed to implement Robert's suggestion.
>> >> >
>> >> > Reuven
>> >> >
>> >> > On Wed, Jun 26, 2019 at 12:28 PM Robert Bradshaw <
>> rober...@google.com> wrote:
>> >> >>
>> >> >> Another option, that is nice from an API perspective but places a
>> >> >> burden on SDK implementers (and possibly runners), is to maintain
>> the
>> >> >> ordering of timers by requiring timers to be fired in order, and if
>> >> >> any timers are set to fire them immediately before processing later
>> >> >> timers. In other words, if T1 sets T2 and modifies T3, these would
>> >> >> take effect (locally, the runner may not even know about T2) before
>> T3
>> >> >> was processed.
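A sketch of this eager, SDK-side ordering (again hypothetical, not Beam code): timers set during bundle processing are pushed into the in-bundle queue so they take effect before later timers, and deleted timers are skipped.

```python
# Hypothetical sketch of the eager, SDK-side ordering: locally set timers
# join the in-bundle priority queue so they fire before later bundle
# timers, and deletions are honoured locally.

import heapq

def process_bundle_ordered(bundle, on_fire):
    heap = list(bundle)
    heapq.heapify(heap)
    deleted, fired = set(), []
    while heap:
        ts, name = heapq.heappop(heap)
        if name in deleted:
            continue  # e.g. T3 was deleted by an earlier firing
        fired.append((ts, name))
        sets, deletes = on_fire(name)
        deleted |= deletes
        for timer in sets:
            heapq.heappush(heap, timer)  # fires before later bundle timers
    return fired

bundle = [(1, "T1"), (3, "T3")]
# T1 sets T2 and deletes T3, per Robert's example.
fired = process_bundle_ordered(
    bundle, lambda n: ([(2, "T2")], {"T3"}) if n == "T1" else ([], set()))
print(fired)  # [(1, 'T1'), (2, 'T2')] -- T2 fired locally, T3 skipped
```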
>> >> >>
>> >> >> On Wed, Jun 26, 2019 at 11:13 AM Jan Lukavský 
>> wrote:
>> >> >> >
>> >> >> > Hi,
>> >> >> >
>> >> >> > I have mentioned an issue I have come across [1] on several other
>> >> >> > threads, but it probably didn't attract the attention that it
>> would desire.
>> >> >> >
>> >> >> > I will try to restate the problem here for clarity:
>> >> >> >
>> >> >> >   - on runners that use the concept of bundles (the original issue
>> mentions
>> >> >> > DirectRunner, but it will probably apply for other runners, which
>> use
>> >> >> > bundles, as well), the workflow is as follows:
>> >> >> >
>> >> >> >a) process elements in bundle
>> >> >> >
>> >> >> >b) advance watermark
>> >> >> >
>> >> >> >c) process timers
>> >> >> >
>> >> >> >d) continue to next bundle
>> >> >> >
>> >> >> >   - the issue with this is that when we are initially at time T0,
>> set
>> >> >> > two timers for T1 and T3, then advance watermark to T3 (or
>> beyond), the
>> >> >> > timers will fire (correctly) in order T1, T3, but if timer at T1
>> sets
>> >> >> > another timer for T2, then this timer will be fired in the next
>> bundle (and
>> >> >> 

Re: Spotless exclusions

2019-06-26 Thread Lukasz Cwik
On Wed, Jun 26, 2019 at 4:22 PM Anton Kedin  wrote:

> Currently our spotless is configured globally [1] (for java at least) to
> include all source files by '**/*.java'. And then we exclude things
> explicitly. Don't know why, but these exclusions are ignored for me
> sometimes, for example `./gradlew :sdks:java:core:spotlessJavaCheck` always
> fails when checking the generated files under
> `.../build/generated-src/antlr/main/org/apache/beam/sdk/schemas/parser/generated`.
>
> Few questions:
>  * can someone point me to a discussion or a jira about this behavior?
>

BEAM-6399 and BEAM-7366 allude to something wonky going on.


>  * do we actually have a use case of checking the source files that are
> not under 'src'?
>

No


>  * if not, can we switch the config to only check for sources under 'src'
> [2]?
>

Yes


>  * alternatively, would it make sense to introduce project-specific
> overrides?
>

All src should be under src/ so it is unlikely to be useful.


>
> [1]
> https://github.com/apache/beam/blob/af9362168606df9ec11319fe706b72466413798c/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L819
> [2] https://github.com/apache/beam/pull/8954
>


[DISCUSS] Releasing Vendored Artifacts

2019-06-25 Thread Lukasz Cwik
Ismael mentioned[1] that there is confusion about how to release and
validate vendored artifacts.

I have created this doc[2] and could use guidance from the community to
validate its contents. Feel free to comment on the doc or this thread. Note
that I used our release guide[3] as a basis for drafting it and removed all
the sections that didn't seem relevant.

I'll put the doc to a vote after a few days of feedback to accept it as a
process change and have it incorporated into the website.

1:
https://lists.apache.org/thread.html/bed6971afb751b2facf9c48bad76ce025281b1d3554813e35c0b4c52@%3Cdev.beam.apache.org%3E
2: https://s.apache.org/beam-release-vendored-artifacts
3: https://beam.apache.org/contribute/release-guide/


Re: [VOTE] Release vendored artifacts upgrading Guava usage to 26.0-jre, release candidate #1

2019-06-25 Thread Lukasz Cwik
Thanks Ismael. I'll cancel this vote for now with the intent that it is
re-opened in the future once Ismael's points are addressed.

I have opened a PR[1] which allows us to build the updated vendored
artifacts without migrating our usage to them.
I have started a doc[2] to gather people's notes on producing a release
guide for vendored dependencies.

1: https://github.com/apache/beam/pull/8942
2:
https://docs.google.com/document/d/1ztEoyGkqq9ie5riQxRtMuBu3vb6BUO91mSMn1PU0pDA/edit?usp=sharing

On Mon, Jun 24, 2019 at 1:04 PM Ismaël Mejía  wrote:

> The upgrade is worth it for security reasons and to catch the gRPC
> improvements so really nice that this is happening.
>
> However I am not clear if we are mixing two things here. (1) The
> release of the vendored versions and (2) the upgrade of it into Beam.
> I think those should be separate (obviously knowing that the latter is
> more complex). My point is the vote should be ONLY about releasing the
> vendored dependencies and we should not mix it with what otherwise
> would be a JIRA issue for the dependency upgrade. (This was the case
> in the previous release of the vendored deps)
>
> An Apache release is usually validated from a .tar.gz with the source
> code checkout of the associated commit that should produce the
> artifacts, and I don't see any source code staged for validation in
> the links.
>
> All these points come from the fact that we have not documented the
> process of verification and in general of release of the vendored
> dependencies, so it is probably worth doing this and adding it to the
> release guide [1] (or as an independent document) so we can do the
> validation eagerly.
>
> [1] https://beam.apache.org/contribute/release-guide/
>
> On Mon, Jun 24, 2019 at 6:02 PM Lukasz Cwik  wrote:
> >
> > Pinging for PMC to validate & vote.
> >
> > On Thu, Jun 20, 2019 at 3:52 PM Ahmet Altay  wrote:
> >>
> >> +1 verified signatures and hashes.
> >>
> >> Thank you Luke.
> >>
> >> On Thu, Jun 20, 2019 at 12:27 PM Lukasz Cwik  wrote:
> >>>
> >>> We should verify the signatures of the artifacts.
> >>>
> >>> Otherwise, there is little risk in releasing these artifacts because
> no one consumes them yet. PR/8899[1] updates Apache Beam to start using
> them and will go through the regular precommit/postcommit tests.
> >>>
> >>> If you want to perform additional validation you can:
> >>> * clone the PR and run any tests that you may want after fetching the
> artifacts and placing them in your local maven repo
> >>> * download the artifacts and manually validate the classes only appear
> in the org.apache.beam.vendor namespace with the appropriate package
> prefix. Note that there is a unit test that does this as part of the
> publishing process[2].
> >>>
> >>> This thread[3] is an example of previous release of vendored artifacts.
> >>>
> >>> 1: https://github.com/apache/beam/pull/8899
> >>> 2:
> https://github.com/apache/beam/blob/c775eda2df6457a784a1945d16cf781abb453d5f/buildSrc/src/main/groovy/org/apache/beam/gradle/VendorJavaPlugin.groovy#L127
> >>> 3:
> https://lists.apache.org/thread.html/9efb2aeab102e41367bf6b1f274d3ee5990024afd934392a339c4d00@%3Cdev.beam.apache.org%3E
> >>>
> >>> On Thu, Jun 20, 2019 at 11:20 AM Ahmet Altay  wrote:
> >>>>
> >>>> What is the best way to validate this?
> >>>>
> >>>> On Thu, Jun 20, 2019 at 9:51 AM Lukasz Cwik  wrote:
> >>>>>
> >>>>> Hi everyone,
> >>>>>
> >>>>> Please review the release of the following artifacts that we vendor:
> >>>>> beam-vendor-guava-26_0-jre
> >>>>> beam-vendor-grpc-1_21_0
> >>>>>
> >>>>> Please vote as follows:
> >>>>> [ ] +1, Approve the release
> >>>>> [ ] -1, Do not approve the release (please provide specific comments)
> >>>>>
> >>>>> The complete staging area is available for your review, which
> includes:
> >>>>> * all artifacts to be deployed to the Maven Central Repository [1],
> >>>>> * commit hash "996b4c3733545aaa3b93fd35296a391126026a1c" [2],
> >>>>> * which is signed with the key with fingerprint
> EAD5DE293F4A03DD2E77565589E68A56E371CCA2 [3],
> >>>>>
> >>>>> The vote will be open for at least 72 hours. It is adopted by
> majority approval, with at least 3 PMC affirmative votes.
> >>>>>
> >>>>> Note I have no intention to get this into the current 2.14 release
> that is being worked on and will have the version update go out with the
> next release.
> >>>>>
> >>>>> Thanks,
> >>>>> Luke
> >>>>>
> >>>>> [1]
> https://repository.apache.org/content/repositories/orgapachebeam-1074/
> >>>>> [2] https://github.com/apache/beam/pull/8899
> >>>>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> >>>>>
>


Re: Change of Behavior - JDBC Set Command

2019-06-25 Thread Lukasz Cwik
That makes sense. I took a look at your PR, is there a way to do it without
exposing the reflection capabilities to pipeline authors?

On Mon, Jun 24, 2019 at 2:20 PM Alireza Samadian 
wrote:

> Hi all,
>
> I am writing to ask if it is OK to slightly change the behaviour of SET
> command in JDBC connection of Beam SQL. Currently, if we try to use set
> command for an option that does not exist or setting an option to an
> illegal value, it does not show any error until we run a query. This means
> one can potentially set it incorrectly, then reset it correctly, and run a
> query without getting any error. However, I want to make some changes in the
> JDBC driver that change this behavior. After this change, if someone tries
> to use the SET command for a wrong pipeline option (in the JDBC path), they
> will immediately see an error message.
>
> The reason for this change is because I am working on the Jira issue
> https://issues.apache.org/jira/projects/BEAM/issues/BEAM-7590, and I am
> removing the Pipeline Option Map representation and keep the actual
> pipeline options instead. As a result, each time that the set command is
> called, it will try to change the pipeline options instance using
> reflection instead of changing a map, and later constructing pipeline
> options from it.
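The eager-validation idea can be illustrated with a toy sketch (in Python for brevity; the actual change is in the Java JDBC driver, and all names here are made up for the example): an unknown option or an illegal value fails at SET time instead of at query time.

```python
# Toy illustration of eager SET validation. All names are hypothetical.

class FakeOptions:
    def __init__(self):
        self.runner = "DirectRunner"
        self.parallelism = 1

def set_option(options, name, value):
    # "Reflection": look the attribute up on the live options object and
    # coerce the string value to the existing attribute's type, failing fast.
    if not hasattr(options, name):
        raise ValueError(f"Unknown pipeline option: {name!r}")
    current = getattr(options, name)
    try:
        setattr(options, name, type(current)(value))
    except ValueError:
        raise ValueError(f"Illegal value {value!r} for option {name!r}")

opts = FakeOptions()
set_option(opts, "parallelism", "4")      # coerced to int, succeeds
print(opts.parallelism)                   # 4
try:
    set_option(opts, "paralellism", "4")  # typo: fails at SET time
except ValueError as e:
    print(e)                              # Unknown pipeline option: 'paralellism'
```

Validating against the live options object at SET time is what removes the window where an incorrect SET silently succeeds until a query runs.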
>
> The following is a link to the pull request:
> https://github.com/apache/beam/pull/8928
>
> Best,
> Alireza Samadian
>


Re: PTransform.expand() guarantees

2019-06-21 Thread Lukasz Cwik
On Fri, Jun 21, 2019 at 10:01 AM Alexey Romanenko 
wrote:

> Thank you for answers, Lukasz.
>
> On 21 Jun 2019, at 18:15, Lukasz Cwik  wrote:
>
>> Does Beam guarantee where (at “driver” or at "worker” of backend system) "
>> *PTransform.expand()*” of provided transform will be called?
>>
> No. There are usecases where the driver is run in the "cloud" such as
> template generation and also during cross language pipeline expansion. At
> some point in time when I was investigating loops within Beam, one possible
> solution would have been to call expand() in the "worker" whenever a new
> loop iteration needed to be generated.
>
>
> Offtopic: what was a resolution about loops support? Is there any public
> doc about that to read?
>

https://issues.apache.org/jira/browse/BEAM-106 is the feature request.
Support for iteration got to the point where I had a working demo in
Dataflow batch but couldn't get it working in streaming due to lack of
time. The conceptual idea from the model was to treat it as an infinitely
unrollable loop which could be done dynamically during pipeline execution.
This seemed much easier than updating the entire system to support
multidimensional watermarks. It also seemed difficult to get support in
other runners since they couldn't dynamically modify the execution graph at
runtime or didn't support multidimensional watermarks.


> Does Beam guarantee how many times it could be happened?
>>
> It should happen once per transform instance but why is it important?
>
>
> This question was triggered by an issue in KinesisIO [1] where we did a
> quite expensive check that the stream exists in Write.expand() (calling the
> Kinesis backend for that) and, according to the issue reporter, it was called several
> times, though I always believed that it should be called only once. So,
> that puzzled me and I came to these questions.
>
> [1] https://issues.apache.org/jira/browse/BEAM-7357
>

Interesting, did you print a stack trace to figure out all the call site
locations?
It could be something simple where we are calling expand multiple times
needlessly and only using the result once.
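One quick way to gather those call sites (a sketch; the wrapper below is generic, but the same idea works by editing the suspect transform's expand() locally) is to record a stack trace on every call:

```python
# Record an abbreviated stack trace on every expand() call to find out
# who is calling it and how often. Generic sketch, not Beam API.

import traceback

def log_expand_calls(expand_fn):
    calls = []
    def wrapper(*args, **kwargs):
        # Record the call site (drop this wrapper's own frame).
        calls.append("".join(traceback.format_stack()[:-1]))
        return expand_fn(*args, **kwargs)
    wrapper.calls = calls
    return wrapper

@log_expand_calls
def expand(pcollection):        # stand-in for PTransform.expand
    return pcollection

expand("input")
expand("input")
print(len(expand.calls))        # 2: one recorded stack trace per call
```

Inspecting the recorded traces shows whether the extra calls come from one code path invoked repeatedly or from several distinct call sites.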


Re: PTransform.expand() guarantees

2019-06-21 Thread Lukasz Cwik
On Fri, Jun 21, 2019 at 9:07 AM Alexey Romanenko 
wrote:

> Hello,
>
> I tried to find an answer in documentation for the questions below but I
> haven’t managed to do that. Actually, there are 3 related questions:
>
> Does Beam guarantee where (at “driver” or at "worker” of backend system) "
> *PTransform.expand()*” of provided transform will be called?
>
No. There are usecases where the driver is run in the "cloud" such as
template generation and also during cross language pipeline expansion. At
some point in time when I was investigating loops within Beam, one possible
solution would have been to call expand() in the "worker" whenever a new
loop iteration needed to be generated.


> Does Beam guarantee how many times it could be happened?
>
It should happen once per transform instance but why is it important?


> Does it depend on runner implementation or anything else?
>
It should not, but historically in some places this has happened. Some
transforms were written with logic like "am I a streaming pipeline?" or "am
I running using Dataflow? Then do X." We have tried to prevent this from
happening and cleaned up places where we noticed it, as this makes
pipelines hard to port across runners.

We have always wanted the driver program (wherever it may live) to give a
whole pipeline definition to the runner and the runner can then "optimize"
it by performing any additional PTransform replacements.


Re: Assigning Reviewers in GitHub?

2019-06-21 Thread Lukasz Cwik
Only a few people have permission to update the 'Reviewers' section, and I
believe you have to be either a project PMC member or a committer to be able
to update it, which is why people should use "R: @GITHUB-USERNAME" as
specified in the contribution guide[1].

1: https://beam.apache.org/contribute/#make-your-change

On Thu, Jun 20, 2019 at 11:00 PM Bruce Arctor 
wrote:

> Hi Dev,
>
> Today at Beam Summit, in the Contributing workshop, Robert suggested adding
> a reviewer by assigning someone in the 'Reviewers' section on the right
> (where labels, assignees, milestones, etc. live).  This was not possible
> without some additional permissions, but it is not clear what that
> requires?  Is it somehow joining the Apache Org on GitHub, having
> contributor permissions, etc?
>
> The other suggestion was to include "R: @GITHUB-USERNAME".  to mark/notify
> people.  Which does work.  But would be interested in understanding the
> practices for assigning reviewers in the method mentioned above.
>
> Please advise; this was also written on-list so that this knowledge can
> be found/referenced in the future.
>
> Thanks,
> Austin
>
>


Re: [VOTE] Release vendored artifacts upgrading Guava usage to 26.0-jre, release candidate #1

2019-06-20 Thread Lukasz Cwik
We should verify the signatures of the artifacts.

Otherwise, there is little risk in releasing these artifacts because no one
consumes them yet. PR/8899[1] updates Apache Beam to start using them and
will go through the regular precommit/postcommit tests.

If you want to perform additional validation you can:
* clone the PR and run any tests that you may want after fetching the
artifacts and placing them in your local maven repo
* download the artifacts and manually validate the classes only appear in
the org.apache.beam.vendor namespace with the appropriate package prefix.
Note that there is a unit test that does this as part of the publishing
process[2].
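The namespace check in the second bullet can be scripted (a sketch with an in-memory toy jar; in real use, point it at the downloaded artifact on disk):

```python
# A sketch of the manual namespace check: every .class entry in a vendored
# jar should live under org/apache/beam/vendor/. The in-memory "jar" below
# is a stand-in for the downloaded artifact.

import io
import zipfile

def misplaced_classes(jar_bytes, prefix="org/apache/beam/vendor/"):
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        return [n for n in jar.namelist()
                if n.endswith(".class") and not n.startswith(prefix)]

# Build a toy jar: one properly relocated class and one that leaked.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as jar:
    jar.writestr("org/apache/beam/vendor/guava/v26_0_jre/Lists.class", b"")
    jar.writestr("com/google/common/collect/Lists.class", b"")  # not relocated!

print(misplaced_classes(buf.getvalue()))
# ['com/google/common/collect/Lists.class']
```

An empty result means every class was relocated under the vendored prefix, which is the same property the unit test in the publishing process enforces.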

This thread[3] is an example of previous release of vendored artifacts.

1: https://github.com/apache/beam/pull/8899
2:
https://github.com/apache/beam/blob/c775eda2df6457a784a1945d16cf781abb453d5f/buildSrc/src/main/groovy/org/apache/beam/gradle/VendorJavaPlugin.groovy#L127
3:
https://lists.apache.org/thread.html/9efb2aeab102e41367bf6b1f274d3ee5990024afd934392a339c4d00@%3Cdev.beam.apache.org%3E

On Thu, Jun 20, 2019 at 11:20 AM Ahmet Altay  wrote:

> What is the best way to validate this?
>
> On Thu, Jun 20, 2019 at 9:51 AM Lukasz Cwik  wrote:
>
>> Hi everyone,
>>
>> Please review the release of the following artifacts that we vendor:
>> beam-vendor-guava-26_0-jre
>> beam-vendor-grpc-1_21_0
>>
>> Please vote as follows:
>> [ ] +1, Approve the release
>> [ ] -1, Do not approve the release (please provide specific comments)
>>
>> The complete staging area is available for your review, which includes:
>> * all artifacts to be deployed to the Maven Central Repository [1],
>> * commit hash "996b4c3733545aaa3b93fd35296a391126026a1c" [2],
>> * which is signed with the key with fingerprint
>> EAD5DE293F4A03DD2E77565589E68A56E371CCA2 [3],
>>
>> The vote will be open for at least 72 hours. It is adopted by majority
>> approval, with at least 3 PMC affirmative votes.
>>
>> Note I have no intention to get this into the current 2.14 release that
>> is being worked on and will have the version update go out with the next
>> release.
>>
>> Thanks,
>> Luke
>>
>> [1]
>> https://repository.apache.org/content/repositories/orgapachebeam-1074/
>> [2] https://github.com/apache/beam/pull/8899
>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>
>>


[VOTE] Release vendored artifacts upgrading Guava usage to 26.0-jre, release candidate #1

2019-06-20 Thread Lukasz Cwik
Hi everyone,

Please review the release of the following artifacts that we vendor:
beam-vendor-guava-26_0-jre
beam-vendor-grpc-1_21_0

Please vote as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which includes:
* all artifacts to be deployed to the Maven Central Repository [1],
* commit hash "996b4c3733545aaa3b93fd35296a391126026a1c" [2],
* which is signed with the key with fingerprint
EAD5DE293F4A03DD2E77565589E68A56E371CCA2 [3],

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

Note I have no intention to get this into the current 2.14 release that is
being worked on and will have the version update go out with the next
release.

Thanks,
Luke

[1] https://repository.apache.org/content/repositories/orgapachebeam-1074/
[2] https://github.com/apache/beam/pull/8899
[3] https://dist.apache.org/repos/dist/release/beam/KEYS


Re: [Forked] BEAM-4046 (was [PROPOSAL] Introduce beam-sdks-java gradle project)

2019-06-17 Thread Lukasz Cwik
I have opened up https://github.com/apache/beam/pull/8881 which migrates
the jenkins runs to use the new names since it has been a month since the
rename went in.
After this, we should remove the deprecated project mappings and require
everyone to use the directory based names instead of the artifact names as
agreed upon in this thread.

On Mon, May 13, 2019 at 1:57 AM Michael Luckey  wrote:

> Thanks Kenn!
>
> This change will be merged today, about 13:00 Pacific Time.
>
> From then on, we will have ditched the flat project layout, i.e., projects
> will need to be referenced by full path instead of some 'symbolic' name.
> E.g. project
> ':beam-sdks-java-core' will be referenced as ':sdks:java:core', which is
> just the relative folder path to the project.
>
> We will not run the seed job immediately, because a compatibility layer
> for Gradle command line is implemented, which enables running task
> referenced by the old style, i.e.
> ':beam-sdks-java-core:build' is still working for a while. After running
> the seed job, conflicting PRs might not be able to run on Jenkins anymore,
> but a rebase is required anyway
> so this should be done as soon as possible.
>
> Regarding the IDE, an import is required. This should work, apart from the
> known issues, as before. Unfortunately I was unable to test every new option
> available. E.g. we now have those
> 'runners:validateRunner' et al. tasks which did not 'exist' before. They
> should work, but if any issues arise, please open a jira and feel free to
> ping/assign to me.
>
> Thanks,
>
> michel
>
> On Thu, May 2, 2019 at 6:00 PM Kenneth Knowles  wrote:
>
>> The issue has been discussed for a full month, with no objections. I'd
>> call that lazy consensus. And since you have found a way to be backwards
>> compatible, it doesn't even have to impact docs or scripts. This is great.
>>
>> Kenn
>>
>> On Thu, May 2, 2019 at 8:43 AM Michael Luckey 
>> wrote:
>>
>>> Hi,
>>>
>>> after implementing the required changes to switch from the current flat
>>> Gradle project structure to the hierarchical represented by the folder
>>> hierarchy I propose to merge the changes [1] after cut of next release
>>> branch (which is scheduled around May, 8th.)
>>>
>>> Does anyone have any concerns or objections doing this change (or doing
>>> it now) or can we proceed as suggested? Any questions about this?
>>>
>>> Best,
>>>
>>> michel
>>>
>>> [1] https://github.com/apache/beam/pull/8194
>>>
>>>
>>>
>>> On Thu, Apr 11, 2019 at 1:04 AM Michael Luckey 
>>> wrote:
>>>
>>>> To my understanding, that's it, yes. Of course, there might be other
>>>> places/plugins which use plugin.group. But maven coordinates are definitely
>>>> those which need to be consistent.
>>>>
>>>> On Thu, Apr 11, 2019 at 12:57 AM Lukasz Cwik  wrote:
>>>>
>>>>> We may be saying the same thing but wanted to be clear that we only
>>>>> need to override the default that publishing plugin uses to always be
>>>>> "org.apache.beam" instead of defaulting to project.group
>>>>>
>>>>> On Wed, Apr 10, 2019 at 3:22 PM Kenneth Knowles 
>>>>> wrote:
>>>>>
>>>>>> So, if we set the "group" on projects only as part of publishing then
>>>>>> everything starts to work? That sounds ideal.
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>> On Tue, Apr 9, 2019 at 3:49 PM Lukasz Cwik  wrote:
>>>>>>
>>>>>>> It would be good if we did as much as possible to make our project
>>>>>>> as much as a conventional Gradle project. It means that more people 
>>>>>>> will be
>>>>>>> familiar with the setup, our setup will likely require less maintenance
>>>>>>> with version bumps in gradle and also that examples/solutions online 
>>>>>>> will
>>>>>>> relate better to our project.
>>>>>>>
>>>>>>> On Mon, Apr 8, 2019 at 6:22 PM Michael Luckey 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> After playing around, it turns out to be rather straightforward.
>>>>>>>> The problem is not really caused by a Gradle bug, but more by the usual
>>>>>>>> issue that deviating from gradle defaults/conventions often causes
>>>>>>

Re: Jira permissions

2019-06-13 Thread Lukasz Cwik
Welcome, I have added you as a contributor and assigned BEAM-7542 to you.

On Wed, Jun 12, 2019 at 9:18 PM Viktor Gerdin  wrote:

>
>Hello
>
>My name is Viktor
>
>I've encountered an issue (BEAM-7542
>) and would like to
>contribute a solution. I'd like to be added as a Jira contributor so
>that I can assign issues to myself. My ASF Jira Username is viktor.gerdin.
>
>Regards
>
>
>--
>V


DRAFT - Apache Beam Board Report - June '19

2019-06-13 Thread Lukasz Cwik
Hi all

Our next project report to the ASF Board of Directors is due June 14th.
I've seeded a draft here:
https://docs.google.com/document/d/1GY16lzVKL-mPh4M560AtqPAB1kXEptkhcBymvFr-4z8/edit?usp=sharing

Please help to eliminate all the TODOs by adding suggestions.

Luke


Re: Help triaging Jira issues

2019-06-12 Thread Lukasz Cwik
I looked at automating the two in JIRA but got this unhelpful message:

"You are using Automation Lite for Jira. This is the free offering of
Automation for Jira Pro and only contains a small subset of the many
awesome features of the paid app. For example, project admins like yourself
can only create and edit automation rules in the paid offering."

On Wed, Jun 12, 2019 at 2:22 AM Ismaël Mejía  wrote:

> Kenn, can you or someone else with more JIRA-fu than me automate both
> cases? (I
> just triaged most of the still untriaged issues and found multiple new
> instances of
> both cases).
>
> On Fri, Jun 7, 2019 at 10:27 PM Kenneth Knowles  wrote:
> >
> > Nice. I noticed the huge drop in untriaged issues. Both of those ideas
> for automation sound reasonable.
> >
> > I think the other things that are harder to optimize can probably be
> addressed by re-triaging stale bugs. We will probably find those that
> should have been closed and those that are just sitting on an inactive
> contributor.
> >
> > Kenn
> >
> > On Fri, Jun 7, 2019 at 12:53 AM Ismaël Mejía  wrote:
> >>
> >> I took a look and reduced the untriaged issues to around 100. I
> >> noticed, however, some patterns that are producing more untriaged issues
> >> than we should have. Those can probably be automated (if JIRA has ways
> >> to do it):
> >>
> >> 1. Issues created and assigned on creation can be marked as open.
> >> 2. Once an issue has an associated PR it could be marked as open if it
> >> was in Triaged state.
> >>
> >> Another common case that is probably harder to automate is issues that
> >> are in Triaged state because we forgot to resolve/close them. I don’t
> >> know how we can improve these, apart from reminding people to check that
> >> they do not have untriaged assigned issues.
> >>
> >> Another interesting triage task is the issues that are Open and
> >> assigned to members of the community who are not active anymore in
> >> the project, but that’s probably worth a separate discussion, as well
> >> as how can we more effectively track open unassigned issues (which are
> >> currently around 1600).
> >>
> >> On Wed, Jun 5, 2019 at 7:03 PM Tanay Tummalapalli 
> wrote:
> >> >
> >> > Hi Kenneth,
> >> >
> >> > I already follow the issues@ mailing list pretty much daily.
> >> > I'd like to help with triaging issues, especially ones related to the
> Python SDK since I'm most familiar with it.
> >> >
> >> > On Wed, Jun 5, 2019 at 10:26 PM Alex Van Boxel 
> wrote:
> >> >>
> >> >> Hey Kenneth, I'll help out. I'm planning to contribute more on Beam and
> it seems an ideal way to keep up-to-date with the project.
> >> >>
> >> >>  _/
> >> >> _/ Alex Van Boxel
> >> >>
> >> >>
> >> >> On Wed, Jun 5, 2019 at 6:46 PM Kenneth Knowles 
> wrote:
> >> >>>
> >> >>> Hi all,
> >> >>>
> >> >>> I am requesting help in triaging incoming issues. I made a search
> here: https://issues.apache.org/jira/issues/?filter=12345682
> >> >>>
> >> >>> I have a daily email subscription to this filter as a reminder, but
> rarely can really sit down to do triage for very long. It has grown from
> just under 200 to just over 200. The rate is actually pretty low but there
> is a backlog. I also want to start re-triaging stale bugs but priority
> would be (1) keep up with new bugs (2) clear backlog (3) re-triage stale
> bugs.
> >> >>>
> >> >>> Just FYI, what I look for before I click "Triaged" is:
> >> >>>
> >> >>>  - correct component
> >> >>>  - correct priority
> >> >>>  - maybe ping someone in a comment or assign
> >> >>>  - write to dev@ if it is a major problem
> >> >>>
> >> >>> If I can't figure that out, then I ask the reporter for
> clarification and "Start Watching" the issue so I will receive their
> response.
> >> >>>
> >> >>> To avoid duplicate triage work, it may help to assign the issue to
> yourself temporarily during the triage phase.
> >> >>>
> >> >>> Any help greatly appreciated!
> >> >>>
> >> >>> Kenn
>
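The two automation ideas Ismaël lists above (marking assigned issues as Open on creation, and transitioning Triaged issues once a PR is linked) could be sketched against JIRA's REST API. This is a hypothetical sketch, not a working bot: the status name "Triage Needed", the JQL filter, and the transition id are assumptions about the Beam project's workflow, and real use would need authentication.

```python
import json
import urllib.parse
import urllib.request

JIRA = "https://issues.apache.org/jira/rest/api/2"

def triage_jql(project="BEAM", status="Triage Needed"):
    # Idea 1: issues created with an assignee but never moved out of triage.
    # The status name is an assumption about the project's workflow.
    return f'project = {project} AND status = "{status}" AND assignee is not EMPTY'

def search_issue_keys(jql):
    # JIRA's standard search endpoint; returns the matching issue keys.
    url = JIRA + "/search?" + urllib.parse.urlencode({"jql": jql, "fields": "key"})
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return [issue["key"] for issue in data["issues"]]

def open_transition_request(key, transition_id):
    # Idea 2 would trigger this when a PR gets linked: apply the workflow
    # transition that moves the issue to Open. Transition ids vary per
    # workflow, so this only builds the request; a caller would submit it
    # with authenticated credentials.
    return urllib.request.Request(
        f"{JIRA}/issue/{key}/transitions",
        data=json.dumps({"transition": {"id": transition_id}}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A periodic job running `search_issue_keys(triage_jql())` and posting the resulting transition requests would cover both cases, assuming JIRA's built-in automation cannot.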


Re: Python dependency compatibility badges

2019-06-12 Thread Lukasz Cwik
SGTM

On Wed, Jun 12, 2019 at 8:53 AM Ahmet Altay  wrote:

> Looks like a nice improvement to me. To make it very explicit, it seems to
> focus on compatibility issues with Google-managed libraries, even though the
> reports identify generally outdated dependencies as well.
>
> On Wed, Jun 12, 2019 at 6:19 AM Ismaël Mejía  wrote:
>
>> +1 looks nice and can motivate people to tackle pending upgrades.
>>
>> On Wed, Jun 12, 2019 at 12:30 PM Robert Bradshaw 
>> wrote:
>> >
>> > We have a pull request to add Pypi dependency compatibility badges to
>> > our readme: https://github.com/apache/beam/pull/8791 This looks
>> > generally useful, though highlights how out of date we still are in
>> > some areas. Any thoughts on this? Given the high visibility, I wanted
>> > to get some consensus (even if just lazy) before merging.
>> >
>> > For reference, the badges (and their target links) are
>> >
>> > [![Compat Check PyPI](https://python-compatibility-tools.appspot.com/one_badge_image?package=apache-beam%5Bgcp%5D)](https://python-compatibility-tools.appspot.com/one_badge_target?package=apache-beam%5Bgcp%5D)
>> >
>> > [![Compat Check Github](https://python-compatibility-tools.appspot.com/one_badge_image?package=git%2Bgit%3A//github.com/apache/beam.git%23subdirectory%3Dsdks/python)](https://python-compatibility-tools.appspot.com/one_badge_target?package=git%2Bgit%3A//github.com/apache/beam.git%23subdirectory%3Dsdks/python)
>>
>
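For reference, the `package` parameters in the badge URLs above are percent-encoded; decoding them with Python's standard library shows the actual package specs being checked:

```python
from urllib.parse import unquote

# The badge URLs percent-encode the package specifier in the query string.
pypi_pkg = unquote("apache-beam%5Bgcp%5D")
github_pkg = unquote(
    "git%2Bgit%3A//github.com/apache/beam.git%23subdirectory%3Dsdks/python"
)
print(pypi_pkg)    # apache-beam[gcp]
print(github_pkg)  # git+git://github.com/apache/beam.git#subdirectory=sdks/python
```

So the first badge checks the released `apache-beam[gcp]` package from PyPI, and the second checks the Python SDK straight from the GitHub repository.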


Re: Contributor permission request for Apache Beam Jira

2019-06-11 Thread Lukasz Cwik
You have been added.

On Tue, Jun 11, 2019 at 11:56 AM Andy Wang  wrote:

> It's anyyw, forgot to add that to my first email. Thanks!
>
> On Tue, Jun 11, 2019, 11:31 AM Lukasz Cwik  wrote:
>
>> Welcome, I tried to add you but there were multiple accounts with your
>> name. What is your JIRA id?
>>
>> On Mon, Jun 10, 2019 at 4:22 PM Andy Wang  wrote:
>>
>>> Hello,
>>>
>>> My name is Andy and I'd like to contribute where I can to the project.
>>> Been using the tool for over a year now and would like to start giving back
>>> to the community. Could someone give me the proper permissions to assign
>>> Jira tickets to myself?
>>>
>>> Thanks,
>>> Andy
>>>
>>



  1   2   3   4   5   6   7   8   9   >