Contributor Permission for Beam Jira tickets

2019-11-25 Thread David Song
Hi,

This is David from DataPLS EngProd team (wintermelons@). I am working on
integration tests with some Beam runners over Dataflow.
Can someone add me as a contributor to Beam's Jira tracker? I have an
open bug and would like to assign myself to it.
My Jira username is wintermelons, and the Jira ticket is
https://issues.apache.org/jira/browse/BEAM-8814

Thanks,
David


Re: real real-time beam

2019-11-25 Thread Kenneth Knowles
Hi Aaron,

Another insightful observation.

Whenever an aggregation (GBK / Combine per key) has a trigger firing, there
is a per-key sequence number attached. It is included in metadata known as
"PaneInfo" [1]. The value of PaneInfo.getIndex() is colloquially referred
to as the "pane index". You can also make use of the "on time index" if you
like. The best way to access this metadata is to add a parameter of type
PaneInfo to your DoFn's @ProcessElement method. This works for stateful or
stateless DoFn.

Most of Beam's IO connectors do not explicitly enforce that outputs occur
in pane index order but instead rely on the hope that the runner delivers
panes in order to the sink. IMO this is dangerous but it has not yet caused
a known issue. In practice, each "input key to output key 'path' " through
a pipeline's logic does preserve order for all existing runners AFAIK and
it is the formalization that is missing. It is related to an observation
by +Rui Wang that processing retractions requires the same key-to-key
ordering.

I will not try to formalize this notion in this email. But I will note that
since it is universally assured, it would be zero cost and significantly
safer to formalize it and add an annotation noting it was required. It has
nothing to do with event time ordering, only trigger firing ordering.
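Outside of Beam, the per-key contract Kenn describes can be sketched in a few
lines of plain Python. This is a hypothetical `OrderedSink`, not a Beam API:
`pane_index` stands in for the per-key sequence number that
`PaneInfo.getIndex()` exposes, and the sink commits firings only in pane-index
order.

```python
from collections import defaultdict

class OrderedSink:
    """Toy sink that commits trigger firings only in pane-index order per key.

    pane_index stands in for Beam's PaneInfo.getIndex(): a per-key sequence
    number attached to each trigger firing of an aggregation.
    """

    def __init__(self):
        self.next_index = defaultdict(int)  # key -> next expected pane index
        self.committed = []                 # (key, value) pairs actually written

    def write(self, key, value, pane_index):
        # Commit only if this firing is the next one in sequence for the key;
        # a real sink might buffer out-of-order firings instead of dropping them.
        if pane_index == self.next_index[key]:
            self.next_index[key] += 1
            self.committed.append((key, value))
            return True
        return False

sink = OrderedSink()
sink.write("k", "v0", 0)   # committed
sink.write("k", "v2", 2)   # out of order: pane 1 not seen yet, rejected
sink.write("k", "v1", 1)   # committed
```

Dropping here is only to keep the sketch short; the point is that the pane
index gives a sink enough information to detect out-of-order delivery at all.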

Kenn

[1]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/windowing/PaneInfo.java
[2]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/DoFn.java#L557


On Mon, Nov 25, 2019 at 4:06 PM Pablo Estrada  wrote:

> The blog posts on stateful and timely computation with Beam should help
> clarify a lot about how to use state and timers to do this:
> https://beam.apache.org/blog/2017/02/13/stateful-processing.html
> https://beam.apache.org/blog/2017/08/28/timely-processing.html
>
> You'll see there how there's an implicit per-single-element grouping for
> each key, so state and timers should support your use case very well.
>
> Best
> -P.
>
> On Mon, Nov 25, 2019 at 3:47 PM Steve Niemitz  wrote:
>
>> If you have a pipeline that looks like Input -> GroupByKey -> ParDo,
>> while it is not guaranteed, in practice the sink will observe the trigger
>> firings in order (per key), since it'll be fused to the output of the GBK
>> operation (in all runners I know of).
>>
>> There have been a couple threads about trigger ordering as well on the
>> list recently that might have more information:
>>
>> https://lists.apache.org/thread.html/b61a908289a692dbd80dd6a869759eacd45b308cb3873bfb77c4def6@%3Cdev.beam.apache.org%3E
>>
>> https://lists.apache.org/thread.html/20d11046d26174969ef44a781e409a1cb9f7c736e605fa40fdf98397@%3Cuser.beam.apache.org%3E
>>
>>
>> On Mon, Nov 25, 2019 at 5:52 PM Aaron Dixon  wrote:
>>
>>> @Jan @Pablo Thank you
>>>
>>> @Pablo In this case it's a single global windowed Combine/perKey,
>>> triggered per element. Keys are few (client accounts) so they can live
>>> forever.
>>>
>>> It looks like just by virtue of using a stateful ParDo I could get this
>>> final execution to be "serialized" per key. (Then I could simply do the
>>> compare-and-swap using Beam's state mechanism to keep track of the "latest
>>> trigger timestamp" instead of having to orchestrate compare-and-swap in the
>>> target store :thinking:.)
>>>
>>>
>>>
>>> On Mon, Nov 25, 2019 at 4:14 PM Jan Lukavský  wrote:
>>>
 One addition, to make the list of options exhaustive, there is probably
 one more option

   c) create a ParDo keyed by primary key of your sink, cache the last
 write in there and compare it locally, without the need to query the
 database

 It would still need some timer to clear values after watermark +
 allowed
 lateness, because otherwise you would have to cache your whole database
 on workers. But because you don't need actual ordering, you just need
 the most recent value (if I got it right) this might be an option.
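 A Beam-independent sketch of option c) in plain Python (names are made up; in
 a real pipeline the cache would live in per-key state and the cleanup would
 be an event-time timer set to watermark + allowed lateness):

```python
class LocalLastWriteCache:
    """Caches the last written (timestamp, value) per sink key and drops
    stale updates locally, without querying the database."""

    def __init__(self, allowed_lateness):
        self.allowed_lateness = allowed_lateness
        self.cache = {}   # key -> (event_ts, value)
        self.expiry = {}  # key -> time after which the entry may be cleared

    def on_element(self, key, value, event_ts, watermark):
        self._expire(watermark)
        prev = self.cache.get(key)
        if prev is not None and event_ts <= prev[0]:
            return None  # not newer than what we already wrote: drop
        self.cache[key] = (event_ts, value)
        self.expiry[key] = event_ts + self.allowed_lateness
        return (key, value)  # would be emitted to the sink

    def _expire(self, watermark):
        # stands in for the cleanup timer firing
        for k in [k for k, t in self.expiry.items() if t < watermark]:
            del self.cache[k]
            del self.expiry[k]

c = LocalLastWriteCache(allowed_lateness=10)
c.on_element("pk1", "v1", event_ts=100, watermark=95)
stale = c.on_element("pk1", "v0", event_ts=99, watermark=96)   # dropped
c.on_element("pk1", "v2", event_ts=105, watermark=98)
```

 The expiry step is exactly why the timer is needed: without it, the cache
 grows to hold the whole database on the workers.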

 Jan

 On 11/25/19 10:53 PM, Jan Lukavský wrote:
 > Hi Aaron,
 >
 > maybe someone else will give another option, but if I understand
 > correctly what you want to solve, then you essentially have to do
 either:
 >
 >  a) use the compare & swap mechanism in the sink you described
 >
 >  b) use a buffer to buffer elements inside the outputting ParDo and
 > only output them when watermark passes (using a timer).
 >
 > There is actually an ongoing discussion about how to make option b)
 > user-friendly and part of Beam itself, but currently there is no
 > out-of-the-box solution for that.
 >
 > Jan
 >
 > On 11/25/19 10:27 PM, Aaron Dixon wrote:
 >> Suppose I trigger a Combine per-element (in a high-volume stream) and
 >> use a ParDo as a sink.
 >>
 >> I assume there is no guarantee about the order that my ParDo will see
 >> these triggers, especially as it processes in parallel, anyway.

Re: cython test instability

2019-11-25 Thread Chad Dombrova
Actually, it looks like I'm getting the same error on multiple PRs:
https://scans.gradle.com/s/ihfmrxr7evslw




On Mon, Nov 25, 2019 at 10:26 PM Chad Dombrova  wrote:

> Hi all,
> The cython tests started failing on one of my PRs, which was succeeding
> before. The error is one that I've never seen before (separated onto
> different lines to make it easier to read):
>
> Caused by: org.gradle.api.GradleException:
> Could not copy file
> '/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit@2
> /src/sdks/python/.eggs/simplegeneric-0.8.1-py2.7.egg'
> to
> '/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit@2
> /src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/.eggs/simplegeneric-0.8.1-py2.7.egg'.
>
> Followed immediately by an error about not being able to create a
> directory of the same name. Here's the gradle scan:
>
>
> https://scans.gradle.com/s/ihfmrxr7evslw/failure?openFailures=WzFd=WzZd#top=0
>
> Any ideas?
>
> -chad
>
>
>
>
>


cython test instability

2019-11-25 Thread Chad Dombrova
Hi all,
The cython tests started failing on one of my PRs, which was succeeding
before. The error is one that I've never seen before (separated onto
different lines to make it easier to read):

Caused by: org.gradle.api.GradleException:
Could not copy file
'/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit@2
/src/sdks/python/.eggs/simplegeneric-0.8.1-py2.7.egg'
to
'/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit@2
/src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/.eggs/simplegeneric-0.8.1-py2.7.egg'.

Followed immediately by an error about not being able to create a
directory of the same name. Here's the gradle scan:

https://scans.gradle.com/s/ihfmrxr7evslw/failure?openFailures=WzFd=WzZd#top=0

Any ideas?

-chad


Re: [ANNOUNCE] New committer: Daniel Oliveira

2019-11-25 Thread Tanay Tummalapalli
Congratulations!

On Mon, Nov 25, 2019 at 11:12 PM Mark Liu  wrote:

> Congratulations, Daniel!
>
> On Mon, Nov 25, 2019 at 9:31 AM Ahmet Altay  wrote:
>
>> Congratulations, Daniel!
>>
>> On Sat, Nov 23, 2019 at 3:47 AM jincheng sun 
>> wrote:
>>
>>>
>>> Congrats, Daniel!
>>> Best,
>>> Jincheng
>>>
>>> On Fri, Nov 22, 2019 at 5:47 PM, Alexey Romanenko wrote:
>>>
 Congratulations, Daniel!

 On 22 Nov 2019, at 09:18, Jan Lukavský  wrote:

 Congrats Daniel!
 On 11/21/19 10:11 AM, Gleb Kanterov wrote:

 Congratulations!

 On Thu, Nov 21, 2019 at 6:24 AM Thomas Weise  wrote:

> Congratulations!
>
>
> On Wed, Nov 20, 2019, 7:56 PM Chamikara Jayalath 
> wrote:
>
>> Congrats!!
>>
>> On Wed, Nov 20, 2019 at 5:21 PM Daniel Oliveira <
>> danolive...@google.com> wrote:
>>
>>> Thank you everyone! I won't let you down. o7
>>>
>>> On Wed, Nov 20, 2019 at 2:12 PM Ruoyun Huang 
>>> wrote:
>>>
 Congrats Daniel!

 On Wed, Nov 20, 2019 at 1:58 PM Robert Burke 
 wrote:

> Congrats Daniel! Much deserved.
>
> On Wed, Nov 20, 2019, 12:49 PM Udi Meiri  wrote:
>
>> Congrats Daniel!
>>
>> On Wed, Nov 20, 2019 at 12:42 PM Kyle Weaver 
>> wrote:
>>
>>> Congrats Dan! Keep up the good work :)
>>>
>>> On Wed, Nov 20, 2019 at 12:41 PM Cyrus Maden 
>>> wrote:
>>>
 Congratulations! This is great news.

 On Wed, Nov 20, 2019 at 3:24 PM Rui Wang 
 wrote:

> Congrats!
>
>
> -Rui
>
> On Wed, Nov 20, 2019 at 11:48 AM Valentyn Tymofieiev <
> valen...@google.com> wrote:
>
>> Congrats, Daniel!
>>
>> On Wed, Nov 20, 2019 at 11:47 AM Kenneth Knowles <
>> k...@apache.org> wrote:
>>
>>> Hi all,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming a
>>> new committer: Daniel Oliveira
>>>
>>> Daniel introduced himself to dev@ over two years ago and
>>> has contributed in many ways since then. Daniel has contributed 
>>> to general
>>> project health, the portability framework, and all three 
>>> languages: Java,
>>> Python SDK, and Go. I would like to particularly highlight how 
>>> he deleted
>>> 12k lines of dead reference runner code [1].
>>>
>>> In consideration of Daniel's contributions, the Beam PMC
>>> trusts him with the responsibilities of a Beam committer [2].
>>>
>>> Thank you, Daniel, for your contributions and looking
>>> forward to many more!
>>>
>>> Kenn, on behalf of the Apache Beam PMC
>>>
>>> [1] https://github.com/apache/beam/pull/8380
>>> [2]
>>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>>
>>

 --
 
 Ruoyun  Huang





Re: [Discuss] Beam Summit 2020 Dates & locations

2019-11-25 Thread Ahmet Altay
On Thu, Nov 21, 2019 at 2:49 PM Aizhamal Nurmamat kyzy 
wrote:

> Maria put together this document of related industry conferences [1];
> it would make sense to choose a time that doesn't conflict with other
> events around projects close to Beam.
>
> How about for June 21-22 (around Spark Summit) for North America, and
> October 5-6 or October 12-13 for Europe?
>

Sounds good to me. IMO October 12-13 is better for Europe, as it falls
between the related Flink and Spark summits.


>
> [1]
> https://docs.google.com/spreadsheets/d/1LQv95XP9UPhZjGqcMA8JksfcAure-fAp4h3TTjaafRs/edit#gid=1680445982
>
> On Tue, Nov 12, 2019 at 4:45 PM Udi Meiri  wrote:
>
>> +1 for better organization. I would have gone to ApacheCon LV had I known
>> there was going to be a Beam summit there.
>>
>> On Tue, Nov 12, 2019 at 9:31 AM Alexey Romanenko <
>> aromanenko@gmail.com> wrote:
>>
>>> On 8 Nov 2019, at 11:32, Maximilian Michels  wrote:
>>> >
>>> > The dates sound good to me. I agree that the bay area has an
>>> advantage because of its large tech community. On the other hand, it is a
>>> question of how we run the event. For Berlin we managed to get about 200
>>> attendees, but for the Beam Summit in Las Vegas with ApacheCon the
>>> attendance was much lower.
>>>
>>> I agree with your point, Max, and I believe that it would be more
>>> efficient to run Beam Summit as a "standalone" event (as was done in
>>> London and Berlin), which will allow us to attract a mostly
>>> Beam-focused audience, compared to running it as part of ApacheCon or
>>> another large conference with many other topics and tracks.
>>>
>>> > Should this also be discussed on the user mailing list?
>>>
>>> Definitely! That said, although users' opinions are a key point here,
>>> it will not be easy to get unbiased statistics on this question.
>>>
>>> The time frames are also very important, since holidays in different
>>> countries (for example, August is traditionally a "vacation month" in
>>> France and some other European countries) can affect people's
>>> availability and influence the final number of participants.
>>>
>>> >
>>> > Cheers,
>>> > Max
>>> >
>>> > On 07.11.19 22:50, Alex Van Boxel wrote:
>>> >> Date-wise, I'm wondering why we would switch the Europe and
>>> NA ones; this would mean that the Berlin and the new EU summits would be
>>> almost 1.5 years apart.
>>> >>  _/
>>> >> _/ Alex Van Boxel
>>> >> On Thu, Nov 7, 2019 at 8:43 PM Ahmet Altay >> al...@google.com>> wrote:
>>> >>I prefer the bay area for the NA summit. My reasoning is that there
>>> >>is a critical mass of contributors and users in that location,
>>> >>probably more than in alternative NA locations. I was not involved
>>> >>with planning recently and I do not know if there were people who
>>> >>could not attend due to the location previously. If that is the case,
>>> >>I agree with Elliotte on looking for other options.
>>> >>Related to dates: March (Asia) and mid-May (NA) are a bit close.
>>> >>Mid-June for NA might be better to spread out the events. Other
>>> >>pieces look good.
>>> >>Ahmet
>>> >>On Thu, Nov 7, 2019 at 7:09 AM Elliotte Rusty Harold
>>> >>mailto:elh...@ibiblio.org>> wrote:
>>> >>The U.S. sadly is not a reliable destination for international
>>> >>conferences these days. Almost every conference I go to, big
>>> and
>>> >>small, has at least one speaker, sometimes more, who can't get
>>> into
>>> >>the country. Canada seems worth considering. Vancouver,
>>> >>Montreal, and
>>> >>Toronto are all convenient.
>>> >>On Wed, Nov 6, 2019 at 2:17 PM Griselda Cuevas <
>>> g...@apache.org
>>> >>> wrote:
>>> >> >
>>> >> > Hi Beam Community!
>>> >> >
>>> >> > I'd like to kick off a thread to discuss potential dates and
>>> >>venues for the 2020 Beam Summits.
>>> >> >
>>> >> > I did some research on industry conferences happening in
>>> 2020
>>> >>and pre-selected a few ranges as follows:
>>> >> >
>>> >> > (2 days) NA between mid-May and mid-June
>>> >> > (2 days) EU mid October
>>> >> > (1 day) Asia Mini Summit:  March
>>> >> >
>>> >> > I'd like to hear your thoughts on these dates and get
>>> >>consensus on exact dates as the convo progresses.
>>> >> >
>>> >> > For locations these are the options I reviewed:
>>> >> >
>>> >> > NA: Austin Texas, Berkeley California, Mexico City.
>>> >> > Europe: Warsaw, Barcelona, Paris
>>> >> > Asia: Singapore
>>> >> >
>>> >> > Let the discussion begin!
>>> >> > G (on behalf of the Beam Summit Steering Committee)
>>> >> >
>>> >> >
>>> >> >
>>> >>-- Elliotte Rusty Harold
>>> >>elh...@ibiblio.org 

Re: Cleaning up Approximate Algorithms in Beam

2019-11-25 Thread Reza Rokni
Hi,

So do we need a vote for the final list of actions? Or is this thread
enough to go ahead and raise the PRs?

Cheers

Reza

On Tue, 26 Nov 2019 at 06:01, Ahmet Altay  wrote:

>
>
> On Mon, Nov 18, 2019 at 10:57 AM Robert Bradshaw 
> wrote:
>
>> On Sun, Nov 17, 2019 at 5:16 PM Reza Rokni  wrote:
>>
>>> *Ahmet: FWIW, there is a Python implementation only for this version:
>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38*
>>> Eventually we will be able to make use of cross-language transforms to
>>> help with feature parity. Until then, are we OK with marking this
>>> deprecated in Python even though we do not have another solution? Or
>>> should we leave it as is in Python for now, as it does not have sketch
>>> capability and so can only be used for outputting results directly from
>>> the pipeline?
>>>
>>
> If it is our intention to add the capability eventually, IMO it makes
> sense to mark the existing functionality deprecated in Python as well.
>
>
>>> *Reuven: I think this is the sort of thing that has been experimental
>>> forever, and therefore not experimental (e.g. the entire triggering API is
>>> experimental as are all our file-based sinks). I think that many users use
>>> this, and probably store the state implicitly in streaming pipelines.*
>>> True, I have an old action item to try to go through and PR against
>>> old @Experimental annotations, but I need to find time. So for this
>>> discussion: I guess this should be marked as deprecated if we change
>>> it, even though it's @Experimental.
>>>
>>
>> Agreed.
>>
>>
>>> *Rob: I'm not following this--by naming things after their
>>> implementation rather than their intent I think they will be harder to
>>> search for. *
>>> This is to add the implementation to the name, after the intent. For
>>> example, ApproximateCountDistinctZetaSketch should be easy to search
>>> for, and it is clear which implementation is used. This leaves room for
>>> a potentially better implementation behind ApproximateCountDistinct.
>>>
>>
>> OK, if we have both I'm more OK with that. This is better than the names
>> like HllCount, which seems to be what was suggested.
>>
>>> Another approach would be to have a required parameter which is an
>>> enum of the implementation options, e.g.
>>> ApproximateCountDistinct.of().usingImpl(ZETA)?
>>>
>>
>> Ideally this could be an optional parameter, or possibly only required
>> during update, until we figure out a good way for the runner to plug this
>> in appropriately.
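The `usingImpl` idea could look roughly like this. This is purely a sketch of
the proposed shape, not an existing Beam API; the class names and the enum
values are hypothetical:

```python
from enum import Enum

class CountDistinctImpl(Enum):
    """The accumulator/sketch format is part of the public contract
    (the update-compatibility problem), so it is selected explicitly."""
    ZETA_SKETCH = "zetasketch"
    HLL_PLUS = "hyperloglog-plus"

class ApproximateCountDistinct:
    """Intent-named transform; implementation selectable, with a default
    that a runner could override or pin during pipeline update."""

    def __init__(self, impl=CountDistinctImpl.ZETA_SKETCH):
        self.impl = impl

    @classmethod
    def of(cls):
        return cls()

    def using_impl(self, impl):
        self.impl = impl
        return self

# intent in the transform name, implementation as an explicit parameter
t = ApproximateCountDistinct.of().using_impl(CountDistinctImpl.ZETA_SKETCH)
```

This keeps the intent searchable while making the state format explicit where
it matters for update compatibility.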
>>
>>> Rob/Kenn: on the Combiner discussion, should we tie the action items
>>> from this thread to that larger discussion?
>>>
>>> Cheers
>>> Reza
>>>
>>> On Fri, 15 Nov 2019 at 08:32, Robert Bradshaw 
>>> wrote:
>>>
 On Thu, Nov 14, 2019 at 1:06 AM Kenneth Knowles 
 wrote:

> Wow. Nice summary, yes. Major calls to action:
>
> 0. Never allow a combiner that does not include the format of its
> state clear in its name/URN. The "update compatibility" problem makes 
> their
> internal accumulator state essentially part of their public API. Combiners
> named for what they do are an inherent risk, since we might have a new way
> to do the same operation with different implementation-detail state.
>

 It seems this will make for a worse user experience, motivated solely
 by limitations in our implementation. I think we can do better.
 Hypothetical idea: what if upgrade required access to the original graph
 (or at least metadata about it) during construction? In this case an
 ApproximateDistinct could look at what was used last time and try to do the
 same, but be free to do something better when unconstrained. Another
 approach would be to encode several alternative expansions in the Beam
 graph and let the runner do the picking (based on prior submission).
 (Making the CombineFn, as opposed to the composite, have several
 alternatives seems harder to reason about, but maybe worth pursuing as
 well).

 This is not unique to Combiners, but any stateful DoFn, or composite
 operations with non-trivial internal structure (and coders). This has been
 discussed a lot, perhaps there are some ideas there we could borrow?

 And they will match search terms better, which is a major problem.
>

 I'm not following this--by naming things after their implementation
 rather than their intent I think they will be harder to search for.


> 1. Point users to HllCount. This seems to be the best of the three.
> Does it have a name that is clear enough about the format of its state?
> Noting that its Java package name includes zetasketch, perhaps.
> 2. Deprecate the others, at least. And remove them from e.g. Javadoc.
>

 +1


> On Wed, Nov 13, 2019 at 10:01 AM Reuven Lax  wrote:

Re: real real-time beam

2019-11-25 Thread Pablo Estrada
The blog posts on stateful and timely computation with Beam should help
clarify a lot about how to use state and timers to do this:
https://beam.apache.org/blog/2017/02/13/stateful-processing.html
https://beam.apache.org/blog/2017/08/28/timely-processing.html

You'll see there how there's an implicit per-single-element grouping for
each key, so state and timers should support your use case very well.

Best
-P.

On Mon, Nov 25, 2019 at 3:47 PM Steve Niemitz  wrote:

> If you have a pipeline that looks like Input -> GroupByKey -> ParDo, while
> it is not guaranteed, in practice the sink will observe the trigger firings
> in order (per key), since it'll be fused to the output of the GBK operation
> (in all runners I know of).
>
> There have been a couple threads about trigger ordering as well on the
> list recently that might have more information:
>
> https://lists.apache.org/thread.html/b61a908289a692dbd80dd6a869759eacd45b308cb3873bfb77c4def6@%3Cdev.beam.apache.org%3E
>
> https://lists.apache.org/thread.html/20d11046d26174969ef44a781e409a1cb9f7c736e605fa40fdf98397@%3Cuser.beam.apache.org%3E
>
>
> On Mon, Nov 25, 2019 at 5:52 PM Aaron Dixon  wrote:
>
>> @Jan @Pablo Thank you
>>
>> @Pablo In this case it's a single global windowed Combine/perKey,
>> triggered per element. Keys are few (client accounts) so they can live
>> forever.
>>
>> It looks like just by virtue of using a stateful ParDo I could get this
>> final execution to be "serialized" per key. (Then I could simply do the
>> compare-and-swap using Beam's state mechanism to keep track of the "latest
>> trigger timestamp" instead of having to orchestrate compare-and-swap in the
>> target store :thinking:.)
>>
>>
>>
>> On Mon, Nov 25, 2019 at 4:14 PM Jan Lukavský  wrote:
>>
>>> One addition, to make the list of options exhaustive, there is probably
>>> one more option
>>>
>>>   c) create a ParDo keyed by primary key of your sink, cache the last
>>> write in there and compare it locally, without the need to query the
>>> database
>>>
>>> It would still need some timer to clear values after watermark + allowed
>>> lateness, because otherwise you would have to cache your whole database
>>> on workers. But because you don't need actual ordering, you just need
>>> the most recent value (if I got it right) this might be an option.
>>>
>>> Jan
>>>
>>> On 11/25/19 10:53 PM, Jan Lukavský wrote:
>>> > Hi Aaron,
>>> >
>>> > maybe someone else will give another option, but if I understand
>>> > correctly what you want to solve, then you essentially have to do
>>> either:
>>> >
>>> >  a) use the compare & swap mechanism in the sink you described
>>> >
>>> >  b) use a buffer to buffer elements inside the outputting ParDo and
>>> > only output them when watermark passes (using a timer).
>>> >
>>> > There is actually an ongoing discussion about how to make option b)
>>> > user-friendly and part of Beam itself, but currently there is no
>>> > out-of-the-box solution for that.
>>> >
>>> > Jan
>>> >
>>> > On 11/25/19 10:27 PM, Aaron Dixon wrote:
>>> >> Suppose I trigger a Combine per-element (in a high-volume stream) and
>>> >> use a ParDo as a sink.
>>> >>
>>> >> I assume there is no guarantee about the order that my ParDo will see
>>> >> these triggers, especially as it processes in parallel, anyway.
>>> >>
>>> >> That said, my sink writes to a db or cache and I would not like the
>>> >> cache to ever regress its value to something "before" what it has
>>> >> already written.
>>> >>
>>> >> Is the best way to solve this problem to always write the event-time
>>> >> in the cache and do a compare-and-swap only updating the sink if the
>>> >> triggered value in-hand is later than the target value?
>>> >>
>>> >> Or is there a better way to guarantee that my ParDo sink will process
>>> >> elements in-order? (Eg, if I can give up per-event/real-time, then a
>>> >> delay-based trigger would probably be sufficient I imagine.)
>>> >>
>>> >> Thanks for advice!
>>>
>>


Re: real real-time beam

2019-11-25 Thread Steve Niemitz
If you have a pipeline that looks like Input -> GroupByKey -> ParDo, while
it is not guaranteed, in practice the sink will observe the trigger firings
in order (per key), since it'll be fused to the output of the GBK operation
(in all runners I know of).

There have been a couple threads about trigger ordering as well on the list
recently that might have more information:
https://lists.apache.org/thread.html/b61a908289a692dbd80dd6a869759eacd45b308cb3873bfb77c4def6@%3Cdev.beam.apache.org%3E
https://lists.apache.org/thread.html/20d11046d26174969ef44a781e409a1cb9f7c736e605fa40fdf98397@%3Cuser.beam.apache.org%3E


On Mon, Nov 25, 2019 at 5:52 PM Aaron Dixon  wrote:

> @Jan @Pablo Thank you
>
> @Pablo In this case it's a single global windowed Combine/perKey,
> triggered per element. Keys are few (client accounts) so they can live
> forever.
>
> It looks like just by virtue of using a stateful ParDo I could get this
> final execution to be "serialized" per key. (Then I could simply do the
> compare-and-swap using Beam's state mechanism to keep track of the "latest
> trigger timestamp" instead of having to orchestrate compare-and-swap in the
> target store :thinking:.)
>
>
>
> On Mon, Nov 25, 2019 at 4:14 PM Jan Lukavský  wrote:
>
>> One addition, to make the list of options exhaustive, there is probably
>> one more option
>>
>>   c) create a ParDo keyed by primary key of your sink, cache the last
>> write in there and compare it locally, without the need to query the
>> database
>>
>> It would still need some timer to clear values after watermark + allowed
>> lateness, because otherwise you would have to cache your whole database
>> on workers. But because you don't need actual ordering, you just need
>> the most recent value (if I got it right) this might be an option.
>>
>> Jan
>>
>> On 11/25/19 10:53 PM, Jan Lukavský wrote:
>> > Hi Aaron,
>> >
>> > maybe someone else will give another option, but if I understand
>> > correctly what you want to solve, then you essentially have to do
>> either:
>> >
>> >  a) use the compare & swap mechanism in the sink you described
>> >
>> >  b) use a buffer to buffer elements inside the outputting ParDo and
>> > only output them when watermark passes (using a timer).
>> >
>> > There is actually an ongoing discussion about how to make option b)
>> > user-friendly and part of Beam itself, but currently there is no
>> > out-of-the-box solution for that.
>> >
>> > Jan
>> >
>> > On 11/25/19 10:27 PM, Aaron Dixon wrote:
>> >> Suppose I trigger a Combine per-element (in a high-volume stream) and
>> >> use a ParDo as a sink.
>> >>
>> >> I assume there is no guarantee about the order that my ParDo will see
>> >> these triggers, especially as it processes in parallel, anyway.
>> >>
>> >> That said, my sink writes to a db or cache and I would not like the
>> >> cache to ever regress its value to something "before" what it has
>> >> already written.
>> >>
>> >> Is the best way to solve this problem to always write the event-time
>> >> in the cache and do a compare-and-swap only updating the sink if the
>> >> triggered value in-hand is later than the target value?
>> >>
>> >> Or is there a better way to guarantee that my ParDo sink will process
>> >> elements in-order? (Eg, if I can give up per-event/real-time, then a
>> >> delay-based trigger would probably be sufficient I imagine.)
>> >>
>> >> Thanks for advice!
>>
>


Re: [DISCUSS] AWS IOs V1 Deprecation Plan

2019-11-25 Thread Luke Cwik
Phase I sounds fine.

Apache Beam follows semantic versioning, and I believe removing the IOs
will be a backwards-incompatible change unless they were marked
experimental, which will be a problem for Phase II.

What is the feasibility of making the V1 transforms wrappers around V2?
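One way to keep a single maintained implementation during the deprecation
window is a thin V1 facade that delegates to V2. A rough sketch of the
pattern in Python (the class names here are made up for illustration, not the
actual Beam AWS IO classes):

```python
import warnings

class SnsWriteV2:
    """Stands in for a transform already migrated to the AWS SDK for Java V2."""
    def write(self, message):
        return "v2-wrote:" + message

class SnsWriteV1:
    """Deprecated V1 facade delegating to V2, so fixes land in one place
    and users get a migration nudge without an immediate breaking change."""
    def __init__(self):
        self._delegate = SnsWriteV2()

    def write(self, message):
        warnings.warn("SnsWriteV1 is deprecated; migrate to SnsWriteV2",
                      DeprecationWarning, stacklevel=2)
        return self._delegate.write(message)
```

Whether this is feasible for the real IOs depends on how far the V1 and V2
configuration surfaces diverge.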

On Mon, Nov 25, 2019 at 1:46 PM Cam Mach  wrote:

> Hello Beam Devs,
>
> I have been working on the migration of Amazon Web Services IO connectors
> into the new AWS SDK for Java V2. The goal is to have an updated
> implementation aligned with the most recent AWS improvements. So far we
> have already migrated the connectors for AWS SNS, SQS and  DynamoDB.
>
> In the meantime, some contributions are still going into the V1 IOs. So
> far we have dealt with those by porting (or asking contributors to port)
> the changes into the V2 IOs as well, because we don't want the features of
> the two versions to diverge. But this may quickly become a maintenance
> issue, so we want to discuss a plan to stop supporting (deprecate) the V1
> IOs and encourage users to move to V2.
>
> Phase I (ASAP):
>
>    - Mark migrated AWS V1 IOs as deprecated
>    - Document migration path to V2
>
> Phase II (end of 2020):
>
>    - Decide a date or Beam release to remove the V1 IOs
>    - Send a notification to the community 3 months before we remove them
>    - Completely get rid of V1 IOs
>
> Please let me know what you think, or if you see any potential issues.
>
> Thanks,
> Cam Mach
>
>


Re: real real-time beam

2019-11-25 Thread Aaron Dixon
@Jan @Pablo Thank you

@Pablo In this case it's a single global windowed Combine/perKey, triggered
per element. Keys are few (client accounts) so they can live forever.

It looks like just by virtue of using a stateful ParDo I could get this
final execution to be "serialized" per key. (Then I could simply do the
compare-and-swap using Beam's state mechanism to keep track of the "latest
trigger timestamp" instead of having to orchestrate compare-and-swap in the
target store :thinking:.)
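Stripped of Beam specifics, the compare-and-swap idea above is just a per-key
"latest timestamp" guard. A minimal Python sketch (hypothetical names; the
dict stands in for a stateful DoFn's per-key ValueState plus the target
store):

```python
class CompareAndSwapSink:
    """Keeps, per key, the timestamp of the last accepted write so the
    target store never regresses to an older triggered value."""

    def __init__(self):
        self.latest_ts = {}  # key -> timestamp of last accepted write (the "state")
        self.store = {}      # stands in for the external db/cache

    def maybe_write(self, key, value, trigger_ts):
        last = self.latest_ts.get(key)
        if last is None or trigger_ts > last:
            self.latest_ts[key] = trigger_ts
            self.store[key] = value
            return True
        return False  # stale firing: older (or equal) timestamp, skipped

sink = CompareAndSwapSink()
sink.maybe_write("acct-1", "balance=10", 100)
sink.maybe_write("acct-1", "balance=7", 90)    # late/stale firing, ignored
sink.maybe_write("acct-1", "balance=12", 110)
```

If the final ParDo is stateful and therefore serialized per key, this check
can live entirely in pipeline state, with no compare-and-swap needed in the
target store itself.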



On Mon, Nov 25, 2019 at 4:14 PM Jan Lukavský  wrote:

> One addition, to make the list of options exhaustive, there is probably
> one more option
>
>   c) create a ParDo keyed by primary key of your sink, cache the last
> write in there and compare it locally, without the need to query the
> database
>
> It would still need some timer to clear values after watermark + allowed
> lateness, because otherwise you would have to cache your whole database
> on workers. But because you don't need actual ordering, you just need
> the most recent value (if I got it right) this might be an option.
>
> Jan
>
> On 11/25/19 10:53 PM, Jan Lukavský wrote:
> > Hi Aaron,
> >
> > maybe someone else will give another option, but if I understand
> > correctly what you want to solve, then you essentially have to do either:
> >
> >  a) use the compare & swap mechanism in the sink you described
> >
> >  b) use a buffer to buffer elements inside the outputting ParDo and
> > only output them when watermark passes (using a timer).
> >
> > There is actually an ongoing discussion about how to make option b)
> > user-friendly and part of Beam itself, but currently there is no
> > out-of-the-box solution for that.
> >
> > Jan
> >
> > On 11/25/19 10:27 PM, Aaron Dixon wrote:
> >> Suppose I trigger a Combine per-element (in a high-volume stream) and
> >> use a ParDo as a sink.
> >>
> >> I assume there is no guarantee about the order that my ParDo will see
> >> these triggers, especially as it processes in parallel, anyway.
> >>
> >> That said, my sink writes to a db or cache and I would not like the
> >> cache to ever regress its value to something "before" what it has
> >> already written.
> >>
> >> Is the best way to solve this problem to always write the event-time
> >> in the cache and do a compare-and-swap only updating the sink if the
> >> triggered value in-hand is later than the target value?
> >>
> >> Or is there a better way to guarantee that my ParDo sink will process
> >> elements in-order? (Eg, if I can give up per-event/real-time, then a
> >> delay-based trigger would probably be sufficient I imagine.)
> >>
> >> Thanks for advice!
>


Re: Full stream-stream join semantics

2019-11-25 Thread Kenneth Knowles
On Mon, Nov 25, 2019 at 1:56 PM Jan Lukavský  wrote:

> Hi Rui,
>
> > Hi Kenn, do you think a stateful-DoFn-based join can emit joined rows
> > that never need to be retracted, because in the stateful DoFn case the
> > joined rows will be controlled by timers and emitted only once? If so,
> > I will agree with it. Generally speaking, emitting only once is what
> > determines whether retractions are needed.
>
> that would imply buffering elements up until the watermark, then sorting,
> and so it reduces to option a) again, is that true? This also has to deal
> with allowed lateness; with allowed lateness greater than zero, there can
> still be multiple firings, and so retractions are needed.
>
Specifically, when I say "bi-temporal join" I mean unbounded-to-unbounded
join where one of the join conditions is that elements are within event
time distance d of one another. An element at time t will be saved until
time t + 2d and then garbage collected. Every matching pair can be emitted
immediately.
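A plain-Python sketch of that bi-temporal join (hypothetical, no Beam:
`watermark` is passed in explicitly and garbage collection is done inline,
where a real implementation would use timers and time-range state queries):

```python
class BiTemporalJoin:
    """Unbounded-to-unbounded join emitting every (left, right) pair whose
    event times are within distance d; each element lives until t + 2*d."""

    def __init__(self, d):
        self.d = d
        self.left = []     # live (event_ts, value) elements from the left stream
        self.right = []
        self.matches = []  # emitted pairs; never retracted

    def _gc(self, watermark):
        keep = lambda buf: [(t, v) for t, v in buf if t + 2 * self.d >= watermark]
        self.left, self.right = keep(self.left), keep(self.right)

    def add(self, side, ts, value, watermark):
        self._gc(watermark)
        mine, other = (self.left, self.right) if side == "L" else (self.right, self.left)
        mine.append((ts, value))
        # every matching pair can be emitted immediately
        for ots, ov in other:
            if abs(ts - ots) <= self.d:
                self.matches.append((value, ov) if side == "L" else (ov, value))

j = BiTemporalJoin(d=5)
j.add("L", 10, "a", watermark=0)
j.add("R", 12, "b", watermark=0)    # |10-12| <= 5 -> emit ("a", "b")
j.add("R", 40, "c", watermark=30)   # left "a" is past 10 + 2*5, GC'd first
```

No buffering until the watermark and no time sorting: pairs are emitted as
soon as both sides are seen, and an element is only held for the 2d window in
which a match could still arrive.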

In the triggered CoGBK + join-product implementation, you do need
retractions as a model concept. But you don't need full support, since they
only need to be shipped as deltas and only from the CoGBK to the
join-product transform where they are all consumed to create only positive
elements. Again a delay is not required; this yields correct results with
the "always" trigger.

Neither case requires waiting or time sorting a whole buffer. The
bi-temporal join requires something more, in a way, since you need to query
by time range and GC time prefixes.

Kenn

Jan
> On 11/25/19 10:17 PM, Rui Wang wrote:
>
>
>
> On Mon, Nov 25, 2019 at 11:29 AM Jan Lukavský  wrote:
>
>>
>> On 11/25/19 7:47 PM, Kenneth Knowles wrote:
>>
>>
>>
>> On Sun, Nov 24, 2019 at 12:57 AM Jan Lukavský  wrote:
>>
>>> I can put down a design document, but before that I need to clarify some
>>> things for me. I'm struggling to put all of this into a bigger picture.
>>> Sorry if the arguments are circulating, but I didn't notice any proposal of
>>> how to solve these. If anyone can disprove any of this logic it would be
>>> very much appreciated as I might be able to get from a dead end:
>>>
>>>  a) in the bi-temporal join you can either buffer until watermark, or
>>> emit false data that has to be retracted
>>>
>> This is not the case. A stateful DoFn based join can emit immediately
>> joined rows that will never need to be retracted. The need for retractions
>> has to do with CoGBK-based implementation of a join.
>>
>> I fail to see how this could work. If I emit joined rows immediately
>> without waiting for the watermark to pass, I can join two elements that don't
>> belong to each other, because an element with a lower time
>> distance can arrive later, one that should have been joined in place of the
>> previously emitted one. That is a wrong result that has to be retracted. Or
>> what am I missing?
>>
>
> Hi Kenn, do you think a stateful DoFn based join can emit joined rows that
> never need to be retracted because, in the stateful DoFn case, joined rows are
> controlled by timers and emitted only once? If so I will agree with
> it. Generally speaking, emitting only once is the factor that decides whether
> retraction is needed or not.
>
> In past brainstorming, even with retractions ready, streaming joins with
> windowing are likely to be implemented in a style of CoGBK + stateful
> DoFn.
>
>
>
> I suggest that you work out the definition of the join you are interested
>> in, with a good amount of mathematical rigor, and then consider the ways
>> you can implement it. That is where a design doc will probably clarify
>> things.
>>
>> Kenn
>>
>>  b) until retractions are 100% functional (and that is sort of holy grail
>>> for now), then the only solution is using a buffer holding data up to
>>> watermark *and then sort by event time*
>>>
>>  c) even if retractions were 100% functional, there would have to be
>>> special implementation for batch case, because otherwise this would simply
>>> blow up downstream processing with insanely many false additions and
>>> subsequent retractions
>>>
>>> Property b) means that if we want this feature now, we must sort by
>>> event time and there is no way around. Property c) shows that even in the
>>> future, we must make (in certain cases) distinction between batch and
>>> streaming code paths, which seems weird to me, but it might be an option.
>>> But still, there is no way to express this join in batch case, because it
>>> would require either buffering (up to) whole input on local worker (doesn't
>>> look like viable option) or provide a way in user code to signal the need
>>> for ordering of data inside GBK (and we are there again :)). Yes, we might
>>> shift this need from stateful dofn to GBK like
>>>
>>>  input.apply(GroupByKey.sorted())
>>>
>>> I cannot find a good reasoning why this would be better than giving this
>>> semantics to (stateful) ParDo.
>>>
>>> Maybe someone can help me out here?
>>>
>>> Jan
>>> On 11/24/19 5:05 AM, Kenneth Knowles 

Re: real real-time beam

2019-11-25 Thread Jan Lukavský
One addition, to make the list of options exhaustive, there is probably 
one more option


 c) create a ParDo keyed by primary key of your sink, cache the last 
write in there and compare it locally, without the need to query the 
database


It would still need some timer to clear values after watermark + allowed 
lateness, because otherwise you would have to cache your whole database 
on workers. But because you don't need actual ordering, you just need 
the most recent value (if I got it right), this might be an option.
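A minimal sketch of this option c), with invented names and plain dicts standing in for Beam per-key state and the cleanup timer:

```python
class KeyedCacheSink:
    """Per-key DoFn sketch: remember the event time of the last accepted
    write and drop out-of-order updates locally, so no compare-and-swap
    query against the database is needed. State is cleared once the
    watermark + allowed lateness passes, so the whole database is never
    cached on the workers."""

    def __init__(self, allowed_lateness):
        self.allowed_lateness = allowed_lateness
        self.last_write = {}  # key -> event time of the last accepted write
        self.db = {}          # stands in for the external sink

    def process(self, key, event_time, value):
        if event_time >= self.last_write.get(key, float("-inf")):
            self.last_write[key] = event_time
            self.db[key] = value  # only equal-or-newer values reach the sink

    def on_watermark(self, wm):
        # the cleanup "timer": forget keys past watermark + allowed lateness
        self.last_write = {k: t for k, t in self.last_write.items()
                           if t + self.allowed_lateness >= wm}

sink = KeyedCacheSink(allowed_lateness=5)
sink.process("k", 10, "new")
sink.process("k", 7, "stale")  # older event time: ignored, no regression
```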


Jan

On 11/25/19 10:53 PM, Jan Lukavský wrote:

Hi Aaron,

maybe someone else will give another option, but if I understand 
correctly what you want to solve, then you essentially have to do either:


 a) use the compare & swap mechanism in the sink you described

 b) use a buffer to buffer elements inside the outputting ParDo and 
only output them when watermark passes (using a timer).


There is actually an ongoing discussion about how to make option b) 
user-friendly and part of Beam itself, but currently there is no 
out-of-the-box solution for that.
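On the store side, option a) might look roughly like this (a sketch with invented names; a real database's conditional update would provide the atomicity that the lock only simulates):

```python
import threading

class CasStore:
    """Compare-and-swap sink sketch: store (event_time, value) per key and
    accept a write only if its event time is newer than the stored one, so
    a late or out-of-order pane can never regress the cached value."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def cas_put(self, key, event_time, value):
        with self._lock:  # stands in for the store's conditional update
            stored_time, _ = self._data.get(key, (float("-inf"), None))
            if event_time > stored_time:
                self._data[key] = (event_time, value)
                return True   # write accepted
            return False      # stale firing rejected

store = CasStore()
store.cas_put("k", 100, "v1")
accepted = store.cas_put("k", 90, "v0")  # out-of-order firing: rejected
```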


Jan

On 11/25/19 10:27 PM, Aaron Dixon wrote:
Suppose I trigger a Combine per-element (in a high-volume stream) and 
use a ParDo as a sink.


I assume there is no guarantee about the order that my ParDo will see 
these triggers, especially as it processes in parallel, anyway.


That said, my sink writes to a db or cache and I would not like the 
cache to ever regress its value to something "before" what it has 
already written.


Is the best way to solve this problem to always write the event-time 
in the cache and do a compare-and-swap only updating the sink if the 
triggered value in-hand is later than the target value?


Or is there a better way to guarantee that my ParDo sink will process 
elements in-order? (Eg, if I can give up per-event/real-time, then a 
delay-based trigger would probably be sufficient I imagine.)


Thanks for advice!


Re: [VOTE] Beam Mascot animal choice: vote for as many as you want

2019-11-25 Thread Mark Liu
[ ] Beaver
[ ] Hedgehog
[ ] Lemur
[ ] Owl
[ ] Salmon
[ ] Trout
[ ] Robot dinosaur
[ ] Firefly
[ ] Cuttlefish
[X] Dumbo Octopus
[ ] Angler fish

On Mon, Nov 25, 2019 at 1:22 PM David Cavazos  wrote:

> Hi Kenneth, I tried adding back the email addresses, but they weren't
> added on the existing responses, it would only add them on new ones. :(
>
> I've already made it not accept new responses.
>
> There are only 8 responses (2 mine, 1 my real vote and 1 empty test vote),
> so hopefully everyone who voted there can vote back here.
>
> On Sat, Nov 23, 2019 at 7:27 PM Kenneth Knowles  wrote:
>
>> David - if you can reconfigure the form so it is not anonymous (at least
>> to me) then I may be up for including those results in the tally. I don't
>> want to penalize those who voted via the form. But since there are now two
>> voting channels we have to dedupe or discard the form results. And I need
>> to be able to see which votes are PMC. Even if advisory, it does need to
>> move to a concluding vote, and PMC votes could be a tiebreaker of sorts.
>>
>> Kenn
>>
>> On Sat, Nov 23, 2019 at 7:17 PM Kenneth Knowles  wrote:
>>
>>> On Fri, Nov 22, 2019 at 10:24 AM Robert Bradshaw 
>>> wrote:
>>>
 On Thu, Nov 21, 2019 at 7:05 PM David Cavazos 
 wrote:

>
>
> I created this Google Form
> 
> if everyone is okay with it to make it easier to both vote and view the
> results :)
>

 Generally decisions, especially votes, for apache projects are supposed
 to happen on-list. I suppose this is more an advisory vote, but still
 probably makes sense to keep it here. .

>>>
>>> Indeed. Someone suggested a Google form before I started this, but I
>>> deliberately didn't use it. It doesn't add much and it puts the vote off
>>> list onto opaque and mutable third party infrastructure.
>>>
>>> If you voted on the form, please repeat it on thread so I can count it.
>>>
>>> Kenn
>>>
>>>
>>>
>>> import collections, pprint, re, requests
>>> thread = requests.get('https://lists.apache.org/api/thread.lua?id=ff60eabbf8349ba6951633869000356c2c2feb48bbff187cf3c60039@%3Cdev.beam.apache.org%3E').json()
>>> counts = collections.defaultdict(int)
>>> for email in thread['emails']:
>>>   body = requests.get('https://lists.apache.org/api/email.lua?id=%s' % email['mid']).json()['body']
>>>   for vote in re.findall(r'\n\s*\[\s*[xX]\s*\]\s*([a-zA-Z ]+)', body):
>>>     counts[vote] += 1
>>> pprint.pprint(sorted(counts.items(), key=lambda kv: kv[-1]))

 ...

 [('Beaver', 1),

  ('Capybara', 2),

  ('Trout', 2),

  ('Salmon', 4),

  ('Dumbo Octopus', 7),

  ('Robot dinosaur', 9),

  ('Hedgehog', 10),

  ('Cuttlefish', 11),

  ('Angler fish', 12),

  ('Lemur', 14),

  ('Owl', 15),

  ('Firefly', 17)]



>
> On Thu, Nov 21, 2019 at 6:18 PM Vinay Mayar <
> vinay.ma...@expanseinc.com> wrote:
>
>> [ ] Beaver
>> [ ] Hedgehog
>> [ ] Lemur
>> [ ] Owl
>> [ ] Salmon
>> [ ] Trout
>> [ ] Robot dinosaur
>> [ ] Firefly
>> [ ] Cuttlefish
>> [x] Dumbo Octopus
>> [ ] Angler fish
>>
>> On Thu, Nov 21, 2019 at 6:14 PM Chamikara Jayalath <
>> chamik...@google.com> wrote:
>>
>>> [X] Beaver
>>> [ ] Hedgehog
>>> [ ] Lemur
>>> [X] Owl
>>> [ ] Salmon
>>> [ ] Trout
>>> [ ] Robot dinosaur
>>> [ ] Firefly
>>> [X ] Cuttlefish
>>> [X ] Dumbo Octopus
>>> [ X] Angler fish
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Thu, Nov 21, 2019 at 1:43 PM Michał Walenia <
>>> michal.wale...@polidea.com> wrote:
>>>
 [X] Beaver
 [ ] Hedgehog
 [X] Lemur
 [X] Owl
 [ ] Salmon
 [ ] Trout
 [X] Robot dinosaur
 [X] Firefly
 [ ] Cuttlefish
 [ ] Dumbo Octopus
 [ ] Angler fish

 On Thu, Nov 21, 2019 at 1:11 PM Aizhamal Nurmamat kyzy <
 aizha...@apache.org> wrote:

> [ ] Beaver
> [X] Hedgehog
> [ ] Lemur
> [ ] Owl
> [ ] Salmon
> [ ] Trout
> [ ] Robot dinosaur
> [ ] Firefly
> [X] Cuttlefish
> [ ] Dumbo Octopus
> [ ] Angler fish
>
> On Thu, Nov 21, 2019 at 11:21 AM Robert Burke 
> wrote:
>
>> [ X] Beaver
>> [] Hedgehog
>> [ x] Lemur
>> [ X] Owl
>> [ ] Salmon
>> [ ] Trout
>> [ ] Robot dinosaur
>> [X ] Firefly
>> [ X] Cuttlefish
>> [x ] Dumbo Octopus
>> [X ] Angler fish
>>
>> On Thu, Nov 21, 2019, 9:33 AM Łukasz Gajowy <
>> lukasz.gaj...@gmail.com> wrote:
>>
>>> [ ] Beaver
>>> [ ] Hedgehog
>>> 

Re: Cleaning up Approximate Algorithms in Beam

2019-11-25 Thread Ahmet Altay
On Mon, Nov 18, 2019 at 10:57 AM Robert Bradshaw 
wrote:

> On Sun, Nov 17, 2019 at 5:16 PM Reza Rokni  wrote:
>
>> *Ahmet: FWIW, There is a python implementation only for this
>> version: 
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/stats.py#L38*
>> Eventually we will be able to make use of cross language transforms to
>> help with feature parity. Until then, are we ok with marking this
>> deprecated in python, even though we do not have another solution. Or leave
>> it as is in Python now, as it does not have sketch capability so can only
>> be used for outputting results directly from the pipeline.
>>
>
If it is our intention to add the capability eventually, IMO it makes sense
to mark the existing functionality deprecated in Python as well.


>> *Reuven: I think this is the sort of thing that has been experimental
>> forever, and therefore not experimental (e.g. the entire triggering API is
>> experimental as are all our file-based sinks). I think that many users use
>> this, and probably store the state implicitly in streaming pipelines.*
>> True, I have an old action item to try and go through and PR against
>> old @experimental annotations but need to find time. So for this
>> discussion, I guess this should be marked as deprecated if we change it
>> even though it's @experimental.
>>
>
> Agreed.
>
>
>> *Rob: I'm not following this--by naming things after their implementation
>> rather than their intent I think they will be harder to search for. *
>> This is to add to the name the implementation, after the intent. For
>> example ApproximateCountDistinctZetaSketch, I believe should be easy to
>> search for and it is clear which implementation is used. Allowing for a
>> potentially better implementation ApproximateCountDistinct.
>>
>
> OK, if we have both I'm more OK with that. This is better than the names
> like HllCount, which seems to be what was suggested.
>
> Another approach would be to have a required  parameter which is an enum
>> of the implementation options.
>> ApproximateCountDistinct.of().usingImpl(ZETA) ?
>>
>
> Ideally this could be an optional parameter, or possibly only required
> during update until we figure out a good way for the runner to plug this in
> appropreately.
>
> Rob/Kenn: On Combiner discussion, should we tie action items from the
>> needs of this thread to this larger discussion?
>>
>> Cheers
>> Reza
>>
>> On Fri, 15 Nov 2019 at 08:32, Robert Bradshaw 
>> wrote:
>>
>>> On Thu, Nov 14, 2019 at 1:06 AM Kenneth Knowles  wrote:
>>>
 Wow. Nice summary, yes. Major calls to action:

 0. Never allow a combiner that does not make the format of its state
 clear in its name/URN. The "update compatibility" problem makes their
 internal accumulator state essentially part of their public API. Combiners
 named for what they do are an inherent risk, since we might have a new way
 to do the same operation with different implementation-detail state.

>>>
>>> It seems this will make for a worse user experience, motivated solely by
>>> limitations in our implementation. I think we can do better. Hypothetical
>>> idea: what if upgrade required access to the original graph (or at least
>>> metadata about it) during construction? In this case an ApproximateDistinct
>>> could look at what was used last time and try to do the same, but be free
>>> to do something better when unconstrained. Another approach would be to
>>> encode several alternative expansions in the Beam graph and let the runner
>>> do the picking (based on prior submission). (Making the CombineFn, as
>>> opposed to the composite, have several alternatives seems harder to reason
>>> about, but maybe worth pursuing as well).
>>>
>>> This is not unique to Combiners, but any stateful DoFn, or composite
>>> operations with non-trivial internal structure (and coders). This has been
>>> discussed a lot, perhaps there are some ideas there we could borrow?
>>>
>>> And they will match search terms better, which is a major problem.

>>>
>>> I'm not following this--by naming things after their implementation
>>> rather than their intent I think they will be harder to search for.
>>>
>>>
 1. Point users to HllCount. This seems to be the best of the three.
 Does it have a name that is clear enough about the format of its state?
 Noting that its Java package name includes zetasketch, perhaps.
 2. Deprecate the others, at least. And remove them from e.g. Javadoc.

>>>
>>> +1
>>>
>>>
 On Wed, Nov 13, 2019 at 10:01 AM Reuven Lax  wrote:

>
>
> On Wed, Nov 13, 2019 at 9:58 AM Ahmet Altay  wrote:
>
>> Thank you for writing this summary.
>>
>> On Tue, Nov 12, 2019 at 6:35 PM Reza Rokni  wrote:
>>
>>> Hi everyone;
>>>
>>> TL/DR : Discussion on Beam's various Approximate Distinct Count
>>> 

Re: Full stream-stream join semantics

2019-11-25 Thread Jan Lukavský

Hi Rui,

> Hi Kenn, do you think a stateful DoFn based join can emit joined rows that
never need to be retracted because, in the stateful DoFn case, joined rows are
controlled by timers and emitted only once? If so I will agree with
it. Generally speaking, emitting only once is the factor that decides
whether retraction is needed or not.


that would imply buffering elements up until the watermark, then sorting, and
so it reduces to option a) again, is that true? This also has to deal
with allowed lateness, which would mean that with allowed lateness
greater than zero, there can still be multiple firings and so
retractions are needed.


Jan

On 11/25/19 10:17 PM, Rui Wang wrote:



On Mon, Nov 25, 2019 at 11:29 AM Jan Lukavský wrote:



On 11/25/19 7:47 PM, Kenneth Knowles wrote:



On Sun, Nov 24, 2019 at 12:57 AM Jan Lukavský <je...@seznam.cz> wrote:

I can put down a design document, but before that I need to
clarify some things for me. I'm struggling to put all of this
into a bigger picture. Sorry if the arguments are
circulating, but I didn't notice any proposal of how to solve
these. If anyone can disprove any of this logic it would be
very much appreciated as I might be able to get from a dead end:

 a) in the bi-temporal join you can either buffer until
watermark, or emit false data that has to be retracted

This is not the case. A stateful DoFn based join can emit
immediately joined rows that will never need to be retracted. The
need for retractions has to do with CoGBK-based implementation of
a join.

I fail to see how this could work. If I emit joined rows
immediately without waiting for the watermark to pass, I can join two
elements that don't belong to each other, because an element with a
lower time distance can arrive later, one that should have been
joined in place of the previously emitted one. That is a wrong
result that has to be retracted. Or what am I missing?


Hi Kenn, do you think a stateful DoFn based join can emit joined rows that
never need to be retracted because, in the stateful DoFn case, joined rows
are controlled by timers and emitted only once? If so I will agree
with it. Generally speaking, emitting only once is the factor that
decides whether retraction is needed or not.


In past brainstorming, even with retractions ready, streaming
joins with windowing are likely to be implemented in a style of CoGBK +
stateful DoFn.




I suggest that you work out the definition of the join you are
interested in, with a good amount of mathematical rigor, and then
consider the ways you can implement it. That is where a design
doc will probably clarify things.

Kenn

 b) until retractions are 100% functional (and that is sort
of holy grail for now), then the only solution is using a
buffer holding data up to watermark *and then sort by event time*

 c) even if retractions were 100% functional, there would
have to be special implementation for batch case, because
otherwise this would simply blow up downstream processing
with insanely many false additions and subsequent retractions

Property b) means that if we want this feature now, we must
sort by event time and there is no way around. Property c)
shows that even in the future, we must make (in certain
cases) distinction between batch and streaming code paths,
which seems weird to me, but it might be an option. But
still, there is no way to express this join in batch case,
because it would require either buffering (up to) whole input
on local worker (doesn't look like viable option) or provide
a way in user code to signal the need for ordering of data
inside GBK (and we are there again :)). Yes, we might shift
this need from stateful dofn to GBK like

 input.apply(GroupByKey.sorted())

I cannot find a good reasoning why this would be better than
giving this semantics to (stateful) ParDo.

Maybe someone can help me out here?

Jan

On 11/24/19 5:05 AM, Kenneth Knowles wrote:

I don't actually see how event time sorting simplifies this
case much. You still need to buffer elements until they can
no longer be matched in the join, and you still need to
query that buffer for elements that might match. The general
"bi-temporal join" (without sorting) requires one new state
type and then it has identical API, does not require any
novel data structures or reasoning, yields better latency
(no sort buffer delay), and discards less data (no sort
buffer cutoff; watermark is better). Perhaps a design
document about this specific case would clarify.

Kenn

On Fri, Nov 22, 2019 at 10:08 PM Jan Lukavský
<je...@seznam.cz> wrote:

I didn't 

Re: real real-time beam

2019-11-25 Thread Jan Lukavský

Hi Aaron,

maybe someone else will give another option, but if I understand 
correctly what you want to solve, then you essentially have to do either:


 a) use the compare & swap mechanism in the sink you described

 b) use a buffer to buffer elements inside the outputting ParDo and 
only output them when watermark passes (using a timer).


There is actually an ongoing discussion about how to make option b) 
user-friendly and part of Beam itself, but currently there is no 
out-of-the-box solution for that.


Jan

On 11/25/19 10:27 PM, Aaron Dixon wrote:
Suppose I trigger a Combine per-element (in a high-volume stream) and 
use a ParDo as a sink.


I assume there is no guarantee about the order that my ParDo will see 
these triggers, especially as it processes in parallel, anyway.


That said, my sink writes to a db or cache and I would not like the 
cache to ever regress its value to something "before" what it has 
already written.


Is the best way to solve this problem to always write the event-time 
in the cache and do a compare-and-swap only updating the sink if the 
triggered value in-hand is later than the target value?


Or is there a better way to guarantee that my ParDo sink will process 
elements in-order? (Eg, if I can give up per-event/real-time, then a 
delay-based trigger would probably be sufficient I imagine.)


Thanks for advice!


Re: real real-time beam

2019-11-25 Thread Pablo Estrada
If I understand correctly - your pipeline has some kind of windowing, and
on every trigger downstream of the combiner, the pipeline updates a cache
with a single, non-windowed value. Is that correct?

What are your keys for this pipeline? You could work this out with, as you
noted, a timer that fires periodically, and keeps some state with the value
that you want to update to the cache.
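One way to picture that timer-plus-state pattern (plain Python standing in for a stateful DoFn; all names here are hypothetical):

```python
class PeriodicFlushSink:
    """Per key, keep only the latest (event_time, value) in state; a
    periodically firing timer flushes it to the cache, so the cache sees at
    most one write per key per timer period, and never an older value."""

    def __init__(self):
        self.state = {}  # key -> (event_time, value), newest seen so far
        self.cache = {}  # stands in for the external cache

    def process(self, key, event_time, value):
        t, _ = self.state.get(key, (float("-inf"), None))
        if event_time > t:
            self.state[key] = (event_time, value)

    def on_timer(self, key):
        # would be registered to fire periodically in processing time
        if key in self.state:
            self.cache[key] = self.state[key][1]

sink = PeriodicFlushSink()
for t, v in [(1, "a"), (5, "b"), (3, "c")]:  # out-of-order arrivals
    sink.process("k", t, v)
sink.on_timer("k")  # flushes the newest value, "b"
```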

Is this a Python or Java pipeline? What is the runner?
Best
-P.

On Mon, Nov 25, 2019 at 1:27 PM Aaron Dixon  wrote:

> Suppose I trigger a Combine per-element (in a high-volume stream) and use
> a ParDo as a sink.
>
> I assume there is no guarantee about the order that my ParDo will see
> these triggers, especially as it processes in parallel, anyway.
>
> That said, my sink writes to a db or cache and I would not like the cache
> to ever regress its value to something "before" what it has already written.
>
> Is the best way to solve this problem to always write the event-time in
> the cache and do a compare-and-swap only updating the sink if the triggered
> value in-hand is later than the target value?
>
> Or is there a better way to guarantee that my ParDo sink will process
> elements in-order? (Eg, if I can give up per-event/real-time, then a
> delay-based trigger would probably be sufficient I imagine.)
>
> Thanks for advice!
>


[DISCUSS] AWS IOs V1 Deprecation Plan

2019-11-25 Thread Cam Mach
Hello Beam Devs,

I have been working on the migration of Amazon Web Services IO connectors
into the new AWS SDK for Java V2. The goal is to have an updated
implementation aligned with the most recent AWS improvements. So far we
have already migrated the connectors for AWS SNS, SQS and  DynamoDB.

In the meantime, some contributions are still going into the V1 IOs. So far we
have dealt with those by porting (or asking contributors to port) the
changes into the V2 IOs too, because we don’t want features of the two versions
to diverge, but this may quickly become a maintenance burden. We therefore want
to discuss a plan to stop supporting (deprecate) the V1 IOs and encourage users
to move to V2.

Phase I (ASAP):

   - Mark migrated AWS V1 IOs as deprecated
   - Document migration path to V2

Phase II (end of 2020):

   - Decide a date or Beam release to remove the V1 IOs
   - Send a notification to the community 3 months before we remove them
   - Completely get rid of V1 IOs


Please let me know what you think, or if you see any potential issues.

Thanks,
Cam Mach


real real-time beam

2019-11-25 Thread Aaron Dixon
Suppose I trigger a Combine per-element (in a high-volume stream) and use a
ParDo as a sink.

I assume there is no guarantee about the order that my ParDo will see these
triggers, especially as it processes in parallel, anyway.

That said, my sink writes to a db or cache and I would not like the cache
to ever regress its value to something "before" what it has already written.

Is the best way to solve this problem to always write the event-time in the
cache and do a compare-and-swap only updating the sink if the triggered
value in-hand is later than the target value?

Or is there a better way to guarantee that my ParDo sink will process
elements in-order? (Eg, if I can give up per-event/real-time, then a
delay-based trigger would probably be sufficient I imagine.)

Thanks for advice!


Re: [VOTE] Beam Mascot animal choice: vote for as many as you want

2019-11-25 Thread David Cavazos
Hi Kenneth, I tried adding back the email addresses, but they weren't added
on the existing responses, it would only add them on new ones. :(

I've already made it not accept new responses.

There are only 8 responses (2 mine, 1 my real vote and 1 empty test vote),
so hopefully everyone who voted there can vote back here.

On Sat, Nov 23, 2019 at 7:27 PM Kenneth Knowles  wrote:

> David - if you can reconfigure the form so it is not anonymous (at least
> to me) then I may be up for including those results in the tally. I don't
> want to penalize those who voted via the form. But since there are now two
> voting channels we have to dedupe or discard the form results. And I need
> to be able to see which votes are PMC. Even if advisory, it does need to
> move to a concluding vote, and PMC votes could be a tiebreaker of sorts.
>
> Kenn
>
> On Sat, Nov 23, 2019 at 7:17 PM Kenneth Knowles  wrote:
>
>> On Fri, Nov 22, 2019 at 10:24 AM Robert Bradshaw 
>> wrote:
>>
>>> On Thu, Nov 21, 2019 at 7:05 PM David Cavazos 
>>> wrote:
>>>


 I created this Google Form
 
 if everyone is okay with it to make it easier to both vote and view the
 results :)

>>>
>>> Generally decisions, especially votes, for apache projects are supposed
>>> to happen on-list. I suppose this is more an advisory vote, but still
>>> probably makes sense to keep it here. .
>>>
>>
>> Indeed. Someone suggested a Google form before I started this, but I
>> deliberately didn't use it. It doesn't add much and it puts the vote off
>> list onto opaque and mutable third party infrastructure.
>>
>> If you voted on the form, please repeat it on thread so I can count it.
>>
>> Kenn
>>
>>
>>
>>> import collections, pprint, re, requests
>>> thread = requests.get('https://lists.apache.org/api/thread.lua?id=ff60eabbf8349ba6951633869000356c2c2feb48bbff187cf3c60039@%3Cdev.beam.apache.org%3E').json()
>>> counts = collections.defaultdict(int)
>>> for email in thread['emails']:
>>>   body = requests.get('https://lists.apache.org/api/email.lua?id=%s' % email['mid']).json()['body']
>>>   for vote in re.findall(r'\n\s*\[\s*[xX]\s*\]\s*([a-zA-Z ]+)', body):
>>>     counts[vote] += 1
>>> pprint.pprint(sorted(counts.items(), key=lambda kv: kv[-1]))
>>>
>>> ...
>>>
>>> [('Beaver', 1),
>>>
>>>  ('Capybara', 2),
>>>
>>>  ('Trout', 2),
>>>
>>>  ('Salmon', 4),
>>>
>>>  ('Dumbo Octopus', 7),
>>>
>>>  ('Robot dinosaur', 9),
>>>
>>>  ('Hedgehog', 10),
>>>
>>>  ('Cuttlefish', 11),
>>>
>>>  ('Angler fish', 12),
>>>
>>>  ('Lemur', 14),
>>>
>>>  ('Owl', 15),
>>>
>>>  ('Firefly', 17)]
>>>
>>>
>>>

 On Thu, Nov 21, 2019 at 6:18 PM Vinay Mayar 
 wrote:

> [ ] Beaver
> [ ] Hedgehog
> [ ] Lemur
> [ ] Owl
> [ ] Salmon
> [ ] Trout
> [ ] Robot dinosaur
> [ ] Firefly
> [ ] Cuttlefish
> [x] Dumbo Octopus
> [ ] Angler fish
>
> On Thu, Nov 21, 2019 at 6:14 PM Chamikara Jayalath <
> chamik...@google.com> wrote:
>
>> [X] Beaver
>> [ ] Hedgehog
>> [ ] Lemur
>> [X] Owl
>> [ ] Salmon
>> [ ] Trout
>> [ ] Robot dinosaur
>> [ ] Firefly
>> [X ] Cuttlefish
>> [X ] Dumbo Octopus
>> [ X] Angler fish
>>
>> Thanks,
>> Cham
>>
>> On Thu, Nov 21, 2019 at 1:43 PM Michał Walenia <
>> michal.wale...@polidea.com> wrote:
>>
>>> [X] Beaver
>>> [ ] Hedgehog
>>> [X] Lemur
>>> [X] Owl
>>> [ ] Salmon
>>> [ ] Trout
>>> [X] Robot dinosaur
>>> [X] Firefly
>>> [ ] Cuttlefish
>>> [ ] Dumbo Octopus
>>> [ ] Angler fish
>>>
>>> On Thu, Nov 21, 2019 at 1:11 PM Aizhamal Nurmamat kyzy <
>>> aizha...@apache.org> wrote:
>>>
 [ ] Beaver
 [X] Hedgehog
 [ ] Lemur
 [ ] Owl
 [ ] Salmon
 [ ] Trout
 [ ] Robot dinosaur
 [ ] Firefly
 [X] Cuttlefish
 [ ] Dumbo Octopus
 [ ] Angler fish

 On Thu, Nov 21, 2019 at 11:21 AM Robert Burke 
 wrote:

> [ X] Beaver
> [] Hedgehog
> [ x] Lemur
> [ X] Owl
> [ ] Salmon
> [ ] Trout
> [ ] Robot dinosaur
> [X ] Firefly
> [ X] Cuttlefish
> [x ] Dumbo Octopus
> [X ] Angler fish
>
> On Thu, Nov 21, 2019, 9:33 AM Łukasz Gajowy <
> lukasz.gaj...@gmail.com> wrote:
>
>> [ ] Beaver
>> [ ] Hedgehog
>> [x] Lemur
>> [x] Owl
>> [ ] Salmon
>> [ ] Trout
>> [x] Robot dinosaur!
>> [ ] Firefly
>> [ ] Cuttlefish
>> [ ] Dumbo Octopus
>> [ ] Angler fish
>>
>> czw., 21 lis 2019 o 00:44 Augustin Lafanechere <
>> augustin.lafanech...@kapten.com> napisał(a):
>>
>>> [ ] Beaver
>>> [ ] Hedgehog
>>> [ 

Re: Full stream-stream join semantics

2019-11-25 Thread Rui Wang
On Mon, Nov 25, 2019 at 11:29 AM Jan Lukavský  wrote:

>
> On 11/25/19 7:47 PM, Kenneth Knowles wrote:
>
>
>
> On Sun, Nov 24, 2019 at 12:57 AM Jan Lukavský  wrote:
>
>> I can put down a design document, but before that I need to clarify some
>> things for me. I'm struggling to put all of this into a bigger picture.
>> Sorry if the arguments are circulating, but I didn't notice any proposal of
>> how to solve these. If anyone can disprove any of this logic it would be
>> very much appreciated as I might be able to get from a dead end:
>>
>>  a) in the bi-temporal join you can either buffer until watermark, or
>> emit false data that has to be retracted
>>
> This is not the case. A stateful DoFn based join can emit immediately
> joined rows that will never need to be retracted. The need for retractions
> has to do with CoGBK-based implementation of a join.
>
> I fail to see how this could work. If I emit joined rows immediately
> without waiting for the watermark to pass, I can join two elements that don't
> belong to each other, because an element with a lower time
> distance can arrive later, one that should have been joined in place of the
> previously emitted one. That is a wrong result that has to be retracted. Or
> what am I missing?
>

Hi Kenn, do you think a stateful DoFn based join can emit joined rows that never
need to be retracted because, in the stateful DoFn case, joined rows are
controlled by timers and emitted only once? If so I will agree with
it. Generally speaking, emitting only once is the factor that decides whether
retraction is needed or not.

In past brainstorming, even with retractions ready, streaming joins
with windowing are likely to be implemented in a style of CoGBK + stateful
DoFn.



I suggest that you work out the definition of the join you are interested
> in, with a good amount of mathematical rigor, and then consider the ways
> you can implement it. That is where a design doc will probably clarify
> things.
>
> Kenn
>
>  b) until retractions are 100% functional (and that is sort of holy grail
>> for now), then the only solution is using a buffer holding data up to
>> watermark *and then sort by event time*
>>
>  c) even if retractions were 100% functional, there would have to be
>> special implementation for batch case, because otherwise this would simply
>> blow up downstream processing with insanely many false additions and
>> subsequent retractions
>>
>> Property b) means that if we want this feature now, we must sort by event
>> time and there is no way around. Property c) shows that even in the future,
>> we must make (in certain cases) distinction between batch and streaming
>> code paths, which seems weird to me, but it might be an option. But still,
>> there is no way to express this join in batch case, because it would
>> require either buffering (up to) whole input on local worker (doesn't look
>> like viable option) or provide a way in user code to signal the need for
>> ordering of data inside GBK (and we are there again :)). Yes, we might
>> shift this need from stateful dofn to GBK like
>>
>>  input.apply(GroupByKey.sorted())
>>
>> I cannot find a good reasoning why this would be better than giving this
>> semantics to (stateful) ParDo.
>>
>> Maybe someone can help me out here?
>>
>> Jan
>> On 11/24/19 5:05 AM, Kenneth Knowles wrote:
>>
>> I don't actually see how event time sorting simplifies this case much.
>> You still need to buffer elements until they can no longer be matched in
>> the join, and you still need to query that buffer for elements that might
>> match. The general "bi-temporal join" (without sorting) requires one new
>> state type and then it has identical API, does not require any novel data
>> structures or reasoning, yields better latency (no sort buffer delay), and
>> discards less data (no sort buffer cutoff; watermark is better). Perhaps a
>> design document about this specific case would clarify.
>>
>> Kenn
>>
>> On Fri, Nov 22, 2019 at 10:08 PM Jan Lukavský  wrote:
>>
>>> I didn't want to go too much into detail, but to describe the idea
>>> roughly (ignoring the problem of different window fns on both sides to keep
>>> it as simple as possible):
>>>
>>> rhs -  \
>>>
>>> flatten (on global window)  stateful par do (sorted
>>> by event time)   output
>>>
>>> lhs -  /
>>>
>>> If we can guarantee event time order arrival of events into the stateful
>>> pardo, then the whole complexity reduces to keep current value of left and
>>> right element and just flush them out each time there is an update. That is
>>> the "knob" is actually when watermark moves, because it is what tells the
>>> join operation that there will be no more (not late) input. This is very,
>>> very simplified, but depicts the solution. The "classical" windowed join
>>> reduces to this if all data in each window is projected onto window end
>>> boundary. Then there will be a cartesian product, because all the elements
>>> have the same timestamp. I can put this into a design 

Re: Failed retrieving service account

2019-11-25 Thread Yifan Zou
Hi,

I've looked into this issue and found that the default service account was
removed during the weekend for some reason (see the log viewer).
I restored the default service account. All workers should be back to
normal and jobs are passing now.

-yifan

On Mon, Nov 25, 2019 at 11:17 AM Tomo Suzuki  wrote:

> Thank you for looking into this.
>
> On Mon, Nov 25, 2019 at 12:59 PM Yifan Zou  wrote:
>
>> Greetings,
>>
>> We're seeing some tests encountering permission issues such as *'Failed
>> to retrieve
>> http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/844138762903-comp...@developer.gserviceaccount.com/token
>> 
>> from the Google Compute Engine metadata service. Status: 404
>> Response:\nb\'"The service account was not found.'*
>>
>> I am looking into it. We might need to reboot some build workers to
>> restore the service account access. I'll try to make as little impact as
>> possible on current running jobs.
>>
>> -yifan
>>
>
>
> --
> Regards,
> Tomo
>


Re: [VOTE] Beam Mascot animal choice: vote for as many as you want

2019-11-25 Thread Mikhail Gryzykhin
[ ] Beaver
[X] Hedgehog
[ ] Lemur
[X] Owl
[ ] Salmon
[ ] Trout
[X] Robot dinosaur
[ ] Firefly
[ ] Cuttlefish
[ ] Dumbo Octopus
[ ] Angler fish
[X] Honey Badger


Re: Full stream-stream join semantics

2019-11-25 Thread Jan Lukavský


On 11/25/19 7:47 PM, Kenneth Knowles wrote:



On Sun, Nov 24, 2019 at 12:57 AM Jan Lukavský wrote:


I can put down a design document, but before that I need to
clarify some things for me. I'm struggling to put all of this into
a bigger picture. Sorry if the arguments are circulating, but I
didn't notice any proposal of how to solve these. If anyone can
disprove any of this logic it would be very much appreciated as I
might be able to get from a dead end:

 a) in the bi-temporal join you can either buffer until watermark,
or emit false data that has to be retracted

This is not the case. A stateful DoFn based join can immediately emit
joined rows that will never need to be retracted. The need for
retractions has to do with the CoGBK-based implementation of a join.
I fail to see how this could work. If I emit joined rows immediately,
without waiting for the watermark to pass, I can join two elements that
don't belong together, because an element with a smaller time distance can
arrive later that should have been joined in place of the previously
emitted one. That is a wrong result that has to be retracted. Or what am I
missing?


I suggest that you work out the definition of the join you are 
interested in, with a good amount of mathematical rigor, and then 
consider the ways you can implement it. That is where a design doc 
will probably clarify things.


Kenn

 b) until retractions are 100% functional (and that is sort of
holy grail for now), then the only solution is using a buffer
holding data up to watermark *and then sort by event time*

 c) even if retractions were 100% functional, there would have to
be special implementation for batch case, because otherwise this
would simply blow up downstream processing with insanely many
false additions and subsequent retractions

Property b) means that if we want this feature now, we must sort
by event time and there is no way around. Property c) shows that
even in the future, we must make (in certain cases) distinction
between batch and streaming code paths, which seems weird to me,
but it might be an option. But still, there is no way to express
this join in batch case, because it would require either buffering
(up to) whole input on local worker (doesn't look like viable
option) or provide a way in user code to signal the need for
ordering of data inside GBK (and we are there again :)). Yes, we
might shift this need from stateful dofn to GBK like

 input.apply(GroupByKey.sorted())

I cannot find a good reasoning why this would be better than
giving this semantics to (stateful) ParDo.

Maybe someone can help me out here?

Jan

On 11/24/19 5:05 AM, Kenneth Knowles wrote:

I don't actually see how event time sorting simplifies this case
much. You still need to buffer elements until they can no longer
be matched in the join, and you still need to query that buffer
for elements that might match. The general "bi-temporal join"
(without sorting) requires one new state type and then it has
identical API, does not require any novel data structures or
reasoning, yields better latency (no sort buffer delay), and
discards less data (no sort buffer cutoff; watermark is better).
Perhaps a design document about this specific case would clarify.

Kenn
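To make the buffering Kenn describes concrete, here is a toy, single-key model of a bi-temporal join — a sketch only, not Beam API: both sides are buffered, pairs within a maximum event-time distance are matched, and a buffered element is dropped once the watermark guarantees nothing can still match it. The element shape and the `max_distance` parameter are assumptions made up for the example.

```python
def bitemporal_join(left, right, max_distance, watermark):
    # left/right: lists of (event_time, value) for a single key.
    # Match every cross-side pair whose event times are within max_distance.
    matches = [(lt, lv, rt, rv)
               for lt, lv in left
               for rt, rv in right
               if abs(lt - rt) <= max_distance]
    # A buffered element can be garbage-collected once the watermark
    # guarantees no future element could still match it.
    left_kept = [(t, v) for t, v in left if t + max_distance >= watermark]
    right_kept = [(t, v) for t, v in right if t + max_distance >= watermark]
    return matches, left_kept, right_kept
```

Note there is no sort buffer here: matches can be emitted as elements arrive, and the watermark only controls state cleanup, which is the latency argument made above.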

On Fri, Nov 22, 2019 at 10:08 PM Jan Lukavský <je...@seznam.cz> wrote:

I didn't want to go too much into detail, but to describe the
idea roughly (ignoring the problem of different window fns on
both sides to keep it as simple as possible):

rhs -  \

    flatten (on global window)  stateful par
do (sorted by event time)   output

lhs -  /

If we can guarantee event time order arrival of events into
the stateful pardo, then the whole complexity reduces to keep
current value of left and right element and just flush them
out each time there is an update. That is the "knob" is
actually when watermark moves, because it is what tells the
join operation that there will be no more (not late) input.
This is very, very simplified, but depicts the solution. The
"classical" windowed join reduces to this if all data in each
window is projected onto window end boundary. Then there will
be a cartesian product, because all the elements have the
same timestamp. I can put this into a design doc with all the
details, I was trying to find out if there is or was any
effort around this.
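The reduction described above — keep only the current value of each side and flush the pair on every update, relying on event-time-ordered arrival — can be modeled outside Beam in a few lines. This is a toy sketch, not a Beam transform; the element shape and the left/right tie-breaking are assumptions made for the example.

```python
def sorted_stream_join(left, right):
    # left/right: lists of (event_time, key, value).
    # Merge both sides and process strictly in event-time order,
    # as if the stateful ParDo received its input already sorted.
    merged = sorted([(ts, k, 'L', v) for ts, k, v in left] +
                    [(ts, k, 'R', v) for ts, k, v in right])
    state = {}   # key -> [latest_left, latest_right]
    out = []
    for ts, k, side, v in merged:
        cur = state.setdefault(k, [None, None])
        cur[0 if side == 'L' else 1] = v
        # Flush the current (left, right) pair on every update.
        out.append((ts, k, cur[0], cur[1]))
    return out
```

With sorted input, per-element state really is just the two latest values; the whole difficulty lives in guaranteeing the ordering.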

I was in touch with Reza in the PR #9032, I think that it
currently suffers from problems with running this on batch.

I think I can even (partly) resolve the retraction issue (for
joins), as described on the thread [1]. Shortly, there can be
 

Re: [VOTE] Beam Mascot animal choice: vote for as many as you want

2019-11-25 Thread Daniel Oliveira
I'm also a bit late to the party.

[ ] Beaver
[ ] Hedgehog
[X] Lemur
[X] Owl
[ ] Salmon
[ ] Trout
[ ] Robot dinosaur
[X] Firefly
[X] Cuttlefish
[X] Dumbo Octopus
[ ] Angler fish

On Sun, Nov 24, 2019 at 8:37 AM Matthias Baetens 
wrote:

> In case I'm not too late:
>
> [ ] Beaver
> [ ] Hedgehog
> [ ] Lemur
> [ ] Owl
> [ ] Salmon
> [ ] Trout
> [ ] Robot dinosaur
> [X ] Firefly
> [ ] Cuttlefish
> [X ] Dumbo Octopus
> [ ] Angler fish
>
> I like angler fish a lot, but I think no one will join any meetups since
> they're scary as hell haha
>
>
> On Sun, Nov 24, 2019, 04:27 Kenneth Knowles  wrote:
>
>> David - if you can reconfigure the form so it is not anonymous (at least
>> to me) then I may be up for including those results in the tally. I don't
>> want to penalize those who voted via the form. But since there are now two
>> voting channels we have to dedupe or discard the form results. And I need
>> to be able to see which votes are PMC. Even if advisory, it does need to
>> move to a concluding vote, and PMC votes could be a tiebreaker of sorts.
>>
>> Kenn
>>
>> On Sat, Nov 23, 2019 at 7:17 PM Kenneth Knowles  wrote:
>>
>>> On Fri, Nov 22, 2019 at 10:24 AM Robert Bradshaw 
>>> wrote:
>>>
 On Thu, Nov 21, 2019 at 7:05 PM David Cavazos 
 wrote:

>
>
> I created this Google Form
> 
> if everyone is okay with it to make it easier to both vote and view the
> results :)
>

 Generally decisions, especially votes, for apache projects are supposed
 to happen on-list. I suppose this is more an advisory vote, but still
 probably makes sense to keep it here. .

>>>
>>> Indeed. Someone suggested a Google form before I started this, but I
>>> deliberately didn't use it. It doesn't add much and it puts the vote off
>>> list onto opaque and mutable third party infrastructure.
>>>
>>> If you voted on the form, please repeat it on thread so I can count it.
>>>
>>> Kenn
>>>
>>>
>>>
import collections, pprint, re, requests
thread = requests.get(
    'https://lists.apache.org/api/thread.lua?id=ff60eabbf8349ba6951633869000356c2c2feb48bbff187cf3c60039@%3Cdev.beam.apache.org%3E').json()
counts = collections.defaultdict(int)
for email in thread['emails']:
  body = requests.get('https://lists.apache.org/api/email.lua?id=%s'
                      % email['mid']).json()['body']
  for vote in re.findall(r'\n\s*\[\s*[xX]\s*\]\s*([a-zA-Z ]+)', body):
    counts[vote] += 1
pprint.pprint(sorted(counts.items(), key=lambda kv: kv[-1]))

 ...

[('Beaver', 1),
 ('Capybara', 2),
 ('Trout', 2),
 ('Salmon', 4),
 ('Dumbo Octopus', 7),
 ('Robot dinosaur', 9),
 ('Hedgehog', 10),
 ('Cuttlefish', 11),
 ('Angler fish', 12),
 ('Lemur', 14),
 ('Owl', 15),
 ('Firefly', 17)]



>
> On Thu, Nov 21, 2019 at 6:18 PM Vinay Mayar <
> vinay.ma...@expanseinc.com> wrote:
>
>> [ ] Beaver
>> [ ] Hedgehog
>> [ ] Lemur
>> [ ] Owl
>> [ ] Salmon
>> [ ] Trout
>> [ ] Robot dinosaur
>> [ ] Firefly
>> [ ] Cuttlefish
>> [x] Dumbo Octopus
>> [ ] Angler fish
>>
>> On Thu, Nov 21, 2019 at 6:14 PM Chamikara Jayalath <
>> chamik...@google.com> wrote:
>>
>>> [X] Beaver
>>> [ ] Hedgehog
>>> [ ] Lemur
>>> [X] Owl
>>> [ ] Salmon
>>> [ ] Trout
>>> [ ] Robot dinosaur
>>> [ ] Firefly
>>> [X ] Cuttlefish
>>> [X ] Dumbo Octopus
>>> [ X] Angler fish
>>>
>>> Thanks,
>>> Cham
>>>
>>> On Thu, Nov 21, 2019 at 1:43 PM Michał Walenia <
>>> michal.wale...@polidea.com> wrote:
>>>
 [X] Beaver
 [ ] Hedgehog
 [X] Lemur
 [X] Owl
 [ ] Salmon
 [ ] Trout
 [X] Robot dinosaur
 [X] Firefly
 [ ] Cuttlefish
 [ ] Dumbo Octopus
 [ ] Angler fish

 On Thu, Nov 21, 2019 at 1:11 PM Aizhamal Nurmamat kyzy <
 aizha...@apache.org> wrote:

> [ ] Beaver
> [X] Hedgehog
> [ ] Lemur
> [ ] Owl
> [ ] Salmon
> [ ] Trout
> [ ] Robot dinosaur
> [ ] Firefly
> [X] Cuttlefish
> [ ] Dumbo Octopus
> [ ] Angler fish
>
> On Thu, Nov 21, 2019 at 11:21 AM Robert Burke 
> wrote:
>
>> [ X] Beaver
>> [] Hedgehog
>> [ x] Lemur
>> [ X] Owl
>> [ ] Salmon
>> [ ] Trout
>> [ ] Robot dinosaur
>> [X ] Firefly
>> [ X] Cuttlefish
>> [x ] Dumbo Octopus
>> [X ] Angler fish
>>
>> On Thu, Nov 21, 2019, 9:33 AM Łukasz Gajowy <
>> lukasz.gaj...@gmail.com> wrote:
>>
>>> [ ] Beaver
>>> [ ] Hedgehog

Re: Failed retrieving service account

2019-11-25 Thread Tomo Suzuki
Thank you for looking into this.

On Mon, Nov 25, 2019 at 12:59 PM Yifan Zou  wrote:

> Greetings,
>
> We're seeing some tests encountering permission issues such as *'Failed
> to retrieve
> http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/844138762903-comp...@developer.gserviceaccount.com/token
> 
> from the Google Compute Engine metadata service. Status: 404
> Response:\nb\'"The service account was not found.'*
>
> I am looking into it. We might need to reboot some build workers to
> restore the service account access. I'll try to make as little impact as
> possible on current running jobs.
>
> -yifan
>


-- 
Regards,
Tomo


Re: Full stream-stream join semantics

2019-11-25 Thread Kenneth Knowles
On Sun, Nov 24, 2019 at 12:57 AM Jan Lukavský  wrote:

> I can put down a design document, but before that I need to clarify some
> things for me. I'm struggling to put all of this into a bigger picture.
> Sorry if the arguments are circulating, but I didn't notice any proposal of
> how to solve these. If anyone can disprove any of this logic it would be
> very much appreciated as I might be able to get from a dead end:
>
>  a) in the bi-temporal join you can either buffer until watermark, or emit
> false data that has to be retracted
>
This is not the case. A stateful DoFn based join can immediately emit
joined rows that will never need to be retracted. The need for retractions
has to do with the CoGBK-based implementation of a join.

I suggest that you work out the definition of the join you are interested
in, with a good amount of mathematical rigor, and then consider the ways
you can implement it. That is where a design doc will probably clarify
things.

Kenn

 b) until retractions are 100% functional (and that is sort of holy grail
> for now), then the only solution is using a buffer holding data up to
> watermark *and then sort by event time*
>
 c) even if retractions were 100% functional, there would have to be
> special implementation for batch case, because otherwise this would simply
> blow up downstream processing with insanely many false additions and
> subsequent retractions
>
> Property b) means that if we want this feature now, we must sort by event
> time and there is no way around. Property c) shows that even in the future,
> we must make (in certain cases) distinction between batch and streaming
> code paths, which seems weird to me, but it might be an option. But still,
> there is no way to express this join in batch case, because it would
> require either buffering (up to) whole input on local worker (doesn't look
> like viable option) or provide a way in user code to signal the need for
> ordering of data inside GBK (and we are there again :)). Yes, we might
> shift this need from stateful dofn to GBK like
>
>  input.apply(GroupByKey.sorted())
>
> I cannot find a good reasoning why this would be better than giving this
> semantics to (stateful) ParDo.
>
> Maybe someone can help me out here?
>
> Jan
> On 11/24/19 5:05 AM, Kenneth Knowles wrote:
>
> I don't actually see how event time sorting simplifies this case much. You
> still need to buffer elements until they can no longer be matched in the
> join, and you still need to query that buffer for elements that might
> match. The general "bi-temporal join" (without sorting) requires one new
> state type and then it has identical API, does not require any novel data
> structures or reasoning, yields better latency (no sort buffer delay), and
> discards less data (no sort buffer cutoff; watermark is better). Perhaps a
> design document about this specific case would clarify.
>
> Kenn
>
> On Fri, Nov 22, 2019 at 10:08 PM Jan Lukavský  wrote:
>
>> I didn't want to go too much into detail, but to describe the idea
>> roughly (ignoring the problem of different window fns on both sides to keep
>> it as simple as possible):
>>
>> rhs -  \
>>
>> flatten (on global window)  stateful par do (sorted
>> by event time)   output
>>
>> lhs -  /
>>
>> If we can guarantee event time order arrival of events into the stateful
>> pardo, then the whole complexity reduces to keep current value of left and
>> right element and just flush them out each time there is an update. That is
>> the "knob" is actually when watermark moves, because it is what tells the
>> join operation that there will be no more (not late) input. This is very,
>> very simplified, but depicts the solution. The "classical" windowed join
>> reduces to this if all data in each window is projected onto window end
>> boundary. Then there will be a cartesian product, because all the elements
>> have the same timestamp. I can put this into a design doc with all the
>> details, I was trying to find out if there is or was any effort around this.
>>
>> I was in touch with Reza in the PR #9032, I think that it currently
>> suffers from problems with running this on batch.
>>
>> I think I can even (partly) resolve the retraction issue (for joins), as
>> described on the thread [1]. Shortly, there can be two copies of the
>> stateful dofn, one running at watermark and the other at (watermark -
>> allowed lateness). One would produce ON_TIME (maybe wrong) results, the
>> other would produce LATE but correct ones. Being able to compare them, the
>> outcome would be that it would be possible to retract the wrong results.
>>
>> Yes, this is also about providing more evidence of why I think event-time
>> sorting should be (somehow) part of the model. :-)
>>
>> Jan
>>
>> [1]
>> https://lists.apache.org/thread.html/a736ebd6b0913d2e06dfa54797716d022adf2c45140530d5c3bd538d@%3Cdev.beam.apache.org%3E
>> On 11/23/19 5:54 AM, Kenneth Knowles wrote:
>>
>> +Mikhail Gryzykhin  +Rui Wang  
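Jan's two-copy idea above — one stateful DoFn firing at the watermark (possibly wrong ON_TIME results) and a second firing at watermark minus allowed lateness (correct but LATE results) — amounts to diffing the two outputs per key. A toy sketch of that comparison, with made-up names and dict-shaped results purely for illustration:

```python
def diff_for_retractions(on_time, corrected):
    # on_time:   key -> result emitted at the watermark (possibly wrong)
    # corrected: key -> result emitted at watermark - allowed lateness
    # Keys whose on-time result disagrees with the corrected one must be
    # retracted; corrected results not already emitted are additions.
    retractions = [(k, v) for k, v in on_time.items() if corrected.get(k) != v]
    additions = [(k, v) for k, v in corrected.items() if on_time.get(k) != v]
    return retractions, additions
```

The cost of the scheme is keeping both copies of state alive for the allowed-lateness window, which is the trade-off being debated in this thread.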

Re: [Portability] Turn off artifact staging?

2019-11-25 Thread Kyle Weaver
Ah didn't see your pull request yet Thomas. Will take a look later.

On Mon, Nov 25, 2019 at 10:23 AM Thomas Weise  wrote:

> Thanks, I would prefer to solve this in a way where the user does not need
> to configure anything extra though.
>
>
> On Mon, Nov 25, 2019 at 10:21 AM Kyle Weaver  wrote:
>
>> When we added the class loader artifact stager, we introduced artifact
>> retrieval service type as a pipeline option. It would make sense to put a
>> "none" option there.
>>
>>
>> https://github.com/apache/beam/blob/5fd93af49e6cb86ff52b20f103371df7e0447b7f/sdks/java/core/src/main/java/org/apache/beam/sdk/options/PortablePipelineOptions.java#L107
>>
>>   RetrievalServiceType getRetrievalServiceType();
>>
>>
>> On Mon, Nov 25, 2019 at 10:05 AM Robert Bradshaw 
>> wrote:
>>
>>> boot.go could be updated to recognize NO_ARTIFACTS_STAGED_TOKEN as
>>> well. (Should this constant be put in a common location?)
>>>
>>> On Sat, Nov 23, 2019 at 9:16 AM Thomas Weise  wrote:
>>> >
>>> > JIRA: https://issues.apache.org/jira/browse/BEAM-8815
>>> >
>>> >
>>> > On Fri, Nov 22, 2019 at 5:31 PM Thomas Weise  wrote:
>>> >>
>>> >> I'm running into the issue Kyle points out when I try to run a
>>> pipeline that does not use artifact staging:
>>> >>
>>> >> 2019-11-23 01:09:18,442 WARN
>>> org.apache.beam.runners.fnexecution.artifact.AbstractArtifactRetrievalService
>>> - GetManifest for
>>> /tmp/beam-artifact-staging/job_53cad419-a8c0-472c-8486-f795cc88a80f/MANIFEST
>>> failed.
>>> >> java.util.concurrent.ExecutionException:
>>> java.io.FileNotFoundException:
>>> /tmp/beam-artifact-staging/job_53cad419-a8c0-472c-8486-f795cc88a80f/MANIFEST
>>> (No such file or directory)
>>> >> at
>>> org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:531)
>>> >> at
>>> org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:492)
>>> >> at
>>> org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:83)
>>> >> at
>>> org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:196)
>>> >> at
>>> org.apache.beam.vendor.guava.v26_0_jre.com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2312)
>>> >>
>>> >> This happens when I use /opt/apache/beam/boot to start the worker in
>>> process environment, as it will attempt to retrieve artifacts. The same
>>> would be the case for worker pool also.
>>> >>
>>> >> Thomas
>>> >>
>>> >>
>>> >> On Tue, Nov 12, 2019 at 5:07 PM Robert Bradshaw 
>>> wrote:
>>> >>>
>>> >>> FWIW, there are also discussions of adding a preparation phase for
>>> sdk
>>> >>> harness (docker) images, such that artifacts could be staged (and
>>> >>> installed, compiled etc.) ahead of time and shipped as part of the
>>> sdk
>>> >>> image rather than via a side channel (and on every worker). Anyone
>>> not
>>> >>> using these images is probably shipping dependencies in another way
>>> >>> anyways.
>>> >>>
>>> >>> On Tue, Nov 12, 2019 at 5:03 PM Robert Bradshaw 
>>> wrote:
>>> >>> >
>>> >>> > Certainly there's a lot to be re-thought in terms of artifact
>>> staging,
>>> >>> > especially when it comes to cross-language pipelines. I think it
>>> would
>>> >>> > make sense to have a special retrieval token for the "empty"
>>> >>> > manifest, which would mean a staging directory would never have to
>>> be
>>> >>> > set up if no artifacts happened to be staged.
>>> >>> >
>>> >>> > The UberJar avoids any artifact staging overhead as well.
>>> >>> >
>>> >>> > On Tue, Nov 12, 2019 at 3:30 PM Kyle Weaver 
>>> wrote:
>>> >>> > >
>>> >>> > > Hi Beamers,
>>> >>> > >
>>> >>> > > We can use artifact staging to make sure SDK workers have access
>>> to a pipeline's dependencies. However, artifact staging is not always
>>> necessary. For example, one can make sure that the environment contains all
>>> the dependencies ahead of time. However, regardless of whether or not
>>> artifacts are used, my understanding is an artifact manifest will be
>>> written and read anyway. For example:
>>> >>> > >
>>> >>> > > INFO AbstractArtifactRetrievalService: GetManifest for
>>> /tmp/beam-artifact-staging/.../MANIFEST -> 0 artifacts
>>> >>> > >
>>> >>> > > This can be a hassle, because users must set up a staging
>>> directory that all workers can access, even if it isn't used aside from the
>>> (empty) manifest [1]. Thomas mentioned that at Lyft they bypass artifact
>>> staging altogether [2]. So I was wondering, do you all think it would be
>>> reasonable or useful to create an "off switch" for artifact staging?
>>> >>> > >
>>> >>> > > Thanks,
>>> >>> > > Kyle
>>> >>> > >
>>> >>> > > [1]
>>> https://lists.apache.org/thread.html/d293b4158f266be1cb6c99c968535706f491fdfcd4bb20c4e30939bb@%3Cdev.beam.apache.org%3E
>>> >>> > > [2]
>>> 

Re: [Portability] Turn off artifact staging?

2019-11-25 Thread Thomas Weise
Thanks, I would prefer to solve this in a way where the user does not need
to configure anything extra though.


On Mon, Nov 25, 2019 at 10:21 AM Kyle Weaver  wrote:

> When we added the class loader artifact stager, we introduced artifact
> retrieval service type as a pipeline option. It would make sense to put a
> "none" option there.
>
>
> https://github.com/apache/beam/blob/5fd93af49e6cb86ff52b20f103371df7e0447b7f/sdks/java/core/src/main/java/org/apache/beam/sdk/options/PortablePipelineOptions.java#L107
>
>   RetrievalServiceType getRetrievalServiceType();
>
>
> On Mon, Nov 25, 2019 at 10:05 AM Robert Bradshaw 
> wrote:
>
>> boot.go could be updated to recognize NO_ARTIFACTS_STAGED_TOKEN as
>> well. (Should this constant be put in a common location?)
>>
>> On Sat, Nov 23, 2019 at 9:16 AM Thomas Weise  wrote:
>> >
>> > JIRA: https://issues.apache.org/jira/browse/BEAM-8815
>> >
>> >
>> > On Fri, Nov 22, 2019 at 5:31 PM Thomas Weise  wrote:
>> >>
>> >> I'm running into the issue Kyle points out when I try to run a
>> pipeline that does not use artifact staging:
>> >>
>> >> 2019-11-23 01:09:18,442 WARN
>> org.apache.beam.runners.fnexecution.artifact.AbstractArtifactRetrievalService
>> - GetManifest for
>> /tmp/beam-artifact-staging/job_53cad419-a8c0-472c-8486-f795cc88a80f/MANIFEST
>> failed.
>> >> java.util.concurrent.ExecutionException:
>> java.io.FileNotFoundException:
>> /tmp/beam-artifact-staging/job_53cad419-a8c0-472c-8486-f795cc88a80f/MANIFEST
>> (No such file or directory)
>> >> at
>> org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:531)
>> >> at
>> org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:492)
>> >> at
>> org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:83)
>> >> at
>> org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:196)
>> >> at
>> org.apache.beam.vendor.guava.v26_0_jre.com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2312)
>> >>
>> >> This happens when I use /opt/apache/beam/boot to start the worker in
>> process environment, as it will attempt to retrieve artifacts. The same
>> would be the case for worker pool also.
>> >>
>> >> Thomas
>> >>
>> >>
>> >> On Tue, Nov 12, 2019 at 5:07 PM Robert Bradshaw 
>> wrote:
>> >>>
>> >>> FWIW, there are also discussions of adding a preparation phase for sdk
>> >>> harness (docker) images, such that artifacts could be staged (and
>> >>> installed, compiled etc.) ahead of time and shipped as part of the sdk
>> >>> image rather than via a side channel (and on every worker). Anyone not
>> >>> using these images is probably shipping dependencies in another way
>> >>> anyways.
>> >>>
>> >>> On Tue, Nov 12, 2019 at 5:03 PM Robert Bradshaw 
>> wrote:
>> >>> >
>> >>> > Certainly there's a lot to be re-thought in terms of artifact
>> staging,
>> >>> > especially when it comes to cross-language pipelines. I think it
>> would
>> >>> > make sense to have a special retrieval token for the "empty"
>> >>> > manifest, which would mean a staging directory would never have to
>> be
>> >>> > set up if no artifacts happened to be staged.
>> >>> >
>> >>> > The UberJar avoids any artifact staging overhead as well.
>> >>> >
>> >>> > On Tue, Nov 12, 2019 at 3:30 PM Kyle Weaver 
>> wrote:
>> >>> > >
>> >>> > > Hi Beamers,
>> >>> > >
>> >>> > > We can use artifact staging to make sure SDK workers have access
>> to a pipeline's dependencies. However, artifact staging is not always
>> necessary. For example, one can make sure that the environment contains all
>> the dependencies ahead of time. However, regardless of whether or not
>> artifacts are used, my understanding is an artifact manifest will be
>> written and read anyway. For example:
>> >>> > >
>> >>> > > INFO AbstractArtifactRetrievalService: GetManifest for
>> /tmp/beam-artifact-staging/.../MANIFEST -> 0 artifacts
>> >>> > >
>> >>> > > This can be a hassle, because users must set up a staging
>> directory that all workers can access, even if it isn't used aside from the
>> (empty) manifest [1]. Thomas mentioned that at Lyft they bypass artifact
>> staging altogether [2]. So I was wondering, do you all think it would be
>> reasonable or useful to create an "off switch" for artifact staging?
>> >>> > >
>> >>> > > Thanks,
>> >>> > > Kyle
>> >>> > >
>> >>> > > [1]
>> https://lists.apache.org/thread.html/d293b4158f266be1cb6c99c968535706f491fdfcd4bb20c4e30939bb@%3Cdev.beam.apache.org%3E
>> >>> > > [2]
>> https://issues.apache.org/jira/browse/BEAM-5187?focusedCommentId=16972715=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16972715
>>
>


Re: [Portability] Turn off artifact staging?

2019-11-25 Thread Kyle Weaver
When we added the class loader artifact stager, we introduced artifact
retrieval service type as a pipeline option. It would make sense to put a
"none" option there.

https://github.com/apache/beam/blob/5fd93af49e6cb86ff52b20f103371df7e0447b7f/sdks/java/core/src/main/java/org/apache/beam/sdk/options/PortablePipelineOptions.java#L107

  RetrievalServiceType getRetrievalServiceType();


On Mon, Nov 25, 2019 at 10:05 AM Robert Bradshaw 
wrote:

> boot.go could be updated to recognize NO_ARTIFACTS_STAGED_TOKEN as
> well. (Should this constant be put in a common location?)
>
> On Sat, Nov 23, 2019 at 9:16 AM Thomas Weise  wrote:
> >
> > JIRA: https://issues.apache.org/jira/browse/BEAM-8815
> >
> >
> > On Fri, Nov 22, 2019 at 5:31 PM Thomas Weise  wrote:
> >>
> >> I'm running into the issue Kyle points out when I try to run a pipeline
> that does not use artifact staging:
> >>
> >> 2019-11-23 01:09:18,442 WARN
> org.apache.beam.runners.fnexecution.artifact.AbstractArtifactRetrievalService
> - GetManifest for
> /tmp/beam-artifact-staging/job_53cad419-a8c0-472c-8486-f795cc88a80f/MANIFEST
> failed.
> >> java.util.concurrent.ExecutionException: java.io.FileNotFoundException:
> /tmp/beam-artifact-staging/job_53cad419-a8c0-472c-8486-f795cc88a80f/MANIFEST
> (No such file or directory)
> >> at
> org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:531)
> >> at
> org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:492)
> >> at
> org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:83)
> >> at
> org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:196)
> >> at
> org.apache.beam.vendor.guava.v26_0_jre.com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2312)
> >>
> >> This happens when I use /opt/apache/beam/boot to start the worker in
> process environment, as it will attempt to retrieve artifacts. The same
> would be the case for worker pool also.
> >>
> >> Thomas
> >>
> >>
> >> On Tue, Nov 12, 2019 at 5:07 PM Robert Bradshaw 
> wrote:
> >>>
> >>> FWIW, there are also discussions of adding a preparation phase for sdk
> >>> harness (docker) images, such that artifacts could be staged (and
> >>> installed, compiled etc.) ahead of time and shipped as part of the sdk
> >>> image rather than via a side channel (and on every worker). Anyone not
> >>> using these images is probably shipping dependencies in another way
> >>> anyways.
> >>>
> >>> On Tue, Nov 12, 2019 at 5:03 PM Robert Bradshaw 
> wrote:
> >>> >
> >>> > Certainly there's a lot to be re-thought in terms of artifact
> staging,
> >>> > especially when it comes to cross-language pipelines. I think it
> would
> >>> > make sense to have a special retrieval token for the "empty"
> >>> > manifest, which would mean a staging directory would never have to be
> >>> > set up if no artifacts happened to be staged.
> >>> >
> >>> > The UberJar avoids any artifact staging overhead as well.
> >>> >
> >>> > On Tue, Nov 12, 2019 at 3:30 PM Kyle Weaver 
> wrote:
> >>> > >
> >>> > > Hi Beamers,
> >>> > >
> >>> > > We can use artifact staging to make sure SDK workers have access
> to a pipeline's dependencies. However, artifact staging is not always
> necessary. For example, one can make sure that the environment contains all
> the dependencies ahead of time. However, regardless of whether or not
> artifacts are used, my understanding is an artifact manifest will be
> written and read anyway. For example:
> >>> > >
> >>> > > INFO AbstractArtifactRetrievalService: GetManifest for
> /tmp/beam-artifact-staging/.../MANIFEST -> 0 artifacts
> >>> > >
> >>> > > This can be a hassle, because users must set up a staging
> directory that all workers can access, even if it isn't used aside from the
> (empty) manifest [1]. Thomas mentioned that at Lyft they bypass artifact
> staging altogether [2]. So I was wondering, do you all think it would be
> reasonable or useful to create an "off switch" for artifact staging?
> >>> > >
> >>> > > Thanks,
> >>> > > Kyle
> >>> > >
> >>> > > [1]
> https://lists.apache.org/thread.html/d293b4158f266be1cb6c99c968535706f491fdfcd4bb20c4e30939bb@%3Cdev.beam.apache.org%3E
> >>> > > [2]
> https://issues.apache.org/jira/browse/BEAM-5187?focusedCommentId=16972715=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16972715
>


Re: [Portability] Turn off artifact staging?

2019-11-25 Thread Robert Bradshaw
boot.go could be updated to recognize NO_ARTIFACTS_STAGED_TOKEN as
well. (Should this constant be put in a common location?)

On Sat, Nov 23, 2019 at 9:16 AM Thomas Weise  wrote:
>
> JIRA: https://issues.apache.org/jira/browse/BEAM-8815
>
>
> On Fri, Nov 22, 2019 at 5:31 PM Thomas Weise  wrote:
>>
>> I'm running into the issue Kyle points out when I try to run a pipeline that 
>> does not use artifact staging:
>>
>> 2019-11-23 01:09:18,442 WARN  
>> org.apache.beam.runners.fnexecution.artifact.AbstractArtifactRetrievalService
>>   - GetManifest for 
>> /tmp/beam-artifact-staging/job_53cad419-a8c0-472c-8486-f795cc88a80f/MANIFEST 
>> failed.
>> java.util.concurrent.ExecutionException: java.io.FileNotFoundException: 
>> /tmp/beam-artifact-staging/job_53cad419-a8c0-472c-8486-f795cc88a80f/MANIFEST 
>> (No such file or directory)
>> at 
>> org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:531)
>> at 
>> org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:492)
>> at 
>> org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:83)
>> at 
>> org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:196)
>> at 
>> org.apache.beam.vendor.guava.v26_0_jre.com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2312)
>>
>> This happens when I use /opt/apache/beam/boot to start the worker in process 
>> environment, as it will attempt to retrieve artifacts. The same would be the 
>> case for worker pool also.
>>
>> Thomas
>>
>>
>> On Tue, Nov 12, 2019 at 5:07 PM Robert Bradshaw  wrote:
>>>
>>> FWIW, there are also discussions of adding a preparation phase for sdk
>>> harness (docker) images, such that artifacts could be staged (and
>>> installed, compiled etc.) ahead of time and shipped as part of the sdk
>>> image rather than via a side channel (and on every worker). Anyone not
>>> using these images is probably shipping dependencies in another way
>>> anyways.
>>>
>>> On Tue, Nov 12, 2019 at 5:03 PM Robert Bradshaw  wrote:
>>> >
>>> > Certainly there's a lot to be re-thought in terms of artifact staging,
>>> > especially when it comes to cross-language pipelines. I think it would
>>> > make sense to have a special retrieval token for the "empty"
>>> > manifest, which would mean a staging directory would never have to be
>>> > set up if no artifacts happened to be staged.
>>> >
>>> > The UberJar avoids any artifact staging overhead as well.
>>> >
>>> > On Tue, Nov 12, 2019 at 3:30 PM Kyle Weaver  wrote:
>>> > >
>>> > > Hi Beamers,
>>> > >
>>> > > We can use artifact staging to make sure SDK workers have access to a 
>>> > > pipeline's dependencies. However, artifact staging is not always 
>>> > > necessary. For example, one can make sure that the environment contains 
>>> > > all the dependencies ahead of time. However, regardless of whether or 
>>> > > not artifacts are used, my understanding is an artifact manifest will 
>>> > > be written and read anyway. For example:
>>> > >
>>> > > INFO AbstractArtifactRetrievalService: GetManifest for 
>>> > > /tmp/beam-artifact-staging/.../MANIFEST -> 0 artifacts
>>> > >
>>> > > This can be a hassle, because users must set up a staging directory 
>>> > > that all workers can access, even if it isn't used aside from the 
>>> > > (empty) manifest [1]. Thomas mentioned that at Lyft they bypass 
>>> > > artifact staging altogether [2]. So I was wondering, do you all think 
>>> > > it would be reasonable or useful to create an "off switch" for artifact 
>>> > > staging?
>>> > >
>>> > > Thanks,
>>> > > Kyle
>>> > >
>>> > > [1] 
>>> > > https://lists.apache.org/thread.html/d293b4158f266be1cb6c99c968535706f491fdfcd4bb20c4e30939bb@%3Cdev.beam.apache.org%3E
>>> > > [2] 
>>> > > https://issues.apache.org/jira/browse/BEAM-5187?focusedCommentId=16972715&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16972715

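Robert's suggestion above (boot.go recognizing a NO_ARTIFACTS_STAGED_TOKEN sentinel so workers never touch the staging directory) could be sketched roughly as follows, shown in Python for brevity. The constant value and the helper names here are assumptions for illustration, not the actual Beam implementation:

```python
# Hypothetical sketch of the "empty manifest" sentinel-token idea: a worker
# that sees the sentinel skips manifest retrieval entirely, so no staging
# directory has to exist for pipelines that stage no artifacts.
# NO_ARTIFACTS_STAGED_TOKEN and resolve_artifacts are illustrative names.

NO_ARTIFACTS_STAGED_TOKEN = "__no_artifacts_staged__"

def resolve_artifacts(retrieval_token, fetch_manifest):
    """Return the artifact list for a retrieval token.

    fetch_manifest reads the MANIFEST file from the staging directory; it is
    only called when artifacts were actually staged, avoiding the
    FileNotFoundException reported above for artifact-free pipelines.
    """
    if retrieval_token == NO_ARTIFACTS_STAGED_TOKEN:
        return []  # nothing staged: never touch the filesystem
    return fetch_manifest(retrieval_token)
```

With a check like this in both the job server and the worker boot code, the runner could hand out the sentinel whenever the manifest would be empty, and workers in any environment (docker, process, external) would skip the GetManifest call.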

Failed retrieving service account

2019-11-25 Thread Yifan Zou
Greetings,

We're seeing some tests encountering permission issues such as *'Failed to
retrieve
http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/844138762903-comp...@developer.gserviceaccount.com/token

from the Google Compute Engine metadata service. Status: 404
Response:\nb\'"The service account was not found.'*

I am looking onto it. We might need to reboot some build workers to restore
the service account access. I'll try to make as little impact as possible
on current running jobs.

-yifan


Re: [ANNOUNCE] New committer: Daniel Oliveira

2019-11-25 Thread Mark Liu
Congratulations, Daniel!

On Mon, Nov 25, 2019 at 9:31 AM Ahmet Altay  wrote:

> Congratulations, Daniel!
>
> On Sat, Nov 23, 2019 at 3:47 AM jincheng sun 
> wrote:
>
>>
>> Congrats, Daniel!
>> Best,
>> Jincheng
>>
>> Alexey Romanenko  wrote on Fri, Nov 22, 2019 at 5:47 PM:
>>
>>> Congratulations, Daniel!
>>>
>>> On 22 Nov 2019, at 09:18, Jan Lukavský  wrote:
>>>
>>> Congrats Daniel!
>>> On 11/21/19 10:11 AM, Gleb Kanterov wrote:
>>>
>>> Congratulations!
>>>
>>> On Thu, Nov 21, 2019 at 6:24 AM Thomas Weise  wrote:
>>>
 Congratulations!


 On Wed, Nov 20, 2019, 7:56 PM Chamikara Jayalath 
 wrote:

> Congrats!!
>
> On Wed, Nov 20, 2019 at 5:21 PM Daniel Oliveira <
> danolive...@google.com> wrote:
>
>> Thank you everyone! I won't let you down. o7
>>
>> On Wed, Nov 20, 2019 at 2:12 PM Ruoyun Huang 
>> wrote:
>>
>>> Congrats Daniel!
>>>
>>> On Wed, Nov 20, 2019 at 1:58 PM Robert Burke 
>>> wrote:
>>>
 Congrats Daniel! Much deserved.

 On Wed, Nov 20, 2019, 12:49 PM Udi Meiri  wrote:

> Congrats Daniel!
>
> On Wed, Nov 20, 2019 at 12:42 PM Kyle Weaver 
> wrote:
>
>> Congrats Dan! Keep up the good work :)
>>
>> On Wed, Nov 20, 2019 at 12:41 PM Cyrus Maden 
>> wrote:
>>
>>> Congratulations! This is great news.
>>>
>>> On Wed, Nov 20, 2019 at 3:24 PM Rui Wang 
>>> wrote:
>>>
 Congrats!


 -Rui

 On Wed, Nov 20, 2019 at 11:48 AM Valentyn Tymofieiev <
 valen...@google.com> wrote:

> Congrats, Daniel!
>
> On Wed, Nov 20, 2019 at 11:47 AM Kenneth Knowles <
> k...@apache.org> wrote:
>
>> Hi all,
>>
>> Please join me and the rest of the Beam PMC in welcoming a
>> new committer: Daniel Oliveira
>>
>> Daniel introduced himself to dev@ over two years ago and has
>> contributed in many ways since then. Daniel has contributed to 
>> general
>> project health, the portability framework, and all three 
>> languages: Java,
>> Python SDK, and Go. I would like to particularly highlight how 
>> he deleted
>> 12k lines of dead reference runner code [1].
>>
>> In consideration of Daniel's contributions, the Beam PMC
>> trusts him with the responsibilities of a Beam committer [2].
>>
>> Thank you, Daniel, for your contributions and looking forward
>> to many more!
>>
>> Kenn, on behalf of the Apache Beam PMC
>>
>> [1] https://github.com/apache/beam/pull/8380
>> [2]
>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>
>
>>>
>>> --
>>> 
>>> Ruoyun  Huang
>>>
>>>
>>>


Re: [ANNOUNCE] New committer: Daniel Oliveira

2019-11-25 Thread Ahmet Altay
Congratulations, Daniel!

On Sat, Nov 23, 2019 at 3:47 AM jincheng sun 
wrote:

>
> Congrats, Daniel!
> Best,
> Jincheng
>
> Alexey Romanenko  wrote on Fri, Nov 22, 2019 at 5:47 PM:
>
>> Congratulations, Daniel!
>>
>> On 22 Nov 2019, at 09:18, Jan Lukavský  wrote:
>>
>> Congrats Daniel!
>> On 11/21/19 10:11 AM, Gleb Kanterov wrote:
>>
>> Congratulations!
>>
>> On Thu, Nov 21, 2019 at 6:24 AM Thomas Weise  wrote:
>>
>>> Congratulations!
>>>
>>>
>>> On Wed, Nov 20, 2019, 7:56 PM Chamikara Jayalath 
>>> wrote:
>>>
 Congrats!!

 On Wed, Nov 20, 2019 at 5:21 PM Daniel Oliveira 
 wrote:

> Thank you everyone! I won't let you down. o7
>
> On Wed, Nov 20, 2019 at 2:12 PM Ruoyun Huang 
> wrote:
>
>> Congrats Daniel!
>>
>> On Wed, Nov 20, 2019 at 1:58 PM Robert Burke 
>> wrote:
>>
>>> Congrats Daniel! Much deserved.
>>>
>>> On Wed, Nov 20, 2019, 12:49 PM Udi Meiri  wrote:
>>>
 Congrats Daniel!

 On Wed, Nov 20, 2019 at 12:42 PM Kyle Weaver 
 wrote:

> Congrats Dan! Keep up the good work :)
>
> On Wed, Nov 20, 2019 at 12:41 PM Cyrus Maden 
> wrote:
>
>> Congratulations! This is great news.
>>
>> On Wed, Nov 20, 2019 at 3:24 PM Rui Wang 
>> wrote:
>>
>>> Congrats!
>>>
>>>
>>> -Rui
>>>
>>> On Wed, Nov 20, 2019 at 11:48 AM Valentyn Tymofieiev <
>>> valen...@google.com> wrote:
>>>
 Congrats, Daniel!

 On Wed, Nov 20, 2019 at 11:47 AM Kenneth Knowles <
 k...@apache.org> wrote:

> Hi all,
>
> Please join me and the rest of the Beam PMC in welcoming a new
> committer: Daniel Oliveira
>
> Daniel introduced himself to dev@ over two years ago and has
> contributed in many ways since then. Daniel has contributed to 
> general
> project health, the portability framework, and all three 
> languages: Java,
> Python SDK, and Go. I would like to particularly highlight how he 
> deleted
> 12k lines of dead reference runner code [1].
>
> In consideration of Daniel's contributions, the Beam PMC
> trusts him with the responsibilities of a Beam committer [2].
>
> Thank you, Daniel, for your contributions and looking forward
> to many more!
>
> Kenn, on behalf of the Apache Beam PMC
>
> [1] https://github.com/apache/beam/pull/8380
> [2]
> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>

>>
>> --
>> 
>> Ruoyun  Huang
>>
>>
>>


Re: Triggers still finish and drop all data

2019-11-25 Thread Kenneth Knowles
And another:
https://stackoverflow.com/questions/55748746/issues-with-dynamic-destinations-in-dataflow

On Thu, Nov 14, 2019 at 1:35 AM Kenneth Knowles  wrote:

>
>
> On Fri, Nov 8, 2019 at 9:44 AM Steve Niemitz  wrote:
>
>> Yeah that looks like what I had in mind too.  I think the most useful
>> notification output would be a KV of (K, summary)?
>>
>
> Sounds about right. Some use cases may not care about the summary, but
> just the notification. But for most runners passing extra in-memory data to
> a subsequent projection which drops it is essentially free.
>
> Kenn
>
>
>> On Fri, Nov 8, 2019 at 12:38 PM Kenneth Knowles  wrote:
>>
>>> This sounds like a useful feature, if I understand it: a generic
>>> transform (build on a generic stateful DoFn) where the end-user provides a
>>> monotonic predicate over the input it has seen. It emits a notification
>>> exactly once when the predicate is first satisfied. To be efficient, it
>>> will also need some form of summarization over the input seen.
>>>
>>> Notify
>>>   .withSummarizer(combineFn)
>>>   .withPredicate(summary -> ...)
>>>
>>> Something like that? The complexity is not much less than just writing a
>>> stateful DoFn directly, but the boilerplate is much less.
>>>
>>> Kenn
>>>
>>> On Thu, Nov 7, 2019 at 2:02 PM Steve Niemitz 
>>> wrote:
>>>
 Interestingly enough, we just had a use case come up that I think could
 have been solved by finishing triggers.

 Basically, we want to emit a notification when a certain threshold is
 reached (in this case, we saw at least N elements for a given key), and
 then never notify again within that window.  As mentioned, we can
 accomplish this using a stateful DoFn as mentioned above, but I thought it
 was interesting that this just came up, and wanted to share.

 Maybe it'd be worth building something to simulate this into the SDK?

 On Mon, Nov 4, 2019 at 8:15 PM Kenneth Knowles  wrote:

> By the way, adding this guard uncovered two bugs in Beam's Java
> codebase, luckily only benchmarks and tests. There were *no* non-buggy
> instances of a finishing trigger. They both declare allowed lateness that
> is never used.
>
> Nexmark query 10:
>
> // Clear fancy triggering from above.
> .apply(
> Window.>into(...)
> .triggering(AfterWatermark.pastEndOfWindow())
> // We expect no late data here, but we'll assume the
> worst so we can detect any.
> .withAllowedLateness(Duration.standardDays(1))
> .discardingFiredPanes())
>
> This is nonsensical: the trigger will fire once and close, never
> firing again. So the allowed lateness has no effect except to change
> counters from "dropped due to lateness" to "dropped due to trigger
> closing". The intent would appear to be to restore the default triggering,
> but it failed.
>
> PipelineTranslationTest:
>
>
>  Window.into(FixedWindows.of(Duration.standardMinutes(7)))
> .triggering(
> AfterWatermark.pastEndOfWindow()
>
> .withEarlyFirings(AfterPane.elementCountAtLeast(19)))
> .accumulatingFiredPanes()
> .withAllowedLateness(Duration.standardMinutes(3L)));
>
> Again, the allowed lateness has no effect. This test is just to test
> portable proto round-trip. But still it is odd to write a nonsensical
> pipeline for this.
>
> Takeaway: experienced Beam developers never use this pattern, but they
> still get it wrong and create pipelines that would have data loss bugs
> because of it.
>
> Since there is no other discussion here, I will trust the community is
> OK with this change and follow Jan's review of my implementation of his
> idea.
>
> Kenn
>
>
> On Thu, Oct 31, 2019 at 4:06 PM Kenneth Knowles 
> wrote:
>
>> Opened https://github.com/apache/beam/pull/9960 for this idea. This
>> will alert users to broken pipelines and force them to alter them.
>>
>> Kenn
>>
>> On Thu, Oct 31, 2019 at 2:12 PM Kenneth Knowles 
>> wrote:
>>
>>> On Thu, Oct 31, 2019 at 2:11 AM Jan Lukavský 
>>> wrote:
>>>
 Hi Kenn,

 does there still remain some use for trigger to finish? If we don't
 drop
 data, would it still be of any use to users? If not, would it be
 better
 to just remove the functionality completely, so that users who use
 it
 (and it will possibly break for them) are aware of it at compile
 time?

 Jan

>>>
>>> Good point. I believe there is no good use for a top-level trigger
>>> finishing. As mentioned, the intended uses aren't really met by 
>>> triggers,
>>> but are met by stateful DoFn.
>>>
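The Notify transform Kenn sketches in this thread (a per-key summarizer plus a monotonic predicate, emitting a notification exactly once when the predicate first holds) can be modeled outside Beam with ordinary per-key state. The following is a plain-Python illustration of the proposed semantics under assumed names, not Beam API:

```python
# Illustrative model of the proposed Notify transform: a per-key summary
# (here a simple element count) is accumulated by a summarizer, and a
# notification is emitted exactly once, the first time a monotonic
# predicate over that summary becomes true. All names are hypothetical.

class Notify:
    def __init__(self, predicate, summarize=lambda summary, elem: summary + 1):
        self.predicate = predicate    # monotonic predicate over the summary
        self.summarize = summarize    # combines summary with a new element
        self.summaries = {}           # per-key accumulated summary
        self.notified = set()         # keys that have already fired

    def process(self, key, element):
        """Feed one element; return (key, summary) the first time the
        predicate is satisfied for this key, otherwise None."""
        summary = self.summarize(self.summaries.get(key, 0), element)
        self.summaries[key] = summary
        if key not in self.notified and self.predicate(summary):
            self.notified.add(key)
            return (key, summary)
        return None
```

In a real Beam implementation the summary would presumably live in a CombiningState and the fired flag in a ValueState, scoped per key and window, so the exactly-once guarantee holds across bundles; as noted above, the complexity is close to writing the stateful DoFn directly, but the boilerplate is much less.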

Re: Beam Dependency Check Report (2019-11-25)

2019-11-25 Thread Pablo Estrada
+Yifan Zou  : )

On Mon, Nov 25, 2019 at 5:35 AM Tomo Suzuki  wrote:

> Can anybody take action on this error?
>
> > The service account was not found. The instance must be restarted via
> the Compute Engine API to restore service account access.
>
> Regards,
> Tomo
>
> On Mon, Nov 25, 2019 at 07:04 Apache Jenkins Server <
> jenk...@builds.apache.org> wrote:
>
>> ('Failed to retrieve
>> http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/844138762903-comp...@developer.gserviceaccount.com/token
>> from the Google Compute Engine metadata service. Status: 404
>> Response:\nb\'"The service account was not found. The instance must be
>> restarted via the Compute Engine API to restore service account
>> access."\'', )
>>
>> ('Failed to retrieve
>> http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/844138762903-comp...@developer.gserviceaccount.com/token
>> from the Google Compute Engine metadata service. Status: 404
>> Response:\nb\'"The service account was not found. The instance must be
>> restarted via the Compute Engine API to restore service account
>> access."\'', )
>> A dependency update is high priority if it satisfies one of following
>> criteria:
>>
>>- It has major versions update available, e.g.
>>org.assertj:assertj-core 2.5.0 -> 3.10.0;
>>
>>
>>- It is over 3 minor versions behind the latest version, e.g.
>>org.tukaani:xz 1.5 -> 1.8;
>>
>>
>>- The current version is behind the later version for over 180 days,
>>e.g. com.google.auto.service:auto-service 2014-10-24 -> 2017-12-11.
>>
>> In Beam, we make a best-effort attempt at keeping all dependencies
>> up-to-date. In the future, issues will be filed and tracked for these
>> automatically, but in the meantime you can search for existing issues or
>> open a new one. For more information: Beam Dependency Guide
>> 
>>
> --
> Regards,
> Tomo
>


Re: [ANNOUNCE] New committer: Brian Hulette

2019-11-25 Thread Ahmet Altay
Congratulations, Brian!

On Tue, Nov 19, 2019 at 11:04 PM Tanay Tummalapalli 
wrote:

> Congratulations!
>
> On Wed, Nov 20, 2019 at 6:15 AM Aizhamal Nurmamat kyzy <
> aizha...@apache.org> wrote:
>
>> Congratulations, Brian!
>>
>> On Mon, Nov 18, 2019 at 10:29 AM Łukasz Gajowy 
>> wrote:
>>
>>> Congratulations! :)
>>>
>>> pt., 15 lis 2019 o 15:52 Chamikara Jayalath 
>>> napisał(a):
>>>
 Congrats, Brian!

 On Fri, Nov 15, 2019 at 11:39 AM Yichi Zhang  wrote:

> Congratulations, Brian!
>
> On Fri, Nov 15, 2019 at 11:12 AM Alan Myrvold 
> wrote:
>
>> Congratulations, Brian!
>>
>> On Fri, Nov 15, 2019 at 11:11 AM Yifan Zou 
>> wrote:
>>
>>> Well deserved. Congratulations!
>>>
>>> On Fri, Nov 15, 2019 at 11:06 AM Kirill Kozlov <
>>> kirillkoz...@google.com> wrote:
>>>
 Congratulations, Brian!

 On Fri, Nov 15, 2019 at 10:56 AM Udi Meiri 
 wrote:

> Congrats Brian!
>
>
> On Fri, Nov 15, 2019 at 10:47 AM Ruoyun Huang 
> wrote:
>
>> Congrats Brian!
>>
>> On Fri, Nov 15, 2019 at 10:41 AM Robin Qiu 
>> wrote:
>>
>>> Congrats, Brian!
>>>
>>> On Fri, Nov 15, 2019 at 10:02 AM Daniel Oliveira <
>>> danolive...@google.com> wrote:
>>>
 Congratulations Brian! It's well deserved.

 On Fri, Nov 15, 2019, 9:37 AM Alexey Romanenko <
 aromanenko@gmail.com> wrote:

> Congratulations, Brian!
>
> On 15 Nov 2019, at 18:27, Rui Wang  wrote:
>
> Congrats!
>
>
> -Rui
>
> On Fri, Nov 15, 2019 at 8:16 AM Thomas Weise 
> wrote:
>
>> Congratulations!
>>
>>
>> On Fri, Nov 15, 2019 at 6:34 AM Connell O'Callaghan <
>> conne...@google.com> wrote:
>>
>>> Well done Brian!!!
>>>
>>> Kenn thank you for sharing
>>>
>>> On Fri, Nov 15, 2019 at 6:31 AM Cyrus Maden <
>>> cma...@google.com> wrote:
>>>
 Congrats Brian!

 On Fri, Nov 15, 2019 at 5:25 AM Ismaël Mejía <
 ieme...@gmail.com> wrote:

> Congratulations Brian!
> Happy to see this happening and eager to see more of your
> work!
>
> On Fri, Nov 15, 2019 at 11:02 AM Ankur Goenka <
> goe...@google.com> wrote:
> >
> > Congrats Brian!
> >
> > On Fri, Nov 15, 2019, 2:42 PM Jan Lukavský <
> je...@seznam.cz> wrote:
> >>
> >> Congrats Brian!
> >>
> >> On 11/15/19 9:58 AM, Reza Rokni wrote:
> >>
> >> Great news!
> >>
> >> On Fri, 15 Nov 2019 at 15:09, Gleb Kanterov <
> g...@spotify.com> wrote:
> >>>
> >>> Congratulations!
> >>>
> >>> On Fri, Nov 15, 2019 at 5:44 AM Valentyn Tymofieiev <
> valen...@google.com> wrote:
> 
>  Congratulations, Brian!
> 
>  On Thu, Nov 14, 2019 at 6:25 PM jincheng sun <
> sunjincheng...@gmail.com> wrote:
> >
> > Congratulation Brian!
> >
> > Best,
> > Jincheng
> >
> >> Kyle Weaver  wrote on Fri, Nov 15, 2019
> >> at 7:19 AM:
> >>
> >> Thanks for your contributions and congrats Brian!
> >>
> >> On Thu, Nov 14, 2019 at 3:14 PM Kenneth Knowles <
> k...@apache.org> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> Please join me and the rest of the Beam PMC in
> welcoming a new committer: Brian Hulette
> >>>
> >>> Brian introduced himself to dev@ earlier this
> year and has been contributing since then. His contributions 
> to Beam
> include explorations of integration with Arrow, standardizing 
> coders,
> portability for schemas, and presentations at Beam events.
> >>>
> >>> In consideration of Brian's contributions, the
> Beam PMC trusts him with the responsibilities of a Beam 
> committer [1].

Re: Beam Dependency Check Report (2019-11-25)

2019-11-25 Thread Tomo Suzuki
Can anybody take action on this error?

> The service account was not found. The instance must be restarted via the
Compute Engine API to restore service account access.

Regards,
Tomo

On Mon, Nov 25, 2019 at 07:04 Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> ('Failed to retrieve
> http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/844138762903-comp...@developer.gserviceaccount.com/token
> from the Google Compute Engine metadata service. Status: 404
> Response:\nb\'"The service account was not found. The instance must be
> restarted via the Compute Engine API to restore service account
> access."\'', )
>
> ('Failed to retrieve
> http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/844138762903-comp...@developer.gserviceaccount.com/token
> from the Google Compute Engine metadata service. Status: 404
> Response:\nb\'"The service account was not found. The instance must be
> restarted via the Compute Engine API to restore service account
> access."\'', )
> A dependency update is high priority if it satisfies one of following
> criteria:
>
>- It has major versions update available, e.g.
>org.assertj:assertj-core 2.5.0 -> 3.10.0;
>
>
>- It is over 3 minor versions behind the latest version, e.g.
>org.tukaani:xz 1.5 -> 1.8;
>
>
>- The current version is behind the later version for over 180 days,
>e.g. com.google.auto.service:auto-service 2014-10-24 -> 2017-12-11.
>
> In Beam, we make a best-effort attempt at keeping all dependencies
> up-to-date. In the future, issues will be filed and tracked for these
> automatically, but in the meantime you can search for existing issues or
> open a new one. For more information: Beam Dependency Guide
> 
>
-- 
Regards,
Tomo


Beam Dependency Check Report (2019-11-25)

2019-11-25 Thread Apache Jenkins Server

 ('Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/844138762903-comp...@developer.gserviceaccount.com/token from the Google Compute Engine metadata service. Status: 404 Response:\nb\'"The service account was not found. The instance must be restarted via the Compute Engine API to restore service account access."\'', )

 ('Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/844138762903-comp...@developer.gserviceaccount.com/token from the Google Compute Engine metadata service. Status: 404 Response:\nb\'"The service account was not found. The instance must be restarted via the Compute Engine API to restore service account access."\'', )
 A dependency update is high priority if it satisfies one of following criteria: 

 It has major versions update available, e.g. org.assertj:assertj-core 2.5.0 -> 3.10.0; 


 It is over 3 minor versions behind the latest version, e.g. org.tukaani:xz 1.5 -> 1.8; 


 The current version is behind the later version for over 180 days, e.g. com.google.auto.service:auto-service 2014-10-24 -> 2017-12-11. 

 In Beam, we make a best-effort attempt at keeping all dependencies up-to-date.
 In the future, issues will be filed and tracked for these automatically,
 but in the meantime you can search for existing issues or open a new one.

 For more information:  Beam Dependency Guide