Re: Patch release proposal

2024-03-27 Thread Robert Burke
+1 to a targeted patch release.

We did the same for the Go SDK a little while back. It would be good to see
what's different for a different SDK.

On Wed, Mar 27, 2024, 4:01 PM Robert Bradshaw via dev 
wrote:

> Given the severity of the breakage, and the simplicity of the workaround,
> I'm in favor of a patch release. I think we could do Python-only, which
> would make the process even more lightweight.
>
> On Wed, Mar 27, 2024 at 3:48 PM Jeff Kinard  wrote:
>
>> Hi all,
>>
>> Beam 2.55 was released with a bug that causes WriteToJson on Beam YAML to
>> fail when using the Java variant. This also affects any user attempting to
>> use the Xlang JsonWriteTransformProvider -
>> https://github.com/apache/beam/blob/master/sdks/java/io/json/src/main/java/org/apache/beam/sdk/io/json/providers/JsonWriteTransformProvider.java
>>
>> This is due to a change to
>> https://github.com/apache/beam/blob/master/sdks/java/io/json/build.gradle
>> that removed a dependency on everit, which also removed it from being
>> packaged into the expansion service JAR:
>> beam-sdks-java-extensions-sql-expansion-service-2.55.0.jar
>>
>> There is a temporary fix to disable the provider in Beam YAML:
>> https://github.com/apache/beam/pull/30777
>>
>> I think with the total loss of function, and a trivial fix, it is worth
>> creating a patch release of Beam 2.55 to include this fix.
>>
>> - Jeff
>>
>>


Re: container dev environment: go get issue

2024-03-22 Thread Robert Burke
Excellent!

These days Go has become much simpler to deal with (nearly any folder with
a go.mod is a Go project), but legacy GOPATH things remain to confuse
matters.

When I'm at a computer I'll see how necessary that line was for Beam Go
development with that env script.

I'd run one of the container-building gradle tasks to be sure everything is
working properly, so you don't run into a surprise later. It should be the
only spot building Go for the non-Go SDKs at the moment.

On Fri, Mar 22, 2024, 6:00 AM Joey Tran  wrote:

> Woohoo it works! How could I forget the oldest trick in the book "just
> delete the problematic line"
>
> Thanks for the quick response. I am unblocked now :)
>
> On Fri, Mar 22, 2024 at 8:47 AM Robert Burke  wrote:
>
>> It's not clear to me why that's even requesting that package at all. I
>> would remove that 'go get' line.
>>
>> There's a different issue at play here too, since it was written with
>> pre-module Go in mind. I'm unfamiliar with that script though.
>>
>> I'll take a proper look in a few hours.
>>
>> On Fri, Mar 22, 2024, 5:25 AM Joey Tran 
>> wrote:
>>
>>> Hi,
>>>
>>> I've been banging my head trying to get a dev environment working. I
>>> gave up trying to get a local python environment working after I got some
>>> weird clang errors and proto generation issues so I've been trying to just
>>> use the docker container by running `bash  start-build-env.sh` but I'm
>>> running into issues installing goavro.
>>>
>>> ```
>>>  => ERROR [7/8] RUN go get github.com/linkedin/goavro/v2
>>>   0.2s
>>> --
>>>  > [7/8] RUN go get github.com/linkedin/goavro/v2:
>>> 0.190 go: go.mod file not found in current directory or any parent
>>> directory.
>>> 0.190   'go get' is no longer supported outside a module.
>>> 0.190   To build and install a command, use 'go install' with a version,
>>> 0.190   like 'go install example.com/cmd@latest'
>>> 0.190   For more information, see
>>> https://golang.org/doc/go-get-install-deprecation
>>> 0.190   or run 'go help get' or 'go help install'.
>>> --
>>> Dockerfile:10
>>> 
>>>8 | ENV GOPATH
>>> /home/jtran/beam/sdks/go/examples/.gogradle/project_gopath
>>>9 | # This next command still runs as root causing the
>>> ~/.cache/go-build to be owned by root
>>>   10 | >>> RUN go get github.com/linkedin/goavro/v2
>>>   11 | RUN chown -R jtran:100 /home/jtran/.cache
>>>   12 |
>>> ```
>>>
>>> I have no familiarity with Go or Go packaging, and my googling hasn't
>>> yielded much insight.
>>>
>>> Any advice? I'm on an M2 mac, go version 1.21.1. I've tried setting
>>> GO111MODULE to various values as well.
>>>
>>


Re: container dev environment: go get issue

2024-03-22 Thread Robert Burke
It's not clear to me why that's even requesting that package at all. I
would remove that 'go get' line.

There's a different issue at play here too, since it was written with
pre-module Go in mind. I'm unfamiliar with that script though.

I'll take a proper look in a few hours.

On Fri, Mar 22, 2024, 5:25 AM Joey Tran  wrote:

> Hi,
>
> I've been banging my head trying to get a dev environment working. I gave
> up trying to get a local python environment working after I got some weird
> clang errors and proto generation issues so I've been trying to just use
> the docker container by running `bash  start-build-env.sh` but I'm running
> into issues installing goavro.
>
> ```
>  => ERROR [7/8] RUN go get github.com/linkedin/goavro/v2
> 0.2s
> --
>  > [7/8] RUN go get github.com/linkedin/goavro/v2:
> 0.190 go: go.mod file not found in current directory or any parent
> directory.
> 0.190   'go get' is no longer supported outside a module.
> 0.190   To build and install a command, use 'go install' with a version,
> 0.190   like 'go install example.com/cmd@latest'
> 0.190   For more information, see
> https://golang.org/doc/go-get-install-deprecation
> 0.190   or run 'go help get' or 'go help install'.
> --
> Dockerfile:10
> 
>8 | ENV GOPATH
> /home/jtran/beam/sdks/go/examples/.gogradle/project_gopath
>9 | # This next command still runs as root causing the
> ~/.cache/go-build to be owned by root
>   10 | >>> RUN go get github.com/linkedin/goavro/v2
>   11 | RUN chown -R jtran:100 /home/jtran/.cache
>   12 |
> ```
>
> I have no familiarity with Go or Go packaging, and my googling hasn't
> yielded much insight.
>
> Any advice? I'm on an M2 mac, go version 1.21.1. I've tried setting
> GO111MODULE to various values as well.
>


Re: [DISCUSS] Processing time timers in "batch" (faster-than-wall-time [re]processing)

2024-02-28 Thread Robert Burke
Sounds like a different variation is either new timer types with those
distinctions in mind, or additional configuration for ProcessingTime timers
(defaulting to current behavior) to sort out those cases. Could potentially be
extended to EventTime timers too for explicitly handling looping timer cases
(e.g. to signal: this DoFn's OnWindowExpiry method manages the consequences of
this timer's effect on a Drain. Or similar. Or we put that as an additional
configuration for OnWindowExpiry, along with Drain Awareness...)

I got curious and looked loosely at how Flink solves this problem:  
https://flink.apache.org/2022/11/25/optimising-the-throughput-of-async-sinks-using-a-custom-ratelimitingstrategy/

In short, an explicit rate-limiting strategy. A surface glance indicates that
it relies on local in-memory state, but actual use of these things seems
relegated to abstract classes (e.g. for Sinks and similar). It's not clear to
me whether there is cross-worker coordination happening there, or whether it's
assumed to be all on a single machine anyway. I'm unfamiliar with how Flink
operates, so I can't say.

I think I'd be happiest if we could build into Beam a mechanism / paired 
primitive where such a Cross Worker Communication Pair (the processor/server + 
DoFn client) could be built, but not purely be limited to Rate 
limiting/Throttling. Possibly mumble mumble StatePipe? But that feels like a 
harder problem for the time being.

Robert Burke

On 2024/02/28 08:25:35 Jan Lukavský wrote:
> 
> On 2/27/24 19:49, Robert Bradshaw via dev wrote:
> > On Tue, Feb 27, 2024 at 10:39 AM Jan Lukavský  wrote:
> >> On 2/27/24 19:22, Robert Bradshaw via dev wrote:
> >>> On Mon, Feb 26, 2024 at 11:45 AM Kenneth Knowles  wrote:
> >>>> Pulling out focus points:
> >>>>
> >>>> On Fri, Feb 23, 2024 at 7:21 PM Robert Bradshaw via dev 
> >>>>  wrote:
> >>>>> I can't act on something yet [...] but I expect to be able to [...] at 
> >>>>> some time in the processing-time future.
> >>>> I like this as a clear and internally-consistent feature description. It 
> >>>> describes ProcessContinuation and those timers which serve the same 
> >>>> purpose as ProcessContinuation.
> >>>>
> >>>> On Fri, Feb 23, 2024 at 7:21 PM Robert Bradshaw via dev 
> >>>>  wrote:
> >>>>> I can't think of a batch or streaming scenario where it would be 
> >>>>> correct to not wait at least that long
> >>>> The main reason we created timers: to take action in the absence of 
> >>>> data. The archetypal use case for processing time timers was/is "flush 
> >>>> data from state if it has been sitting there too long". For this use 
> >>>> case, the right behavior for batch is to skip the timer. It is actually 
> >>>> basically incorrect to wait.
> >>> Good point calling out the distinction between "I need to wait in case
> >>> there's more data." and "I need to wait for something external." We
> >>> can't currently distinguish between the two, but a batch runner can
> >>> say something definitive about the first. Feels like we need a new
> >>> primitive (or at least new signaling information on our existing
> >>> primitive).
> >> Runners signal end of data to a DoFn via (input) watermark. Is there a
> >> need for additional information?
> > Yes, and I agree that watermarks/event timestamps are a much better
> > way to track data completeness (if possible).
> >
> > Unfortunately processing timers don't specify if they're waiting for
> > additional data or external/environmental change, meaning we can't use
> > the (event time) watermark to determine whether they're safe to
> > trigger.
> +1
> 


Re: [DISCUSS] Processing time timers in "batch" (faster-than-wall-time [re]processing)

2024-02-27 Thread Robert Burke
utput timestamp, if so, it probably should be taken into account.
>>
>> Now, such semantics should be quite aligned with what we do in streaming
>> case and what users generally expect. The blocking part can be implemented
>> in @ProcessElement using buffer & timer, once there is need to wait, it can
>> be implemented in user code using plain sleep(). That is due to the
>> alignment between local time and definition of processing time. If we had
>> some reason to be able to run faster-than-wall-clock (as I'm still not in
>> favor of that), we could do that using ProcessContext.sleep(). Delaying
>> processing in the @ProcessElement should result in backpressuring and
>> backpropagation of this backpressure from the Throttle transform to the
>> sources as mentioned (of course this is only for the streaming case).
>>
>> Is there anything missing in such definition that would still require
>> splitting the timers into two distinct features?
>>
>>  Jan
>> On 2/26/24 21:22, Kenneth Knowles wrote:
>>
>> Yea I like DelayTimer, or SleepTimer, or WaitTimer or some such.
>>
>> OutputTime is always an event time timestamp so it isn't even allowed to
>> be set outside the window (or you'd end up with an element assigned to a
>> window that it isn't within, since OutputTime essentially represents
>> reserving the right to output an element with that timestamp)
>>
>> Kenn
>>
>> On Mon, Feb 26, 2024 at 3:19 PM Robert Burke  wrote:
>>
>>> Agreed that a retroactive behavior change would be bad, even if tied to
>>> a beam version change. I agree that it meshes well with the general theme
>>> of State & Timers exposing underlying primitives for implementing Windowing
>>> and similar. I'd say the distinction between the two might be additional
>>> complexity for users to grok, and would need to be documented well, as both
>>> operate in the ProcessingTime domain, but differently.
>>>
>>> What to call this new timer then? DelayTimer?
>>>
>>> "A DelayTimer sets an instant in ProcessingTime at which point
>>> computations can continue. Runners will prevent the EventTimer watermark
>>> from advancing past the set OutputTime until Processing Time has advanced
>>> to at least the provided instant to execute the timers callback. This can
>>> be used to allow the runner to constrain pipeline throughput with user
>>> guidance."
>>>
>>> I'd probably add that a timer with an output time outside of the window
>>> would not be guaranteed to fire, and that OnWindowExpiry is the correct way
>>> to ensure cleanup occurs.
>>>
>>> No solution to the Looping Timers on Drain problem here, but i think
>>> that's ultimately an orthogonal discussion, and will restrain my thoughts
>>> on that for now.
>>>
>>> This isn't a proposal, but exploring the solution space within our
>>> problem. We'd want to break down exactly what's different and the same for
>>> the 3 kinds of timers...
>>>
>>>
>>>
>>>
>>> On Mon, Feb 26, 2024, 11:45 AM Kenneth Knowles  wrote:
>>>
>>>> Pulling out focus points:
>>>>
>>>> On Fri, Feb 23, 2024 at 7:21 PM Robert Bradshaw via dev <
>>>> dev@beam.apache.org> wrote:
>>>> > I can't act on something yet [...] but I expect to be able to [...]
>>>> at some time in the processing-time future.
>>>>
>>>> I like this as a clear and internally-consistent feature description.
>>>> It describes ProcessContinuation and those timers which serve the same
>>>> purpose as ProcessContinuation.
>>>>
>>>> On Fri, Feb 23, 2024 at 7:21 PM Robert Bradshaw via dev <
>>>> dev@beam.apache.org> wrote:
>>>> > I can't think of a batch or streaming scenario where it would be
>>>> correct to not wait at least that long
>>>>
>>>> The main reason we created timers: to take action in the absence of
>>>> data. The archetypal use case for processing time timers was/is "flush data
>>>> from state if it has been sitting there too long". For this use case, the
>>>> right behavior for batch is to skip the timer. It is actually basically
>>>> incorrect to wait.
>>>>
>>>> On Fri, Feb 23, 2024 at 3:54 PM Robert Burke 
>>>> wrote:
>>>> > It doesn't require a new primitive.
>>>>
>>>> IMO what's being proposed *is* a new primitive. I think it is a g

Re: [DISCUSS] Processing time timers in "batch" (faster-than-wall-time [re]processing)

2024-02-26 Thread Robert Burke
Agreed that a retroactive behavior change would be bad, even if tied to a
beam version change. I agree that it meshes well with the general theme of
State & Timers exposing underlying primitives for implementing Windowing
and similar. I'd say the distinction between the two might be additional
complexity for users to grok, and would need to be documented well, as both
operate in the ProcessingTime domain, but differently.

What to call this new timer then? DelayTimer?

"A DelayTimer sets an instant in ProcessingTime at which point computations
can continue. Runners will prevent the EventTimer watermark from advancing
past the set OutputTime until Processing Time has advanced to at least the
provided instant to execute the timers callback. This can be used to allow
the runner to constrain pipeline throughput with user guidance."

I'd probably add that a timer with an output time outside of the window
would not be guaranteed to fire, and that OnWindowExpiry is the correct way
to ensure cleanup occurs.

No solution to the Looping Timers on Drain problem here, but i think that's
ultimately an orthogonal discussion, and will restrain my thoughts on that
for now.

This isn't a proposal, but exploring the solution space within our problem.
We'd want to break down exactly what's different and the same for the 3 kinds
of timers...




On Mon, Feb 26, 2024, 11:45 AM Kenneth Knowles  wrote:

> Pulling out focus points:
>
> On Fri, Feb 23, 2024 at 7:21 PM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
> > I can't act on something yet [...] but I expect to be able to [...] at
> some time in the processing-time future.
>
> I like this as a clear and internally-consistent feature description. It
> describes ProcessContinuation and those timers which serve the same purpose
> as ProcessContinuation.
>
> On Fri, Feb 23, 2024 at 7:21 PM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
> > I can't think of a batch or streaming scenario where it would be correct
> to not wait at least that long
>
> The main reason we created timers: to take action in the absence of data.
> The archetypal use case for processing time timers was/is "flush data from
> state if it has been sitting there too long". For this use case, the right
> behavior for batch is to skip the timer. It is actually basically incorrect
> to wait.
>
> On Fri, Feb 23, 2024 at 3:54 PM Robert Burke  wrote:
> > It doesn't require a new primitive.
>
> IMO what's being proposed *is* a new primitive. I think it is a good
> primitive. It is the underlying primitive to ProcessContinuation. It
> would be user-friendly as a kind of timer. But if we made this the behavior
> of processing time timers retroactively, it would break everyone using them
> to flush data who is also reprocessing data.
>
> There's two very different use cases ("I need to wait, and block data" vs
> "I want to act without data, aka NOT wait for data") and I think we should
> serve both of them, but it doesn't have to be with the same low-level
> feature.
>
> Kenn
>
>
> On Fri, Feb 23, 2024 at 7:21 PM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
>
>> On Fri, Feb 23, 2024 at 3:54 PM Robert Burke  wrote:
>> >
>> > While I'm currently on the other side of the fence, I would not be
>> against changing/requiring the semantics of ProcessingTime constructs to be
>> "must wait and execute" as such a solution, and enables the Proposed
>> "batch" process continuation throttling mechanism to work as hypothesized
>> for both "batch" and "streaming" execution.
>> >
>> > There's a lot to like, as it leans Beam further into the unification of
>> Batch and Stream, with one fewer exception (eg. unifies timer experience
>> further). It doesn't require a new primitive. It probably matches more with
>> user expectations anyway.
>> >
>> > It does cause looping timer execution with processing time to be a
>> problem for Drains however.
>>
>> I think we have a problem with looping timers plus drain (a mostly
>> streaming idea anyway) regardless.
>>
>> > I'd argue though that in the case of a drain, we could update the
>> semantics as "move watermark to infinity"  "existing timers are executed,
>> but new timers are ignored",
>>
>> I don't like the idea of dropping timers for drain. I think correct
>> handling here requires user visibility into whether a pipeline is
>> draining or not.
>>
>> > and ensure/and update the requirements around OnWindowExpiration
>> callbacks to be a bit more insistent on being implemented for correct
>> execution, whic

Re: [DISCUSS] Processing time timers in "batch" (faster-than-wall-time [re]processing)

2024-02-23 Thread Robert Burke
While I'm currently on the other side of the fence, I would not be against
changing/requiring the semantics of ProcessingTime constructs to be "must wait
and execute" as such a solution; it enables the proposed "batch" process
continuation throttling mechanism to work as hypothesized for both "batch" and
"streaming" execution.

There's a lot to like, as it leans Beam further into the unification of Batch 
and Stream, with one fewer exception (eg. unifies timer experience further). It 
doesn't require a new primitive. It probably matches more with user 
expectations anyway.

It does cause looping timer execution with processing time to be a problem for 
Drains however.

I'd argue though that in the case of a drain, we could update the semantics as
"move watermark to infinity" and "existing timers are executed, but new timers
are ignored", and ensure/update the requirements around OnWindowExpiration
callbacks to be a bit more insistent on being implemented for correct
execution, which is currently the only "hard" signal to the SDK side that the
window's work is guaranteed to be over and remaining state needs to be
addressed by the transform or be garbage collected. This remains critical for
developing a good pattern for ProcessingTime timers within a Global Window too.

On 2024/02/23 19:48:22 Robert Bradshaw via dev wrote:
> Thanks for bringing this up.
> 
> My position is that both batch and streaming should wait for
> processing time timers, according to local time (with the exception of
> tests that can accelerate this via faked clocks).
> 
> Both ProcessContinuations delays and ProcessingTimeTimers are IMHO
> isomorphic, and can be implemented in terms of each other (at least in
> one direction, and likely the other). Both are an indication that I
> can't act on something yet due to external constraints (e.g. not all
> the data has been published, or I lack sufficient capacity/quota to
> push things downstream) but I expect to be able to (or at least would
> like to check again) at some time in the processing-time future. I
> can't think of a batch or streaming scenario where it would be correct
> to not wait at least that long (even in batch inputs, e.g. suppose I'm
> tailing logs and was eagerly started before they were fully written,
> or waiting for some kind of (non-data-dependent) quiessence or other
> operation to finish).
> 
> 
> On Fri, Feb 23, 2024 at 12:36 AM Jan Lukavský  wrote:
> >
> > For me it always helps to seek analogy in our physical reality. Stream
> > processing actually has quite a good analogy for both event-time and
> > processing-time - the simplest model for this being relativity theory.
> > Event-time is the time at which events occur _at distant locations_. Due
> > to finite and invariant speed of light (which is actually really
> > involved in the explanation why any stream processing is inevitably
> > unordered) these events are observed (processed) at different times
> > (processing time, different for different observers). It is perfectly
> > possible for an observer to observe events at a rate that is higher than
> > one second per second. This also happens in reality for observers that
> > travel at relativistic speeds (which might be an analogy for fast -
> > batch - (re)processing). Besides the invariant speed, there is also
> > another invariant - local clock (wall time) always ticks exactly at the
> > rate of one second per second, no matter what. It is not possible to
> > "move faster or slower" through (local) time.
> >
> > In my understanding the reason why we do not put any guarantees or
> > bounds on the delay of firing processing time timers is purely technical
> > - the processing is (per key) single-threaded, thus any timer has to
> > wait before any element processing finishes. This is only consequence of
> > a technical solution, not something fundamental.
> >
> > Having said that, my point is that according to the above analogy, it
> > should be perfectly fine to fire processing time timers in batch based
> > on (local wall) time only. There should be no way of manipulating this
> > local time (excluding tests). Watermarks should be affected the same way
> > as any buffering in a state that would happen in a stateful DoFn (i.e.
> > set timer holds output watermark). We should probably pay attention to
> > looping timers, but it seems possible to define a valid stopping
> > condition (input watermark at infinity).
> >
> >   Jan
> >
> > On 2/22/24 19:50, Kenneth Knowles wrote:
> > > Forking this thread.
> > >
> > > The state of processing time timers in this mode of processing is not
> > > satisfactory and is discussed a lot but we should make everything
> > > explicit.
> > >
> > > Currently, a state and timer DoFn has a number of logical watermarks:
> > > (apologies for fixed width not coming through in email lists). Treat
> > > timers as a back edge.
> > >
> > > input --(A)(C)--> ParDo(DoFn) (D)---> output
> > > ^   

Re: [DISCUSS] Processing time timers in "batch" (faster-than-wall-time [re]processing)

2024-02-22 Thread Robert Burke
This is a "timely" discussion because my next step for Prism is to address 
ProcessingTime.

The description of the watermarks matches my understanding and how it's 
implemented so far in Prism [0], where the "stage" contains one or more 
transforms to be executed by a worker.

My current thinking on processing time is in the issue tracker [1], largely
focused on quite the opposite case from throttling: ensuring fast execution
for pipelines with TestStream. As TestStream is for tests, and tests should
execute quickly, there's no reason to do anything but synthetically advance
the processing time. However, this only gates the Runner actions for
processing time, not the Worker/SDK actions for processing time.

Of note during my explorations: there are two places ProcessingTime is
invoked: a relatively scheduled resume for ProcessContinuations, and an
absolute time for ProcessingTime timers. It's much easier to ignore a relative
time, but absolute times are a bit harder, since they're never going to be
based on what the Runner time is, which will be skewed from SDK time, since
there's no passing of processing time from Runner to SDK.

I agree that the main purpose of ProcessingTime timers is to time out state
for "Streaming" execution, and similarly that OnWindowExpiration guarantees
any state is addressed for EventTime timer handling within a window. I also
agree that a "Batch" execution shouldn't wait for ProcessingTime timers, but
should still execute OnWindowExpirations. Notably, the existing behavior of a
ProcessingTime timer is not to block execution, but to schedule potential
execution. It would be wrong to block, in other words.

Similarly, ProcessContinuations only declare a suggested resume time. It's
still up to the DoFn returning the ProcessContinuation, assuming it's time
dependent, to actually check the time for its desired behavior. It's not a
block, but an indication of when additional work might be available, and that
it's probably a waste of time for the runner to schedule the work sooner than
the recommended delay.

What's lacking is a Beam notion of Runner-directed cross-worker global state,
I think.

I don't know what that looks like exactly, though, in a way that would be
useful for more than simply a throttle. One could imagine a Special transform
that is periodically executed on SDK workers in response to something, and a
Special SideInput that is how that information is propagated to other
transforms (like the throttle transform). But that just sounds like a variant
of Slowly Changing SideInputs, instead of allowing the Special transform to
direct the runner's sharding and management of some other transforms. Hard to
see how useful that is outside of the throttle though.

We could add a Block primitive that does exactly that. Similar to timers, but
execution SDK side is held until the Runner sends an Unblock signal for a
given bundle instruction+blockID combo back to the SDK. But again, that seems
only useful for a central throttling notion. Technically, Google's internal
Flume batch processor has the notion of a FlumeThrottle to solve exactly this
problem.

I'd be happiest if we could figure out a less operationally specific primitive, 
but if not, a token bucket based BeamThrottle would be useful in batch and 
streaming, and shouldn't be too difficult to add to most runners and SDKs 
(though the amount of work will of course vary).

I've gotten away from the core topic. My opinion is "ProcessingTime Timers 
Shouldn't Block Execution" and "We should figure out the best central primitive 
to manage this class of concept".

Robert Burke
Beam Go Busybody

[0] 
https://github.com/apache/beam/blob/11f9bce485c4f6fe466ff4bf5073d2414e43678c/sdks/go/pkg/beam/runners/prism/internal/engine/elementmanager.go#L1253-L1331
[1] https://github.com/apache/beam/issues/30083


On 2024/02/22 18:50:10 Kenneth Knowles wrote:
> Forking this thread.
> 
> The state of processing time timers in this mode of processing is not
> satisfactory and is discussed a lot but we should make everything explicit.
> 
> Currently, a state and timer DoFn has a number of logical watermarks:
> (apologies for fixed width not coming through in email lists). Treat timers
> as a back edge.
> 
> input --(A)(C)--> ParDo(DoFn) (D)---> output
> ^  |
> |--(B)-|
>timers
> 
> 
> (A) Input Element watermark: this is the watermark that promises there is
> no incoming element with a timestamp earlier than it. Each input element's
> timestamp holds this watermark. Note that *event time timers firing is
> according to this watermark*. But a runner commits changes to this
> watermark *whenever it wants*, in a way that can be consistent. So the
> runner can absolute p

Re: Throttle PTransform

2024-02-21 Thread Robert Burke
ransform. In general, a need for such
>> split
>> >> > triggers doubts in me. This signals that either
>> >> >
>> >> >   a) the transform does something it should not, or
>> >> >
>> >> >   b) Beam model is not complete in terms of being "unified"
>> >> >
>> >> > The problem that is described in the document is that in the batch
>> case
>> >> > timers are not fired appropriately.
>> >>
>> >> +1. The underlying flaw is that processing time timers are not handled
>> >> correctly in batch, but should be (even if it means keeping workers
>> >> idle?). We should fix this.
>> >>
>> >> > This is actually on of the
>> >> > motivations that led to introduction of @RequiresTimeSortedInput
>> >> > annotation and, though mentioned years ago as a question, I do not
>> >> > remember what arguments were used against enforcing sorting inputs by
>> >> > timestamp in the batch stateful DoFn as a requirement in the model.
>> That
>> >> > would enable the appropriate firing of timers while preserving the
>> batch
>> >> > invariant which is there are no late data allowed. IIRC there are
>> >> > runners that do this sorting by default (at least the sorting, not
>> sure
>> >> > about the timers, but once inputs are sorted, firing timers is
>> simple).
>> >> >
>> >> > A different question is if this particular transform should maybe
>> fire
>> >> > not by event time, but rather processing time?
>> >>
>> >> Yeah, I was reading all of these as processing time. Throttling by
>> >> event time doesn't make much sense.
>> >>
>> >> > On 2/21/24 03:00, Robert Burke wrote:
>> >> > > Thanks for the design Damon! And thanks for collaborating with me
>> on getting a high level textual description of the key implementation idea
>> down in writing. I think the solution is pretty elegant.
>> >> > >
>> >> > > I do have concerns about how different Runners might handle
>> ProcessContinuations for the Bounded Input case. I know Dataflow famously
>> has two different execution modes under the hood, but I agree with the
>> principle that ProcessContinuation.Resume should largely be in line with
>> the expected delay, though it's by no means guaranteed AFAIK.
>> >> > >
>> >> > > We should also ensure this is linked from
>> https://s.apache.org/beam-design-docs if not already.
>> >> > >
>> >> > > Robert Burke
>> >> > > Beam Go Busybody
>> >> > >
>> >> > > On 2024/02/20 14:00:00 Damon Douglas wrote:
>> >> > >> Hello Everyone,
>> >> > >>
>> >> > >> The following describes a Throttle PTransform that holds element
>> throughput
>> >> > >> to minimize downstream API overusage. Thank you for reading and
>> your
>> >> > >> valuable input.
>> >> > >>
>> >> > >> https://s.apache.org/beam-throttle-transform
>> >> > >>
>> >> > >> Best,
>> >> > >>
>> >> > >> Damon
>> >> > >>
>>
>


Re: Throttle PTransform

2024-02-20 Thread Robert Burke
Thanks for the design Damon! And thanks for collaborating with me on getting a 
high level textual description of the key implementation idea down in writing. 
I think the solution is pretty elegant. 

I do have concerns about how different Runners might handle 
ProcessContinuations for the Bounded Input case. I know Dataflow famously has 
two different execution modes under the hood, but I agree with the principle 
that ProcessContinuation.Resume should largely be in line with the expected 
delay, though it's by no means guaranteed AFAIK.

We should also ensure this is linked from https://s.apache.org/beam-design-docs 
if not already. 

Robert Burke
Beam Go Busybody

On 2024/02/20 14:00:00 Damon Douglas wrote:
> Hello Everyone,
> 
> The following describes a Throttle PTransform that holds element throughput
> to minimize downstream API overusage. Thank you for reading and your
> valuable input.
> 
> https://s.apache.org/beam-throttle-transform
> 
> Best,
> 
> Damon
> 
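
The core idea behind such a throttle — holding element throughput to a target
rate to protect a downstream API — can be sketched with a token bucket. This
plain-Python sketch is illustrative only (the `TokenBucket` class and its
parameters are not part of the Beam API or the design doc); in the actual
proposal a DoFn would express back-off by returning a ProcessContinuation
resume signal rather than blocking:

```python
import time

class TokenBucket:
    """Token-bucket limiter: at most `rate` events/sec, bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)  # start full: allow an initial burst
        self.clock = clock             # injectable for deterministic tests
        self.last = clock()

    def try_acquire(self):
        """Return True if an element may proceed now, False if it should back off."""
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        # Caller backs off and retries later (analogous to resuming the DoFn
        # after the expected delay).
        return False
```

With `rate=10` and `capacity=2`, two elements pass immediately and the third
must wait roughly 0.1s for a token to refill.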


Re: [API PROPOSAL] PTransform.getURN, toProto, etc, for Java

2024-02-15 Thread Robert Burke
+1

While the current Go SDK has always been portability first, it was designed
with a goal of enabling it to back out of that at the time, so it relies fully
on a broad vertical slice of code to translate to protos and back again,
leading to difficulties when adding a new core transform.

I have an experimental hobby implementation of a Go SDK for prototyping
things (mostly seeing if Go Generics can make a pipeline compile time
typesafe, and the answer is yes... but that's a different email) and went
with emitting out a FunctionSpec, (urn and payload), the env ID, and
UniqueName, while inputs and outputs were handled with common code.

I still kept Execution side translation to be graph based at the time,
because of the lost type information, which required additional graph
context to build the execution side with the right types (eg for SDK side
source, sink, and flatten handling).

So I question if full symmetry is required. Eg. There's no reason for
ExternalTransforms to be converted back on execution side, or for GBKs
(usually that is, I'm looking at you Typescript SDK!). And conversely,
there are "Execution Side Only" transforms that are never directly written
by a pipeline or transform author, but are necessary to execute SDK side
(combine or SDF components for example), even though those have single user
side constructs.

That just implies that the toProto and fromProto parts are separable though.

But that's just that specific experimental design for that specific
languages affordances.

It's definitely a big plus to be able to see all the bits for a single
transform in one file, instead of trying to find the 5-8 different places
one must add a registration for it. More so in Java where such handler
registrations can be done via class annotations!

Robert Burke
Beam Go Busybody

On Thu, Feb 15, 2024, 10:37 AM Robert Bradshaw via dev 
wrote:

> On Wed, Feb 14, 2024 at 10:28 AM Kenneth Knowles  wrote:
> >
> > Hi all,
> >
> > TL;DR I want to add some API like PTransform.getURN, toProto and
> fromProto, etc. to the Java SDK. I want to do this so that making a
> PTransform support portability is a natural part of writing the transform
> and not a totally separate thing with tons of boilerplate.
> >
> > What do you think?
>
> Huge +1 to this direction.
>
> IMHO one of the most fundamental things about Beam is its model.
> Originally this was only expressed in a specific SDK (Java) and then
> got ported to others, but now that we have portability it's expressed
> in a language-independent way.
>
> The fact that we keep these separate in Java is not buying us
> anything, and causes a huge amount of boilerplate that'd be great to
> remove, as well as making the essential model more front-and-center.
>
> > I think a particular API can be sorted out most easily in code (which I
> will prepare after gathering some feedback).
> >
> > We already have all the translation logic written, and porting a couple
> transforms to it will ensure the API has everything we need. We can refer
> to Python and Go for API ideas as well.
> >
> > Lots of context below, but you can skip it...
> >
> > -
> >
> > When we first created the portability framework, we wanted the SDKs to
> be "standalone" and not depend on portability. We wanted portability to be
> an optional plugin that users could opt in to. That is totally the opposite
> now. We want portability to be the main place where Beam is defined, and
> then SDKs make that available in language idiomatic ways.
> >
> > Also when we first created the framework, we were experimenting with
> different serialization approaches and we wanted to be independent of
> protobuf and gRPC if we could. But now we are pretty committed and it would
> be a huge lift to use anything else.
> >
> > Finally, at the time we created the portability framework, we designed
> it to allow composites to have URNs and well-defined specs, rather than
> just be language-specific subgraphs, but we didn't really plan to make this
> easy.
> >
> > For all of the above, most users depend on portability and on proto. So
> separating them is not useful and just creates LOTS of boilerplate and
> friction for making new well-defined transforms.
> >
> > Kenn
>


Re: [RESULT] [VOTE] Vendored Dependencies Release beam-vendor-grpc-1-60-1:0.2

2024-02-15 Thread Robert Burke
I'll handle the PMC steps for this.
Thanks!

On 2024/02/15 09:37:55 Sam Whittle wrote:
> I'm happy to announce that we have unanimously approved the vendored release
> 
> of beam-vendor-grpc-1-60-1:0.2 .
> 
> There are 5 approving votes, 4 of which are binding:
> 
> * chamikara@ Chamikara Madhusanka Jayalath
> 
> * kenn@ Kenneth Knowles
> 
> * lostluck@ Robert Burke
> 
> * tvalentyn@ Valentyn Tymofieiev
> 
> * yhu@ Yi Hu (non-binding)
> 
> There are no disapproving votes.
> 
> Thanks everyone!
> 


[ANNOUNCE] Beam 2.54.0 Released

2024-02-14 Thread Robert Burke
The Apache Beam Team is pleased to announce the release of version 2.54.0.

You can download the release here:

https://beam.apache.org/get-started/downloads/

This release includes bug fixes, features, and improvements detailed on the
Beam Blog: https://beam.apache.org/blog/beam-2.54.0/
and the Github release page
https://github.com/apache/beam/releases/tag/v2.54.0

Thanks to everyone who contributed to this release, and we hope you enjoy
using Beam 2.54.0.

-- Robert Burke, on behalf of the Apache Beam Team.


Re: [RESULT] [VOTE] Release 2.54.0, release candidate #2

2024-02-14 Thread Robert Burke
The release is now complete 

https://beam.apache.org/blog/beam-2.54.0/

Please share and promote.

I'll be working on the last odds and ends, but the release is now out.

Robert Burke
Beam 2.54.0 Release Manager

On 2024/02/14 17:36:31 Robert Burke wrote:
> I'm happy to announce that we have unanimously approved this release.
> 
> There are 9 approving votes, 5 of which are binding:
> * Jan Lukavský
> * Chamikara Jayalath
> * Valentyn Tymofieiev
> * Robert Bradshaw
> * Robert Burke
> 
> There are no disapproving votes.
> 
> Thanks everyone!
> Robert Burke
> Beam 2.54.0 Release Manager
> 


[RESULT] [VOTE] Release 2.54.0, release candidate #2

2024-02-14 Thread Robert Burke
I'm happy to announce that we have unanimously approved this release.

There are 9 approving votes, 5 of which are binding:
* Jan Lukavský
* Chamikara Jayalath
* Valentyn Tymofieiev
* Robert Bradshaw
* Robert Burke

There are no disapproving votes.

Thanks everyone!
Robert Burke
Beam 2.54.0 Release Manager


Re: [VOTE] Release 2.54.0, release candidate #2

2024-02-14 Thread Robert Burke
And with that, we have sufficient conditions to declare that RC2 has met 
community approval. The vote is now closed.

Robert Burke
Beam 2.54.0 Release Manager

On 2024/02/14 17:20:46 Robert Bradshaw via dev wrote:
> +1 (binding)
> 
> We've done the validation we can for now, let's not hold up the
> release any longer.
> 
> (For those curious, there may be a brief period of time where Dataflow
> pipelines with 2.54 still default to Runner V1 in some regions as
> things roll out, but we expect this to be fully resolved next week.)
> 
> On Fri, Feb 9, 2024 at 6:28 PM Robert Burke  wrote:
> >
> > I can agree to that Robert Bradshaw. Thank you for letting the community 
> > know.
> >
> > (Disclaimer: I am on the Dataflow team myself, but do try to keep my hats 
> > separated when I'm release manager).
> >
> > It would be bad for Beam users who use Dataflow to try to use the release 
> > but be unaware of the switch. I'm in favour of the path of least user 
> > issues, wherever they're cropping up from.
> >
> > As a heads up, I'll likely not address this thread until Wednesday morning 
> > (or if there's a sooner update) as a result of this request.
> >
> > Separately:
> > + 1 (binding)
> >  I've done a few of the quickstarts from the validation sheets and updated 
> > my own Beam Go code. Other than a non-blocking update to the Go wordcount 
> > quickstart, I didn't run into any issues.
> >
> > Robert Burke
> > Beam 2.54.0 Release Manager
> >
> > On 2024/02/10 01:41:12 Robert Bradshaw via dev wrote:
> > > I validated that the release artifacts are all correct, tested some simple
> > > Python and Yaml pipelines. Everything is looking good so far.
> > >
> > > However, could I ask that you hold this vote open a little longer? We've
> > > got some Dataflow service side changes that relate to 2.54 being the first
> > > release where Runner v2 is the default for Java (big change on our side),
> > > and could use a bit of additional time to verify we have everything lined
> > > up correctly. We should be able to finish the validation early/mid next
> > > week at the latest.
> > >
> > > - Robert
> > >
> > >
> > > On Fri, Feb 9, 2024 at 2:57 PM Yi Hu via dev  wrote:
> > >
> > > > Also tested with GCP IO performance benchmark [1]. Passed other than
> > > > SpannerIO where the benchmark failed due to issues in the test suite 
> > > > itself
> > > > [2], not related to Beam.
> > > >
> > > > +1 but I had voted for another validation suite before for this RC
> > > >
> > > > [1]
> > > > https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/main/it/google-cloud-platform
> > > > [2] https://github.com/GoogleCloudPlatform/DataflowTemplates/issues/1326
> > > >
> > > > On Fri, Feb 9, 2024 at 9:43 AM Valentyn Tymofieiev via dev <
> > > > dev@beam.apache.org> wrote:
> > > >
> > > >> +1.
> > > >>
> > > >> Checked postcommit test results for Python SDK, and exercised a couple 
> > > >> of
> > > >> Dataflow scenarios.
> > > >>
> > > >> On Thu, Feb 8, 2024, 14:07 Svetak Sundhar via dev 
> > > >> wrote:
> > > >>
> > > >>> +1 (Non-Binding)
> > > >>>
> > > >>> Tested with Python SDK on DirectRunner and Dataflow Runner
> > > >>>
> > > >>>
> > > >>> Svetak Sundhar
> > > >>>
> > > >>>   Data Engineer
> > > >>> s vetaksund...@google.com
> > > >>>
> > > >>>
> > > >>>
> > > >>> On Thu, Feb 8, 2024 at 12:45 PM Chamikara Jayalath via dev <
> > > >>> dev@beam.apache.org> wrote:
> > > >>>
> > > >>>> +1 (binding)
> > > >>>>
> > > >>>> Tried out Java/Python multi-lang jobs and upgrading BQ/Kafka 
> > > >>>> transforms
> > > >>>> from 2.53.0 to 2.54.0 using the Transform Service.
> > > >>>>
> > > >>>> Thanks,
> > > >>>> Cham
> > > >>>>
> > > >>>> On Wed, Feb 7, 2024 at 5:52 PM XQ Hu via dev 
> > > >>>> wrote:
> > > >>>>
> > > >>>>> +1 (non-binding)
> > > >>>>>

Re: [VOTE] Vendored Dependencies Release

2024-02-14 Thread Robert Burke
+1 (binding)

On Wed, Feb 14, 2024, 7:35 AM Yi Hu via dev  wrote:

> +1 (non-binding)
>
> checked artifact packages not leaking namespace (or under
> org.apache.beam.vendor.grpc.v1p60p1) and the tests in
> https://github.com/apache/beam/pull/30212
>
>
>
>
> On Tue, Feb 13, 2024 at 4:29 AM Sam Whittle  wrote:
>
>> Hi,
>> Sorry I missed that close step. Done!
>> Sam
>>
>> On Mon, Feb 12, 2024 at 8:32 PM Yi Hu via dev 
>> wrote:
>>
>>> Hi,
>>>
>>> I am trying to open "
>>> https://repository.apache.org/content/repositories/orgapachebeam-1369/;
>>> but get "[id=orgapachebeam-1369] exists but is not exposed." It seems the
>>> staging repository needs to be closed to have it available to public: [1]
>>>
>>> [1]
>>> https://docs.google.com/document/d/1ztEoyGkqq9ie5riQxRtMuBu3vb6BUO91mSMn1PU0pDA/edit?disco=vHX80XE
>>>
>>> On Mon, Feb 12, 2024 at 1:44 PM Chamikara Jayalath via dev <
>>> dev@beam.apache.org> wrote:
>>>
 +1 (binding)

 Thanks,
 Cham

 On Fri, Feb 9, 2024 at 5:25 AM Sam Whittle 
 wrote:

> Please review the release of the following artifacts that we vendor,
> following the process [5]:
>
>  * beam-vendor-grpc-1-60-1:0.2
>
> Hi everyone,
>
> Please review and vote on the release candidate #1 for the version
> beam-vendor-grpc-1-60-1:0.2 as follows:
>
> [ ] +1, Approve the release
>
> [ ] -1, Do not approve the release (please provide specific comments)
>
>
> The complete staging area is available for your review, which includes:
>
> * the official Apache source release to be deployed to dist.apache.org
> [1], which is signed with the key with fingerprint FCFD152811BF1578 [2],
>
> * all artifacts to be deployed to the Maven Central Repository [3],
>
> * commit hash "2d08b32e674a1046ba7be0ae5f1e4b7b05b73488" [4].
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> Thanks,
>
> Sam
>
> [1] https://dist.apache.org/repos/dist/dev/beam/vendor/
>
> [2] https://dist.apache.org/repos/dist/release/beam/KEYS
>
> [3]
> https://repository.apache.org/content/repositories/orgapachebeam-1369/
>
> [4]
> https://github.com/apache/beam/commit/2d08b32e674a1046ba7be0ae5f1e4b7b05b73488
>
> [5] https://s.apache.org/beam-release-vendored-artifacts
>



Re: [VOTE] Release 2.54.0, release candidate #2

2024-02-09 Thread Robert Burke
I can agree to that Robert Bradshaw. Thank you for letting the community know.

(Disclaimer: I am on the Dataflow team myself, but do try to keep my hats 
separated when I'm release manager).

It would be bad for Beam users who use Dataflow to try to use the release but 
be unaware of the switch. I'm in favour of the path of least user issues, 
wherever they're cropping up from.

As a heads up, I'll likely not address this thread until Wednesday morning (or 
if there's a sooner update) as a result of this request.

Separately:
+ 1 (binding)
 I've done a few of the quickstarts from the validation sheets and updated my 
own Beam Go code. Other than a non-blocking update to the Go wordcount 
quickstart, I didn't run into any issues.

Robert Burke
Beam 2.54.0 Release Manager

On 2024/02/10 01:41:12 Robert Bradshaw via dev wrote:
> I validated that the release artifacts are all correct, tested some simple
> Python and Yaml pipelines. Everything is looking good so far.
> 
> However, could I ask that you hold this vote open a little longer? We've
> got some Dataflow service side changes that relate to 2.54 being the first
> release where Runner v2 is the default for Java (big change on our side),
> and could use a bit of additional time to verify we have everything lined
> up correctly. We should be able to finish the validation early/mid next
> week at the latest.
> 
> - Robert
> 
> 
> On Fri, Feb 9, 2024 at 2:57 PM Yi Hu via dev  wrote:
> 
> > Also tested with GCP IO performance benchmark [1]. Passed other than
> > SpannerIO where the benchmark failed due to issues in the test suite itself
> > [2], not related to Beam.
> >
> > +1 but I had voted for another validation suite before for this RC
> >
> > [1]
> > https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/main/it/google-cloud-platform
> > [2] https://github.com/GoogleCloudPlatform/DataflowTemplates/issues/1326
> >
> > On Fri, Feb 9, 2024 at 9:43 AM Valentyn Tymofieiev via dev <
> > dev@beam.apache.org> wrote:
> >
> >> +1.
> >>
> >> Checked postcommit test results for Python SDK, and exercised a couple of
> >> Dataflow scenarios.
> >>
> >> On Thu, Feb 8, 2024, 14:07 Svetak Sundhar via dev 
> >> wrote:
> >>
> >>> +1 (Non-Binding)
> >>>
> >>> Tested with Python SDK on DirectRunner and Dataflow Runner
> >>>
> >>>
> >>> Svetak Sundhar
> >>>
> >>>   Data Engineer
> >>> s vetaksund...@google.com
> >>>
> >>>
> >>>
> >>> On Thu, Feb 8, 2024 at 12:45 PM Chamikara Jayalath via dev <
> >>> dev@beam.apache.org> wrote:
> >>>
> >>>> +1 (binding)
> >>>>
> >>>> Tried out Java/Python multi-lang jobs and upgrading BQ/Kafka transforms
> >>>> from 2.53.0 to 2.54.0 using the Transform Service.
> >>>>
> >>>> Thanks,
> >>>> Cham
> >>>>
> >>>> On Wed, Feb 7, 2024 at 5:52 PM XQ Hu via dev 
> >>>> wrote:
> >>>>
> >>>>> +1 (non-binding)
> >>>>>
> >>>>> Validated with a simple RunInference Python pipeline:
> >>>>> https://github.com/google/dataflow-ml-starter/actions/runs/7821639833/job/21339032997
> >>>>>
> >>>>> On Wed, Feb 7, 2024 at 7:10 PM Yi Hu via dev 
> >>>>> wrote:
> >>>>>
> >>>>>> +1 (non-binding)
> >>>>>>
> >>>>>> Validated with Dataflow Template:
> >>>>>> https://github.com/GoogleCloudPlatform/DataflowTemplates/pull/1317
> >>>>>>
> >>>>>> Regards,
> >>>>>>
> >>>>>> On Wed, Feb 7, 2024 at 11:18 AM Ritesh Ghorse via dev <
> >>>>>> dev@beam.apache.org> wrote:
> >>>>>>
> >>>>>>> +1 (non-binding)
> >>>>>>>
> >>>>>>> Ran a few batch and streaming examples for Python SDK on Dataflow
> >>>>>>> Runner
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>> On Wed, Feb 7, 2024 at 4:08 AM Jan Lukavský  wrote:
> >>>>>>>
> >>>>>>>> +1 (binding)
> >>>>>>>>
> >>>>>>>> Validated Java SDK with Flink runner.
> >>>>>>>>
> >>>>>>>>  Jan

Re: Playground: File Explorer?

2024-02-08 Thread Robert Burke
I think in principle we could update the release process to do that, but it
would require adjusting how we build the staging version of the playground
to accommodate how each SDK handles RCs.

At present it's very geared towards building from released versions.

On Thu, Feb 8, 2024, 9:18 AM Joey Tran  wrote:

> Ah that makes sense. Does the new version of Playground get staged for
> release validation?
>
> On Thu, Feb 8, 2024 at 12:08 PM Robert Burke  wrote:
>
>> We redeploy the playground along with the release, so once 2.54.0 RC2 has
>> been validated and voted on, I'll be redeploying it with 2.54.0.
>>
>> On Thu, Feb 8, 2024, 7:18 AM Joey Tran  wrote:
>>
>>> Here's two:
>>>
>>> https://play.beam.apache.org/?path=SDK_PYTHON_MultipleOutputPardo=python
>>> https://play.beam.apache.org/?path=SDK_PYTHON_WordCount=python
>>>
>>> Also, how often does playground get redeployed? I put up a PR[1] that's
>>> been merged to try to reduce the amount of logging these examples produce
>>> and I'm not sure if it's not working or if playground just hasn't been
>>> redeployed in the last month or so
>>>
>>> [1] https://github.com/apache/beam/pull/29948
>>>
>>>
>>> On Thu, Feb 8, 2024 at 10:12 AM XQ Hu via dev 
>>> wrote:
>>>
>>>> Can you provide which example you are referring to? I checked a few
>>>> examples and usually we use beam.Map(print) to display some output values.
>>>>
>>>> On Wed, Feb 7, 2024 at 8:55 PM Joey Tran 
>>>> wrote:
>>>>
>>>>> Hey all,
>>>>>
>>>>> I've been really trying to use Playground for educating new Beam users
>>>>> but it feels like there's something missing. A lot of examples (e.g.
>>>>> Multiple ParDo Outputs) for at least the python API don't seem to do
>>>>> anything observable. For example, the Multiple ParDo Outputs example 
>>>>> writes
>>>>> to a file but is there any way to actually look at written out files? I
>>>>> feel like maybe I'm missing something.
>>>>>
>>>>> Best,
>>>>> Joey
>>>>>
>>>>


Re: [DESIGN PROPOSAL] Reshuffle Allowing Duplicates

2024-02-08 Thread Robert Burke
Was that only October? Wow.

Option 2 SGTM, with the adjustment to making the core of the URN
"redistribute_allowing_duplicates" instead of building from the unspecified
Reshuffle semantics.

Transforms getting updated to use the new transform can have their
@RequiresStableInputs annotation added  accordingly if they need that
property per previous discussions.



On Thu, Feb 8, 2024, 10:31 AM Kenneth Knowles  wrote:

>
>
> On Wed, Feb 7, 2024 at 5:15 PM Robert Burke  wrote:
>
>> OK, so my stance is a configurable Reshuffle might be interesting, so my
>> vote is +1, along the following lines.
>>
>> 1. Use a new URN (beam:transform:reshuffle:v2) and attach a new
>> ReshufflePayload to it.
>>
>
> Ah, I see there's more than one variation of the "new URN" approach.
> Namely, you have a new version of an existing URN prefix, while I had in
> mind that it was a totally new base URN. In other words the open question I
> meant to pose is between these options:
>
> 1. beam:transform:reshuffle:v2 + { allowing_duplicates: true }
> 2. beam:transform:reshuffle_allowing_duplicates:v1 {}
>
> The most compelling argument in favor of option 2 is that it could have a
> distinct payload type associated with the different URN (maybe parameters
> around tweaking how much duplication? I don't know... I actually expect
> neither payload to evolve much if at all).
>
> There were also two comments in favor of option 2 on the design doc.
>
>   -> Unknown "urns for composite transforms" already default to the
>> subtransform graph implementation for most (all?) runners.
>>   -> Having a payload to toggle this behavior then can have whatever
>> desired behavior we like. It also allows for additional configurations
>> added in later on. This is preferable to a plethora of one-off urns IMHO.
>> We can have SDKs gate configuration combinations as needed if additional
>> ones appear.
>>
>> 2. It's very cheap to add but also ignore, as the default is "Do what
>> we're already doing without change", and not all SDKs need to add it right
>> away. It's more important that the portable way is defined at least, so
>> it's easy for other SDKs to add and handle it.
>>
>> I would prefer we have a clear starting point on what Reshuffle does
>> though. I remain a fan of "The Reshuffle (v2) Transform is a user
>> designated hint to a runner for a change in parallelism. By default, it
>> produces an output PCollection that has the same elements as the input
>> PCollection".
>>
>
> +1 this is a better phrasing of the spec I propose in
> https://s.apache.org/beam-redistribute but let's not get into it here if
> we can, and just evaluate the delta from that design to
> https://s.apache.org/beam-reshuffle-allowing-duplicates
>
> Kenn
>
>
>> It remains an open question about what that means for
>> checkpointing/durability behavior, but that's largely been runner dependent
>> anyway. I admit the above definition is biased by the uses of Reshuffle I'm
>> aware of, which largely are to incur a fusion break in the execution graph.
>>
>> Robert Burke
>> Beam Go Busybody
>>
>> On 2024/01/31 16:01:33 Kenneth Knowles wrote:
>> > On Wed, Jan 31, 2024 at 4:21 AM Jan Lukavský  wrote:
>> >
>> > > Hi,
>> > >
>> > > if I understand this proposal correctly, the motivation is actually
>> > > reducing latency by bypassing bundle atomic guarantees, bundles after
>> "at
>> > > least once" Reshuffle would be reconstructed independently of the
>> > > pre-shuffle bundling. Provided this is correct, it seems that the
>> behavior
>> > > is slightly more general than for the case of Reshuffle. We have
>> already
>> > > some transforms that manipulate a specific property of a PCollection
>> - if
>> > > it may or might not contain duplicates. That is manipulated in two
>> ways -
>> > > explicitly removing duplicates based on IDs on sources that generate
>> > > duplicates and using @RequiresStableInput, mostly in sinks. These
>> > > techniques modify an inherent property of a PCollection, that is if it
>> > > contains or does not contain possible duplicates originating from the
>> same
>> > > input element.
>> > >
>> > > There are two types of duplicates - duplicate elements in _different
>> > > bundles_ (typically from at-least-once sources) and duplicates
>> arising due
>> > > to bundle reprocessing (affecting only transforms with side-effects,
>> that

Re: Playground: File Explorer?

2024-02-08 Thread Robert Burke
We redeploy the playground along with the release, so once 2.54.0 RC2 has
been validated and voted on, I'll be redeploying it with 2.54.0.

On Thu, Feb 8, 2024, 7:18 AM Joey Tran  wrote:

> Here's two:
>
> https://play.beam.apache.org/?path=SDK_PYTHON_MultipleOutputPardo=python
> https://play.beam.apache.org/?path=SDK_PYTHON_WordCount=python
>
> Also, how often does playground get redeployed? I put up a PR[1] that's
> been merged to try to reduce the amount of logging these examples produce
> and I'm not sure if it's not working or if playground just hasn't been
> redeployed in the last month or so
>
> [1] https://github.com/apache/beam/pull/29948
>
>
> On Thu, Feb 8, 2024 at 10:12 AM XQ Hu via dev  wrote:
>
>> Can you provide which example you are referring to? I checked a few
>> examples and usually we use beam.Map(print) to display some output values.
>>
>> On Wed, Feb 7, 2024 at 8:55 PM Joey Tran 
>> wrote:
>>
>>> Hey all,
>>>
>>> I've been really trying to use Playground for educating new Beam users
>>> but it feels like there's something missing. A lot of examples (e.g.
>>> Multiple ParDo Outputs) for at least the python API don't seem to do
>>> anything observable. For example, the Multiple ParDo Outputs example writes
>>> to a file but is there any way to actually look at written out files? I
>>> feel like maybe I'm missing something.
>>>
>>> Best,
>>> Joey
>>>
>>


Re: [DESIGN PROPOSAL] Reshuffle Allowing Duplicates

2024-02-07 Thread Robert Burke
OK, so my stance is a configurable Reshuffle might be interesting, so my vote 
is +1, along the following lines.

1. Use a new URN (beam:transform:reshuffle:v2) and attach a new 
ReshufflePayload to it.
  -> Unknown "urns for composite transforms" already default to the 
subtransform graph implementation for most (all?) runners.
  -> Having a payload to toggle this behavior then can have whatever desired 
behavior we like. It also allows for additional configurations added in later 
on. This is preferable to a plethora of one-off urns IMHO. We can have SDKs 
gate configuration combinations as needed if additional ones appear.
 
2. It's very cheap to add but also ignore, as the default is "Do what we're 
already doing without change", and not all SDKs need to add it right away. It's 
more important that the portable way is defined at least, so it's easy for 
other SDKs to add and handle it.

I would prefer we have a clear starting point on what Reshuffle does though. I 
remain a fan of "The Reshuffle (v2) Transform is a user designated hint to a 
runner for a change in parallelism. By default, it produces an output 
PCollection that has the same elements as the input PCollection".

It remains an open question about what that means for checkpointing/durability 
behavior, but that's largely been runner dependent anyway. I admit the above 
definition is biased by the uses of Reshuffle I'm aware of, which largely are 
to incur a fusion break in the execution graph.

Robert Burke
Beam Go Busybody

On 2024/01/31 16:01:33 Kenneth Knowles wrote:
> On Wed, Jan 31, 2024 at 4:21 AM Jan Lukavský  wrote:
> 
> > Hi,
> >
> > if I understand this proposal correctly, the motivation is actually
> > reducing latency by bypassing bundle atomic guarantees, bundles after "at
> > least once" Reshuffle would be reconstructed independently of the
> > pre-shuffle bundling. Provided this is correct, it seems that the behavior
> > is slightly more general than for the case of Reshuffle. We have already
> > some transforms that manipulate a specific property of a PCollection - if
> > it may or might not contain duplicates. That is manipulated in two ways -
> > explicitly removing duplicates based on IDs on sources that generate
> > duplicates and using @RequiresStableInput, mostly in sinks. These
> > techniques modify an inherent property of a PCollection, that is if it
> > contains or does not contain possible duplicates originating from the same
> > input element.
> >
> > There are two types of duplicates - duplicate elements in _different
> > bundles_ (typically from at-least-once sources) and duplicates arising due
> > to bundle reprocessing (affecting only transforms with side-effects, that
> > is what we solve by @RequiresStableInput). The point I'm trying to get to -
> > should we add these properties to PCollections (contains cross-bundle
> > duplicates vs. does not) and PTransforms ("outputs deduplicated elements"
> > and "requires stable input")? That would allow us to analyze the Pipeline
> > DAG and provide appropriate implementation for Reshuffle automatically, so
> > that a new URN or flag would not be needed. Moreover, this might be useful
> > for a broader range of optimizations.
> >
> > WDYT?
> >
> These are interesting ideas that could be useful. I think they achieve a
> different goal in my case. I actually want to explicitly allow
> Reshuffle.allowingDuplicates() to skip expensive parts of its
> implementation that are used to prevent duplicates.
> 
> The property that would make it possible to automate this in the case of
> combiners, or at least validate that the pipeline still gives 100% accurate
> answers, would be something like @InsensitiveToDuplicateElements which is
> longer and less esoteric than @Idempotent. For situations where there is a
> source or sink that only has at-least-once guarantees then yea maybe the
> property "has duplicates" will let you know that you may as well use the
> duplicating reshuffle without any loss. But still, you may not want to
> introduce *more* duplicates.
> 
> I would say my proposal is a step in this direction that would gain some
> experience and tools that we might later use in a more automated way.
> 
> Kenn
> 
> >  Jan
> > On 1/30/24 23:22, Robert Burke wrote:
> >
> > Is the benefit of this proposal just the bounded deviation from the
> > existing reshuffle?
> >
> > Reshuffle is already rather dictated by arbitrary runner choice, from
> > simply ignoring the node, to forcing a materialization break, to a full
> > shuffle implementation which has additional side effects.
> >
> > But model wi
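
Expressed as plain data (dicts standing in for FunctionSpec protos; the URNs
are the ones from the thread, while the payload field name is illustrative),
the two options being weighed look like this:

```python
def reshuffle_v2(allowing_duplicates=False):
    # Option 1: a single new URN, with behavior toggled by a payload field.
    # Extra configuration can be added to the payload later without new URNs.
    return {
        "urn": "beam:transform:reshuffle:v2",
        "payload": {"allowing_duplicates": allowing_duplicates},
    }

def reshuffle_allowing_duplicates():
    # Option 2: a dedicated URN per behavior, with an (initially empty)
    # payload that could evolve independently.
    return {
        "urn": "beam:transform:reshuffle_allowing_duplicates:v1",
        "payload": {},
    }
```

Runners that do not recognize either URN would fall back to expanding the
composite's subtransform graph, which is what makes the addition cheap to
ignore.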

[VOTE] Release 2.54.0, release candidate #2

2024-02-06 Thread Robert Burke
Hi everyone,
Please review and vote on the release candidate #2 for the version 2.54.0,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)


Reviewers are encouraged to test their own use cases with the release
candidate, and vote +1 if
no issues are found. Only PMC member votes will count towards the final
vote, but votes from all
community members are encouraged and helpful for finding regressions; you
can either test your own
use cases [13] or use cases from the validation sheet [10].

The complete staging area is available for your review, which includes:
* GitHub Release notes [1],
* the official Apache source release to be deployed to dist.apache.org [2],
which is signed with the key with fingerprint D20316F712213422 [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "v2.54.0-RC2" [5],
* website pull request listing the release [6], the blog post [6], and
publishing the API reference manual [7].
* Python artifacts are deployed along with the source release to the
dist.apache.org [2] and PyPI[8].
* Go artifacts and documentation are available at pkg.go.dev [9]
* Validation sheet with a tab for 2.54.0 release to help with validation
[10].
* Docker images published to Docker Hub [11].
* PR to run tests against release branch [12].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

For guidelines on how to try the release in your projects, check out our RC
testing guide [13].

Thanks,
Robert Burke
Beam 2.54.0 Release Manager

[1] https://github.com/apache/beam/milestone/18?closed=1
[2] https://dist.apache.org/repos/dist/dev/beam/2.54.0/
[3] https://dist.apache.org/repos/dist/release/beam/KEYS
[4] https://repository.apache.org/content/repositories/orgapachebeam-1368/
[5] https://github.com/apache/beam/tree/v2.54.0-RC2
[6] https://github.com/apache/beam/pull/30201
[7] https://github.com/apache/beam-site/pull/659
[8] https://pypi.org/project/apache-beam/2.54.0rc2/
[9]
https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.54.0-RC2/go/pkg/beam
[10]
https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=28763708
[11] https://hub.docker.com/search?q=apache%2Fbeam=image
[12] https://github.com/apache/beam/pull/30104
[13]
https://github.com/apache/beam/blob/master/contributor-docs/rc-testing-guide.md


[VOTE] Release 2.54.0, release candidate #2

2024-02-06 Thread Robert Burke via dev
Hi everyone,
Please review and vote on the release candidate #2 for the version 2.54.0,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)


Reviewers are encouraged to test their own use cases with the release
candidate, and vote +1 if
no issues are found. Only PMC member votes will count towards the final
vote, but votes from all
community members are encouraged and helpful for finding regressions; you
can either test your own
use cases [13] or use cases from the validation sheet [10].

The complete staging area is available for your review, which includes:
* GitHub Release notes [1],
* the official Apache source release to be deployed to dist.apache.org [2],
which is signed with the key with fingerprint D20316F712213422 [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "v2.54.0-RC2" [5],
* website pull request listing the release [6], the blog post [6], and
publishing the API reference manual [7].
* Python artifacts are deployed along with the source release to the
dist.apache.org [2] and PyPI[8].
* Go artifacts and documentation are available at pkg.go.dev [9]
* Validation sheet with a tab for 2.54.0 release to help with validation
[10].
* Docker images published to Docker Hub [11].
* PR to run tests against release branch [12].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

For guidelines on how to try the release in your projects, check out our RC
testing guide [13].

Thanks,
Robert Burke
Beam 2.54.0 Release Manager

[1] https://github.com/apache/beam/milestone/18?closed=1
[2] https://dist.apache.org/repos/dist/dev/beam/2.54.0/
[3] https://dist.apache.org/repos/dist/release/beam/KEYS
[4] https://repository.apache.org/content/repositories/orgapachebeam-1368/
[5] https://github.com/apache/beam/tree/v2.54.0-RC2
[6] https://github.com/apache/beam/pull/30201
[7] https://github.com/apache/beam-site/pull/659
[8] https://pypi.org/project/apache-beam/2.54.0rc2/
[9]
https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.54.0-RC2/go/pkg/beam
[10]
https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=28763708
[11] https://hub.docker.com/search?q=apache%2Fbeam&type=image
[12] https://github.com/apache/beam/pull/30104
[13]
https://github.com/apache/beam/blob/master/contributor-docs/rc-testing-guide.md


Re: [VOTE] Release 2.54.0, release candidate #1

2024-02-06 Thread Robert Burke
The release branch nearly has all the suites passing!

I don't presently consider the failing ones blocking as they have been perma-red 
for at least 2 releases already (using just the number to refer to the tracking 
issues).

PostCommit XVR GoUsingJava Dataflow - Has never been successful, using #28339 
to track.

PostCommit Java Sickbay - Has never been successful, filed #30236 to track.

PostCommit TransformService Direct - Last succeeded 4 months ago, filed #30238 
to track.

PostCommit XVR Direct - Last succeeded 3 months ago, using #28972 to track.

PostCommit Java IO Performance Tests - Last succeeded 5 months ago, using #28330 
to track.

PostCommit Website Test - Can't interpret the results. It appears like it 
passed, but then there are failures I can't find. Will be tracked in the 
release blog PR. #30201

I'll be addressing the Website Test failures since it'd be nice to have a 
working website when we publish the release blog, and likely digging into the 
current GoUsingJava suite issues.

Absolute last call for cherry picks for RC2. I will be doing my final cleanup 
and then start the RC build in an hour or two.

Robert Burke
Beam 2.54.0 Release Manager

On 2024/02/05 19:42:32 Robert Burke wrote:
> I think that's serious enough to warrant another release candidate. However, 
> please do continue validation so we can reduce iteration cycle time.
> 
> Currently resolving [6] is the blocker for getting RC2 built, but if it's 
> sorted out sooner, I'm putting a deadline of Noon PST on Feb 6th for any 
> other small cherry picks. The following are ones that have been already 
> requested, and will be cherry picked this afternoon.
> 
> https://github.com/apache/beam/pull/30148 (resolves a caching error race 
> condition on a non-volatile flag field in the non-portable Dataflow Java 
> streaming harness. )
> 
> https://github.com/apache/beam/pull/30156 (resolves an issue with long 
> running python streaming jobs where an ID collision may occur).
> 
> Please reply to this thread for other cherry picks. It's not a guarantee that 
> the cherry pick will occur, but I can't evaluate it without knowing about it.
> 
> Robert Burke
> Beam 2.54.0 Release Manager
> 
> 
> 
> 
> 
> On 2024/02/05 03:31:36 Yi Hu via dev wrote:
> > Thanks for taking care of the release process! After validation, two
> > breaking changes were found:
> > (1) Python Xlang Gcp Direct and Python Xlang Gcp Dataflow PostCommit tests
> > [1, 2]. It affects Python xlang BigQueryIO write (STORAGE_WRITE_API mode)
> > configuration. Filed [3] and pull request for cherry pick [4].
> > (2) Validation on Dataflow Template [5] found that the OutputReceiver
> > interface requires a new method (`outputWindowedValue`) to be implemented.
> > Filed [6] to determine whether this is a release blocker, and I am working
> > on a PR for a fix soon.
> > 
> > That said, I am -1 on this vote.
> > 
> > [1]
> > https://github.com/apache/beam/actions/runs/7647377867/job/20838203146?pr=30104
> > [2]
> > https://github.com/apache/beam/actions/runs/7647377805/job/20985387650?pr=30104
> > [3] https://github.com/apache/beam/issues/30159
> > [4] https://github.com/apache/beam/pull/30189
> > [5]
> > https://github.com/GoogleCloudPlatform/DataflowTemplates/actions/runs/7762162776/job/21172087527
> > [6] https://github.com/apache/beam/issues/30203
> > 
> > On Fri, Feb 2, 2024 at 6:01 PM XQ Hu via dev  wrote:
> > 
> > > +1 validated by running the simple RunInference ML pipeline:
> > > https://github.com/google/dataflow-ml-starter/actions/runs/7761835540/job/21171080332
> > >
> > > On Fri, Feb 2, 2024 at 4:10 PM Robert Burke  wrote:
> > >
> > >> Hi everyone,
> > >> Please review and vote on the release candidate #1 for the version
> > >> 2.54.0, as follows:
> > >> [ ] +1, Approve the release
> > >> [ ] -1, Do not approve the release (please provide specific comments)
> > >>
> > >>
> > >> Reviewers are encouraged to test their own use cases with the release
> > >> candidate, and vote +1 if
> > >> no issues are found. Only PMC member votes will count towards the final
> > >> vote, but votes from all
> > >> community members are encouraged and helpful for finding regressions; you
> > >> can either test your own
> > >> use cases [13] or use cases from the validation sheet [10].
> > >>
> > >> The complete staging area is available for your review, which includes:
> > >> * GitHub Release notes [1],
> > >> * the official Apache source release to be deployed to dist.apache.org
> > &

Re: [VOTE] Release 2.54.0, release candidate #1

2024-02-05 Thread Robert Burke
I think that's serious enough to warrant another release candidate. However, 
please do continue validation so we can reduce iteration cycle time.

Currently resolving [6] is the blocker for getting RC2 built, but if it's 
sorted out sooner, I'm putting a deadline of Noon PST on Feb 6th for any other 
small cherry picks. The following are ones that have been already requested, 
and will be cherry picked this afternoon.

https://github.com/apache/beam/pull/30148 (resolves a caching error race 
condition on a non-volatile flag field in the non-portable Dataflow Java 
streaming harness. )

https://github.com/apache/beam/pull/30156 (resolves an issue with long running 
python streaming jobs where an ID collision may occur).

Please reply to this thread for other cherry picks. It's not a guarantee that 
the cherry pick will occur, but I can't evaluate it without knowing about it.

Robert Burke
Beam 2.54.0 Release Manager





On 2024/02/05 03:31:36 Yi Hu via dev wrote:
> Thanks for taking care of the release process! After validation, two
> breaking changes were found:
> (1) Python Xlang Gcp Direct and Python Xlang Gcp Dataflow PostCommit tests
> [1, 2]. It affects Python xlang BigQueryIO write (STORAGE_WRITE_API mode)
> configuration. Filed [3] and pull request for cherry pick [4].
> (2) Validation on Dataflow Template [5] found that the OutputReceiver
> interface requires a new method (`outputWindowedValue`) to be implemented.
> Filed [6] to determine whether this is a release blocker, and I am working
> on a PR for a fix soon.
> 
> That said, I am -1 on this vote.
> 
> [1]
> https://github.com/apache/beam/actions/runs/7647377867/job/20838203146?pr=30104
> [2]
> https://github.com/apache/beam/actions/runs/7647377805/job/20985387650?pr=30104
> [3] https://github.com/apache/beam/issues/30159
> [4] https://github.com/apache/beam/pull/30189
> [5]
> https://github.com/GoogleCloudPlatform/DataflowTemplates/actions/runs/7762162776/job/21172087527
> [6] https://github.com/apache/beam/issues/30203
> 
> On Fri, Feb 2, 2024 at 6:01 PM XQ Hu via dev  wrote:
> 
> > +1 validated by running the simple RunInference ML pipeline:
> > https://github.com/google/dataflow-ml-starter/actions/runs/7761835540/job/21171080332
> >
> > On Fri, Feb 2, 2024 at 4:10 PM Robert Burke  wrote:
> >
> >> Hi everyone,
> >> Please review and vote on the release candidate #1 for the version
> >> 2.54.0, as follows:
> >> [ ] +1, Approve the release
> >> [ ] -1, Do not approve the release (please provide specific comments)
> >>
> >>
> >> Reviewers are encouraged to test their own use cases with the release
> >> candidate, and vote +1 if
> >> no issues are found. Only PMC member votes will count towards the final
> >> vote, but votes from all
> >> community members are encouraged and helpful for finding regressions; you
> >> can either test your own
> >> use cases [13] or use cases from the validation sheet [10].
> >>
> >> The complete staging area is available for your review, which includes:
> >> * GitHub Release notes [1],
> >> * the official Apache source release to be deployed to dist.apache.org
> >> [2], which is signed with the key with fingerprint D20316F712213422 [3],
> >> * all artifacts to be deployed to the Maven Central Repository [4],
> >> * source code tag "v2.54.0-RC1" [5],
> >> * website pull request listing the release [6], the blog post [6], and
> >> publishing the API reference manual [7].
> >> * Python artifacts are deployed along with the source release to the
> >> dist.apache.org [2] and PyPI[8].
> >> * Go artifacts and documentation are available at pkg.go.dev [9]
> >> * Validation sheet with a tab for 2.54.0 release to help with validation
> >> [10].
> >> * Docker images published to Docker Hub [11].
> >> * PR to run tests against release branch [12].
> >>   * Legacy Dataflow Java Worker image has been published, so the failing
> >> tests are being re-run.
> >>
> >> The vote will be open for at least 72 hours. It is adopted by majority
> >> approval, with at least 3 PMC affirmative votes.
> >>
> >> For guidelines on how to try the release in your projects, check out our
> >> RC testing guide [13].
> >>
> >> Thanks,
> >> Robert Burke
> >> Beam 2.54.0 Release Manager
> >>
> >> [1] https://github.com/apache/beam/milestone/18?closed=1
> >> [2] https://dist.apache.org/repos/dist/dev/beam/2.54.0/
> >> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> >> [4]
> >> https://repository.apache.or

[VOTE] Release 2.54.0, release candidate #1

2024-02-02 Thread Robert Burke
Hi everyone,
Please review and vote on the release candidate #1 for the version 2.54.0,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)


Reviewers are encouraged to test their own use cases with the release
candidate, and vote +1 if
no issues are found. Only PMC member votes will count towards the final
vote, but votes from all
community members are encouraged and helpful for finding regressions; you
can either test your own
use cases [13] or use cases from the validation sheet [10].

The complete staging area is available for your review, which includes:
* GitHub Release notes [1],
* the official Apache source release to be deployed to dist.apache.org [2],
which is signed with the key with fingerprint D20316F712213422 [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "v2.54.0-RC1" [5],
* website pull request listing the release [6], the blog post [6], and
publishing the API reference manual [7].
* Python artifacts are deployed along with the source release to the
dist.apache.org [2] and PyPI[8].
* Go artifacts and documentation are available at pkg.go.dev [9]
* Validation sheet with a tab for 2.54.0 release to help with validation
[10].
* Docker images published to Docker Hub [11].
* PR to run tests against release branch [12].
  * Legacy Dataflow Java Worker image has been published, so the failing
tests are being re-run.

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

For guidelines on how to try the release in your projects, check out our RC
testing guide [13].

Thanks,
Robert Burke
Beam 2.54.0 Release Manager

[1] https://github.com/apache/beam/milestone/18?closed=1
[2] https://dist.apache.org/repos/dist/dev/beam/2.54.0/
[3] https://dist.apache.org/repos/dist/release/beam/KEYS
[4] https://repository.apache.org/content/repositories/orgapachebeam-1367/
[5] https://github.com/apache/beam/tree/v2.54.0-RC1
[6] https://github.com/apache/beam/pull/30201
[7] https://github.com/apache/beam-site/pull/658
[8] https://pypi.org/project/apache-beam/2.54.0rc1/
[9]
https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.54.0-RC1/go/pkg/beam
[10]
https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=28763708
[11] https://hub.docker.com/search?q=apache%2Fbeam&type=image
[12] https://github.com/apache/beam/pull/30104
[13]
https://github.com/apache/beam/blob/master/contributor-docs/rc-testing-guide.md


Re: [DESIGN PROPOSAL] Reshuffle Allowing Duplicates

2024-01-30 Thread Robert Burke
Is the benefit of this proposal just the bounded deviation from the
existing reshuffle?

Reshuffle is already rather dictated by arbitrary runner choice, from
simply ignoring the node, to forcing a materialization break, to a full
shuffle implementation which has additional side effects.

But model wise I don't believe it guarantees specific checkpointing or
re-execution behavior as currently specified. The proto only says it
represents the operation (without specifying the behavior, that is a big
problem).

I guess my concern here is that it implies/codifies that the existing
reshuffle has more behavior than it promises outside of the Java SDK.

"Allowing duplicates" WRT reshuffle is tricky. It feels like it mostly allows
an implementation where the inputs into the reshuffle might be re-executed,
for example. But that's always at the runner's discretion, and ultimately it
could also prevent even getting the intended benefit of a reshuffle
(notionally, just a fusion break).

Is there even a valid way to implement the notion of a reshuffle that leads
to duplicates outside of a retry/resilience case?

---

To be clear, I'm not against the proposal. I'm against it being built
on a non-existent foundation. If the behavior isn't already defined, it's
impossible to specify a real deviation from it.

I'm all for more specific behaviors if it means we actually clarify what the
original version is in the protos, since it's news to me (just now, because
I looked) that the Java reshuffle promises GBK-like side effects. But
that's a long-deprecated transform without a satisfying replacement for
its usage, so it may be moot.

Robert Burke



On Tue, Jan 30, 2024, 1:34 PM Kenneth Knowles  wrote:

> Hi all,
>
> Just when you thought I had squeezed all the possible interest out of this
> most boring-seeming of transforms :-)
>
> I wrote up a very quick proposal as a doc [1]. It is short enough that I
> will also put the main idea and main question in this email so you can
> quickly read it. Best to put comments in the doc.
>
> Main idea: add a variation of Reshuffle that allows duplicates, aka "at
> least once", so that users and runners can benefit from efficiency if it is
> possible
>
> Main question: is it best as a parameter to existing reshuffle transforms
> or as new URN(s)? I have proposed it as a parameter but I think either one
> could work.
>
> I would love feedback on the main idea, main question, or anywhere on the
> doc.
>
> Thanks!
>
> Kenn
>
> [1] https://s.apache.org/beam-reshuffle-allowing-duplicates
>
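For readers following along: the classic Reshuffle expansion the thread is discussing (the behavior an "allowing duplicates" mode would relax) is, roughly, attach a random key, group by key, drop the key. A minimal sketch outside Beam follows — plain Python with illustrative names, not the SDK implementation:

```python
import random
from collections import defaultdict

def reshuffle(elements, num_keys=16):
    """Sketch of Reshuffle's classic expansion: random key -> GroupByKey -> drop key.

    The grouping step is where a runner may materialize/checkpoint the data
    (preventing duplicates on retry); a relaxed "allow duplicates" mode could
    drop that guarantee and permit upstream re-execution.
    """
    keyed = [(random.randrange(num_keys), e) for e in elements]
    groups = defaultdict(list)  # stands in for the GroupByKey
    for key, value in keyed:
        groups[key].append(value)
    return [v for values in groups.values() for v in values]
```

In the happy path the output is just a permutation of the input; the two variants differ only in what the runner promises about the grouping step under retries.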


Re: [Release 2.54.0] Release Branch has been Cut!

2024-01-29 Thread Robert Burke
Hello Beam Devs!

Branch stabilization is in progress, but I think we're almost at container 
builds later today.

The validation PR is https://github.com/apache/beam/pull/30104

Two outstanding issues on the milestone:

* Various existing Python PostCommit Flakes: 
https://github.com/apache/beam/issues/29214
 (arbitrary Python versions; some validation PR failures are due to Dataflow 
congestion)

* Race condition in Dataflow sampler: 
https://github.com/apache/beam/issues/29987

The latter is waiting on a PR that will be cherry picked into the release.

Most Dataflow Java-based suites can't pass until the Google-internal containers 
have been generated (the ones that do pass at this stage tend to validate the 
container, and build their own).

Thank you for your patience and cooperation.

Robert Burke
Beam 2.54.0 Release Manager


On 2024/01/24 22:55:32 Robert Burke wrote:
> Hello Beam Devs!
> 
> The 2.54.0 release branch has been cut [0]!
> 
> There are 5 outstanding issues to be triaged for 2.54.0 in the release
> milestones [1]
> 
> They are presently the following issues:
> 
> https://github.com/apache/beam/issues/25590
> 
> https://github.com/apache/beam/issues/29214
> 
> https://github.com/apache/beam/issues/29987
> 
> https://github.com/apache/beam/pull/29834
> 
> https://github.com/apache/beam/issues/30095
> 
> I'll be going through these and determining if they are release blocking,
> per the release guide [2]. As I stabilize and verify the release branch, I
> may file additional issues to be resolved before we can cut an RC1. If so,
> I'll be adding them to this thread.
> 
> Thank you very much for your cooperation and support.
> 
> Robert Burke
> Your friendly neighbourhood Beam 2.54.0 release manager
> 
> [0] https://github.com/apache/beam/tree/release-2.54.0
> [1] https://github.com/apache/beam/milestone/18
> [2]
> https://github.com/apache/beam/blob/master/contributor-docs/release-guide.md#triage-release-blocking-issues-in-github
> 


[Release 2.54.0] Release Branch has been Cut!

2024-01-24 Thread Robert Burke
Hello Beam Devs!

The 2.54.0 release branch has been cut [0]!

There are 5 outstanding issues to be triaged for 2.54.0 in the release
milestones [1]

They are presently the following issues:

https://github.com/apache/beam/issues/25590

https://github.com/apache/beam/issues/29214

https://github.com/apache/beam/issues/29987

https://github.com/apache/beam/pull/29834

https://github.com/apache/beam/issues/30095

I'll be going through these and determining if they are release blocking,
per the release guide [2]. As I stabilize and verify the release branch, I
may file additional issues to be resolved before we can cut an RC1. If so,
I'll be adding them to this thread.

Thank you very much for your cooperation and support.

Robert Burke
Your friendly neighbourhood Beam 2.54.0 release manager

[0] https://github.com/apache/beam/tree/release-2.54.0
[1] https://github.com/apache/beam/milestone/18
[2]
https://github.com/apache/beam/blob/master/contributor-docs/release-guide.md#triage-release-blocking-issues-in-github


Re: Google Artifact Registry detects critical vuln CVE-2023-45853 in beam dataflow

2024-01-24 Thread Robert Burke
Thanks for the shout out XQ! And thanks for bringing this up.

Moving to a Distroless base for Go SDK images should reduce the
vulnerability surface to whichever version of glibc we have packaged in.

I do have some concerns around users who would like to extend the image
(not having shells or package managers makes image extension harder). The
important part for any custom Beam Go SDK image is that the entry point
remains the container boot program. I hope to add clear documentation on at
least one way of doing that before the hard Distroless switch.
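One hedged sketch of what such documentation might show — the image tag and file paths below are illustrative assumptions, not the documented procedure. The key point is staging extra files in a full-featured build stage (since distroless images have no shell or package manager) and leaving the Beam boot entrypoint untouched:

```dockerfile
# Illustrative sketch only: extending a distroless Beam Go SDK image.
# Stage extra files in a full-featured image, then copy them across.
FROM debian:bookworm-slim AS extras
RUN apt-get update && apt-get install -y --no-install-recommends ca-certificates

FROM apache/beam_go_sdk:2.53.0
COPY --from=extras /etc/ssl/certs /etc/ssl/certs
COPY my-extra-config.yaml /opt/extras/my-extra-config.yaml
# Do NOT override ENTRYPOINT: the SDK container boot program must remain
# the entry point so the runner can start the SDK harness.
```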


On Wed, Jan 24, 2024, 10:36 AM 8 Gianfortoni <8...@tokentransit.com> wrote:

> Hi,
>
> Thanks for the tips. After talking with my team, I also realized that our
> Dockerfile might not even be the same one used in your repository.
>
> Best,
> 8
>
> On Wed, Jan 24, 2024 at 12:58 PM 'XQ Hu' via Engineering <
> e...@tokentransit.com> wrote:
>
>> FYI. The ongoing PR: https://github.com/apache/beam/pull/30011 will
>> switch to the distroless images, which will have less vulnerabilities in
>> the future.
>>
>> On Wed, Jan 24, 2024 at 12:32 PM Valentyn Tymofieiev 
>> wrote:
>>
>>> > Does the beam project generally attempt to address as many of these
>>> vulnerabilities?
>>>
>>> Beam does not retroactively patch released container images, but we use
>>> the latest available docker base images during each Beam release. Many
>>> vulnerabilities concern software packages preinstalled in the Docker base
>>> layer (currently we use Debian bookworm). Such packages are not necessarily
>>> used over the course of running a Beam pipeline, so some attack vectors are
>>> not applicable but of course it would depend on a particular vulnerability.
>>>
>>> Note that Beam users can supply custom container images to use in their
>>> pipeline. For example, one can create an image based on 'distroless'
>>> distribution [1], which would significantly reduce the number of
>>> preinstalled packages. For more information on customizing container
>>> images, see [2] [3].
>>>
>>> [1] https://github.com/GoogleContainerTools/distroless
>>> [2] https://beam.apache.org/documentation/runtime/environments/
>>> [3] https://cloud.google.com/dataflow/docs/guides/build-container-image
>>>
>>> On Tue, Jan 23, 2024 at 1:30 PM 8 Gianfortoni <8...@tokentransit.com>
>>> wrote:
>>>
 Hi team,

 We recently started using the Google Artifact Registry's container
 scanning, and have been able to fix almost all critical vulnerabilities
 across our codebase. The one exception is the docker container created when
 we deploy our dataflow beam jobs.

 The "critical" vulnerability reported is
 https://security-tracker.debian.org/tracker/CVE-2023-45853, and we are
 using Apache Beam golang v2.53.0. I cannot tell whether this is something
 that is even easily fixable in the docker setup or whether beam is even
 affected by this issue.

 Has anyone else run into this issue? Would a beam dataflow job actually
 be affected or is this more relevant for someone actually running servers
 on this particular version of debian? Should we just be ignoring this
 "critical" vulnerability since it is just in the docker container for a
 couple of batch jobs? Does the beam project generally attempt to address as
 many of these vulnerabilities?

 Best,
 8
 Token Transit

>>> --
>> You received this message because you are subscribed to the Google Groups
>> "Engineering" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to eng+unsubscr...@tokentransit.com.
>> To view this discussion on the web visit
>> https://groups.google.com/a/tokentransit.com/d/msgid/eng/CAO%2BjvV0vn1PZrCmJxhhci3ExL8zY3tKmeHYsDBtWhpxP44sh1A%40mail.gmail.com
>> 
>> .
>>
>


Re: @RequiresTimeSortedInput adoption by runners

2024-01-20 Thread Robert Burke
g to be implemented more 
> slowly or less scalably as a fallback, I think it may be best to simply be 
> upfront about being unable to really run it. It would depend on the 
> situation. For requiring time sorted input, the manual implementation is 
> probably similar to what a streaming runner might do, so it might make sense.
>
> Kenn
>
> On Fri, Jan 19, 2024 at 11:05 AM Robert Burke  
>  wrote:
>
> I certainly don't have the deeper java insight here. So one more portable 
> based reply and then I'll step back on the Java specifics.
>
> Portable runners only really have the "unknown Composite" fallback option, 
> where if the Composite's URN isn't known to the runner, it should use the 
> subgraph that is being wrapped.
>
> I suppose the protocol could be expanded : If a composite transform with a 
> ParDo payload, and urn has features the runner can't handle, then it could 
> use the fallback graph as well.
>
> The SDK would then still have needed to construct the fallback graph 
> into the Pipeline proto. This doesn't sound incompatible with what you've 
> suggested the Java SDK could do, but it avoids the runner needing to be aware 
> of a specific implementation requirement around a feature it doesn't support. 
>  If it has to do something specific to support an SDK specific mechanism, 
> that's still supporting the feature, but I fear it's not a great road to 
> tread on for runners to add SDK specific implementation details.
>
> If a (portable) runner is going to spend work on doing something to handle 
> RequiresTimeSortedInput, it's probably easier to handle it generally than to 
> try to enable a Java specific work around. I'm not even sure how that could 
> work since the SDK would then need a special interpretation of what a runner 
> sent back for it to do any SDK side special backup handling, vs the simple 
> execution of the given transform.
>
> It's entirely possible I've over simplified the "fallback" protocol described 
> above, so this thread is still useful for my Prism work, especially if I see 
> any similar situations once I start on the Java Validates Runner suite.
>
> Robert Burke
> Beam Go Busybody
>
> On Fri, Jan 19, 2024, 6:41 AM Jan Lukavský  
>  wrote:
>
> I was primarily focused on Java SDK (and core-contruction-java), but 
> generally speaking, any SDK can provide default expansion that runners can 
> use so that it is not (should not be) required to implement this manually.
> Currently, in Java SDK, the annotation is wired up into StatefulDoFnRunner, 
> which (as the name suggests) can be used for running stateful DoFns. The problem 
> is that not every runner is using this facility. Java SDK generally supports 
> providing default expansions of transforms, but _only for transforms that do 
> not have to work with dynamic state_. This is not the case for this 
> annotation - a default implementation for @RequiresTimeSortedInput has to 
> take another DoFn as input, and wire its lifecycle in a way that elements are 
> buffered in (dynamically created) buffer and fed into the downstream DoFn 
> only when timer fires.
>
> If I narrow down my line of thinking, it would be possible to:
>  a) create something like "dynamic pipeline expansion", which would make it 
> possible work with PTransforms in this way (probably would require some 
> ByteBuddy magic)
>  b) wire this up to DoFnInvoker, which takes DoFn and creates class that is 
> used by runners for feeding data
>
> Option b) would ensure that actually all runners support such expansion, but 
> seems to be somewhat hacky and too specific to this case. Moreover, it would 
> require knowledge if the expansion is actually required by the runner (e.g. 
> if the annotation is supported explicitly - most likely for batch execution). 
> Therefore I'd be in favor of option a), this might be reusable by a broader 
> range of default expansions.
>
> In other SDKs than Java this might have different implications, the reason 
> why it is somewhat more complicated to do dynamic (or generic?) expansions of 
> PTransforms in Java is mostly due to how DoFns are implemented in terms of 
> annotations and the DoFnInvokers involved for efficiency.
>
>  Jan
>
> On 1/18/24 18:35, Robert Burke wrote:
>
> I agree that variable support across Runners does limit the adoption of a 
> feature.  But it's also then limited if the SDKs and their local / direct 
> runners don't yet support the feature. The Go SDK doesn't currently have a 
> way of specifying that annotation, preventing use.  (The lack of mention of 
> the Python direct runner your list implies it's not yet supported by the 
> Python SDK, and a quick search shows that's likely [0])
>
> While no
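The default expansion quoted in this thread — buffer elements in (dynamically created) state and feed them downstream in event-time order when a timer fires — can be sketched outside Beam roughly as follows. This is plain Python, not the SDK API; `on_timer` stands in for a watermark-driven event-time timer:

```python
import heapq

class TimeSortedBuffer:
    """Sketch of the buffering a @RequiresTimeSortedInput expansion performs."""

    def __init__(self):
        self._heap = []  # min-heap of (event timestamp, value)

    def add(self, timestamp, value):
        # Elements arrive in arbitrary order; park them keyed by timestamp.
        heapq.heappush(self._heap, (timestamp, value))

    def on_timer(self, watermark):
        """Release, in event-time order, everything at or before the watermark.

        Elements past the watermark stay buffered, since earlier elements
        may still arrive before the watermark advances.
        """
        ready = []
        while self._heap and self._heap[0][0] <= watermark:
            ready.append(heapq.heappop(self._heap))
        return ready
```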

Re: @RequiresTimeSortedInput adoption by runners

2024-01-19 Thread Robert Burke
I certainly don't have the deeper java insight here. So one more portable
based reply and then I'll step back on the Java specifics.

Portable runners only really have the "unknown Composite" fallback option,
where if the Composite's URN isn't known to the runner, it should use the
subgraph that is being wrapped.
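As a rough illustration of that fallback rule (simplified: transforms are plain dicts standing in for the Runner API protos, and the known-URN set is hypothetical):

```python
# Composite URNs this toy runner can execute directly.
KNOWN_URNS = {"beam:transform:pardo:v1", "beam:transform:group_by_key:v1"}

def leaf_transforms(transform):
    """Expand a transform into pieces the runner knows how to execute."""
    if transform.get("urn") in KNOWN_URNS:
        return [transform]  # runner handles this composite itself
    subtransforms = transform.get("subtransforms", [])
    if not subtransforms:
        raise ValueError(f"cannot execute unknown leaf: {transform.get('urn')}")
    # Unknown composite: fall back to the wrapped subgraph.
    leaves = []
    for sub in subtransforms:
        leaves.extend(leaf_transforms(sub))
    return leaves
```

The expansion of the protocol suggested here would extend the "unknown URN" test to also take the fallback path for known URNs whose declared requirements the runner can't satisfy.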

I suppose the protocol could be expanded : If a composite transform with a
ParDo payload, and urn has features the runner can't handle, then it could
use the fallback graph as well.

The SDK would then still have needed to construct the fallback graph
into the Pipeline proto. This doesn't sound incompatible with what you've
suggested the Java SDK could do, but it avoids the runner needing to be
aware of a specific implementation requirement around a feature it doesn't
support.  If it has to do something specific to support an SDK specific
mechanism, that's still supporting the feature, but I fear it's not a great
road to tread on for runners to add SDK specific implementation details.

If a (portable) runner is going to spend work on doing something to handle
RequiresTimeSortedInput, it's probably easier to handle it generally than
to try to enable a Java specific work around. I'm not even sure how that
could work since the SDK would then need a special interpretation of what a
runner sent back for it to do any SDK side special backup handling, vs the
simple execution of the given transform.

It's entirely possible I've over simplified the "fallback" protocol
described above, so this thread is still useful for my Prism work,
especially if I see any similar situations once I start on the Java
Validates Runner suite.

Robert Burke
Beam Go Busybody

On Fri, Jan 19, 2024, 6:41 AM Jan Lukavský  wrote:

> I was primarily focused on Java SDK (and core-contruction-java), but
> generally speaking, any SDK can provide default expansion that runners can
> use so that it is not (should not be) required to implement this manually.
> Currently, in Java SDK, the annotation is wired up into
> StatefulDoFnRunner, which (as the name suggests) can be used for running
> stateful DoFns. The problem is that not every runner is using this
> facility. Java SDK generally supports providing default expansions of
> transforms, but _only for transforms that do not have to work with dynamic
> state_. This is not the case for this annotation - a default implementation
> for @RequiresTimeSortedInput has to take another DoFn as input, and wire
> its lifecycle in a way that elements are buffered in (dynamically created)
> buffer and fed into the downstream DoFn only when timer fires.
>
> If I narrow down my line of thinking, it would be possible to:
>  a) create something like "dynamic pipeline expansion", which would make
> > it possible to work with PTransforms in this way (probably would require some
> ByteBuddy magic)
>  b) wire this up to DoFnInvoker, which takes DoFn and creates class that
> is used by runners for feeding data
>
> Option b) would ensure that actually all runners support such expansion,
> but seems to be somewhat hacky and too specific to this case. Moreover, it
> would require knowledge if the expansion is actually required by the runner
> (e.g. if the annotation is supported explicitly - most likely for batch
> execution). Therefore I'd be in favor of option a), this might be reusable
> by a broader range of default expansions.
>
> In other SDKs than Java this might have different implications, the reason
> why it is somewhat more complicated to do dynamic (or generic?) expansions
> of PTransforms in Java is mostly due to how DoFns are implemented in terms
> of annotations and the DoFnInvokers involved for efficiency.
>
>  Jan
>
> On 1/18/24 18:35, Robert Burke wrote:
>
> I agree that variable support across Runners does limit the adoption of a 
> feature.  But it's also then limited if the SDKs and their local / direct 
> runners don't yet support the feature. The Go SDK doesn't currently have a 
> way of specifying that annotation, preventing use.  (The lack of mention of 
the Python direct runner in your list implies it's not yet supported by the
> Python SDK, and a quick search shows that's likely [0])
>
> While not yet widely available to the other SDKs, Prism, the new Go SDK Local 
> Runner, maintains data in event time sorted heaps [1]. The intent was to 
> implement the annotation (among other features) once I start running the Java 
> and Python Validates Runner suites against it.
>
> I think stateful transforms are getting the event ordering on values for 
> "free" as a result [2], but there's no special/behavior at present if the 
> DoFn is consuming the result of a Group By Key.
>
> Part of the issue is that by definition, a GBK "loses" the timestamps of the 
> values, and doesn't emit

Re: @RequiresTimeSortedInput adoption by runners

2024-01-18 Thread Robert Burke
I agree that variable support across Runners does limit the adoption of a 
feature.  But it's also then limited if the SDKs and their local / direct 
runners don't yet support the feature. The Go SDK doesn't currently have a way 
of specifying that annotation, preventing use.  (The lack of mention of the 
Python direct runner in your list implies it's not yet supported by the Python
SDK, and a quick search shows that's likely [0])

While not yet widely available to the other SDKs, Prism, the new Go SDK Local 
Runner, maintains data in event time sorted heaps [1]. The intent was to 
implement the annotation (among other features) once I start running the Java 
and Python Validates Runner suites against it.

I think stateful transforms are getting the event ordering on values for "free" 
as a result [2], but there's no special behavior at present if the DoFn is
consuming the result of a Group By Key.

Part of the issue is that by definition, a GBK "loses" the timestamps of the 
values, and doesn't emit them, outside of using them to determine the resulting 
timestamp of the Key... [3]. To make use of the timestamp in the aggregation 
stage a runner would need to do something different in the GBK, namely sorting 
by the timestamp as the data is ingested, and keeping that timestamp around to 
continue the sort. This prevents a more efficient implementation of directly 
arranging the received element bytes into the Iterator format, requiring
post-process filtering. Not hard, but a little dissatisfying.
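A toy sketch of that approach, assuming hypothetical (key, value, timestamp) records rather than Beam's actual internal representation: the runner keeps each value's timestamp through the grouping and sorts per-key values by event time before emitting.

```python
from collections import defaultdict

def time_sorted_gbk(records):
    # records: iterable of (key, value, event_timestamp) tuples.
    # Keep the timestamp alongside each value through the grouping,
    # then sort each key's values by event time before emitting.
    groups = defaultdict(list)
    for key, value, ts in records:
        groups[key].append((ts, value))
    return {k: [v for _, v in sorted(pairs)] for k, pairs in groups.items()}
```

Buffering everything and sorting at the end is exactly the post-process filtering mentioned above; a runner could instead maintain a sorted structure (e.g. a heap) as data is ingested.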

Skimming through the discussion, I agree with the general utility goal of the 
annotation, but as with many Beam features, there may be a discoverability 
problem. The feature isn't mentioned in the Programming Guide (AFAICT), and 
trying to find anything on the beam site, the top result is the Javadoc for the 
annotation (which is good, but you still need to know to look for it), and then 
the next time-related bit is OrderedListState, which doesn't yet have a
meaningful portable representation last I checked [4], once again limiting 
adoption.

Probably the most critical bit is, while we have broad "handling" of the 
annotation, I'm hard pressed to say we even use the annotation outside of 
tests. A search [5] doesn't show any "Transforms" or "IOs" making use of it 
with the only markdown/documentation about it being the Beam 2.20.0 release 
notes saying it's now supported in Flink and Spark [6].

I will say, this isn't grounds for removing the feature, as I can only check 
what's in the repo, and not what end users have, but it does indicate we didn't 
drive the feature to completion and enable user adoption beyond "This Exists, 
and we can tell you about it if you ask.".

AFAICT this is just one of those features we built, but then proceeded not to 
use within Beam, and evangelize. This is a point we could certainly do better 
on in Beam as a whole.

Robert Burke
Beam Go Busybody

[0]  
https://github.com/search?q=repo%3Aapache%2Fbeam+TIME_SORTED_INPUT+language%3APython=code

[1] 
https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/runners/prism/internal/engine/elementmanager.go#L93

[2] 
https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/runners/prism/internal/engine/elementmanager.go#L1094

[3] 
https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L1132

[4] 
https://github.com/apache/beam/issues?q=is%3Aissue+is%3Aopen+OrderedListState

[5] 
https://github.com/search?q=repo%3Aapache%2Fbeam+RequiresTimeSortedInput=code=2

[6] 
https://github.com/apache/beam/blob/b4c23b32f2b80ce052c8a235e5064c69f37df992/website/www/site/content/en/blog/beam-2.20.0.md?plain=1#L46

On 2024/01/18 16:14:56 Jan Lukavský wrote:
> Hi,
> 
> recently I came across the fact that most runners do not support 
> @RequiresTimeSortedInput annotation for sorting per-key data by event 
> timestamp [1]. Actually, runners supporting it seem to be Direct java, 
> Flink and Dataflow batch (as it is a noop there). The annotation has 
> use-cases in time-series data processing, in transaction processing and 
> more. Though it is absolutely possible to implement the time-sorting 
> manually (e.g. [2]), this is actually efficient only in streaming mode, 
> in batch mode the runner typically wants to leverage the internal 
> sort-grouping it already does.
> 
> The original idea was to implement this annotation inside 
> StatefulDoFnRunner, which would be used by majority of runners. It turns 
> out that this is not the case. The question now is, should we use an 
> alternative place to implement the annotation (e.g. Pipeline expansion, 
> or DoFnInvoker) so that more runners can benefit from it automatically 
> (at least for streaming case, batch case needs to be implemented 
> manually)? Do the community find the annotation useful? I'm link

Re: Beam 2.54.0 Release

2024-01-10 Thread Robert Burke
Not sure why newlines were eaten. Hopefully reflowed inline below. 

On 2024/01/10 17:53:56 Robert Burke wrote:
> Hey everyone, Happy New Year!
>
> The next release (2.54.0) branch cut is scheduled for Jan 24, 2024, 2 weeks
> from today, according to the release calendar [1]. I'd like to perform this
> release; I will cut the branch on that date, and cherrypick remaining 
> release-blocking fixes afterwards, if any. 
>
> Please help with the release by: 
>
> - Making sure that any unresolved release blocking issues have
> their "Milestone" marked as "2.54.0 Release" as soon as possible.
> -Reviewing the current release blockers [2] and remove the Milestone if they
> don't meet the criteria at [3]. There are currently 8 release blockers.
>
> Let me know if you have any comments/objections/questions.
> Thanks, Robert Burke
> 
> [1] 
> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
>
> [2] https://github.com/apache/beam/milestone/18
>
> [3]  https://beam.apache.org/contribute/release-blocking/
> 


Re: ByteBuddy DoFnInvokers Write Up

2024-01-10 Thread Robert Burke
That's neat! Thanks for writing that up!

On Wed, Jan 10, 2024, 11:12 AM John Casey via dev 
wrote:

> The team at Google recently held an internal hackathon, and my hack
> involved modifying how our ByteBuddy DoFnInvokers work. My hack didn't end
> up going anywhere, but I learned a lot about how our code generation works.
> It turns out we have no documentation or design docs about our code
> generation, so I wrote up what I learned,
>
> Please take a look, and let me know if I got anything wrong, or if you are
> looking for more detail
>
> s.apache.org/beam-bytebuddy-dofninvoker
>
> John
>


Beam 2.54.0 Release

2024-01-10 Thread Robert Burke
Hey everyone, Happy New Year!
The next release (2.54.0) branch cut is scheduled for Jan 24, 2024, 2 weeks
from today, according to the release calendar [1]. I'd like to perform this
release; I will cut the branch on that date, and cherrypick
remaining release-blocking fixes afterwards, if any. Please help with the
release by: - Making sure that any unresolved release blocking issues have
their "Milestone" marked as "2.54.0 Release" as soon as possible. -
Reviewing the current release blockers [2] and remove the Milestone if they
don't meet the criteria at [3]. There are currently 8 release blockers. Let
me know if you have any comments/objections/questions. Thanks, Robert Burke
[1]
https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
[2] https://github.com/apache/beam/milestone/18 [3]
https://beam.apache.org/contribute/release-blocking/


Re: [RESULT] [VOTE] Release 2.53.0, release candidate #2

2024-01-05 Thread Robert Burke
Done!

On Fri, Jan 5, 2024, 11:30 AM Robert Burke  wrote:

> Going to try to get this done. Will report back when completed (or I get
> pulled elsewhere).
>
> On Thu, Jan 4, 2024, 11:23 AM Jack McCluskey via dev 
> wrote:
>
>> Hey everyone,
>>
>> Following up on this, I do need help from a PMC member for the PMC-only
>> finalization steps (
>> https://github.com/apache/beam/blob/master/contributor-docs/release-guide.md#pmc-only-finalization
>> )
>>
>> Thanks,
>>
>> Jack McCluskey
>>
>> On Thu, Jan 4, 2024 at 10:27 AM Jack McCluskey 
>> wrote:
>>
>>> Hey everyone,
>>>
>>> I'm happy to announce that we have unanimously approved this release.
>>>
>>> There are nine approving votes, three of which are binding:
>>> * Jan Lukavský (binding)
>>> * Chamikara Jayalath (binding)
>>> * Robert Burke (binding)
>>> * XQ Hu
>>> * Danny McCormick
>>> * Bruno Volpato
>>> * Svetak Sundhar
>>> * Yi Hu
>>> * Johanna Öjeling
>>>
>>> There are no disapproving votes. I will begin finalizing the release.
>>>
>>> Thanks everyone!
>>>
>>> --
>>>
>>>
>>> Jack McCluskey
>>> SWE - DataPLS PLAT/ Dataflow ML
>>> RDU
>>> jrmcclus...@google.com
>>>
>>>
>>>


Re: [RESULT] [VOTE] Release 2.53.0, release candidate #2

2024-01-05 Thread Robert Burke
Going to try to get this done. Will report back when completed (or I get
pulled elsewhere).

On Thu, Jan 4, 2024, 11:23 AM Jack McCluskey via dev 
wrote:

> Hey everyone,
>
> Following up on this, I do need help from a PMC member for the PMC-only
> finalization steps (
> https://github.com/apache/beam/blob/master/contributor-docs/release-guide.md#pmc-only-finalization
> )
>
> Thanks,
>
> Jack McCluskey
>
> On Thu, Jan 4, 2024 at 10:27 AM Jack McCluskey 
> wrote:
>
>> Hey everyone,
>>
>> I'm happy to announce that we have unanimously approved this release.
>>
>> There are nine approving votes, three of which are binding:
>> * Jan Lukavský (binding)
>> * Chamikara Jayalath (binding)
>> * Robert Burke (binding)
>> * XQ Hu
>> * Danny McCormick
>> * Bruno Volpato
>> * Svetak Sundhar
>> * Yi Hu
>> * Johanna Öjeling
>>
>> There are no disapproving votes. I will begin finalizing the release.
>>
>> Thanks everyone!
>>
>> --
>>
>>
>> Jack McCluskey
>> SWE - DataPLS PLAT/ Dataflow ML
>> RDU
>> jrmcclus...@google.com
>>
>>
>>


Re: [VOTE] Release 2.53.0, release candidate #2

2024-01-03 Thread Robert Burke
+1 (binding)

Validated the Go SDK against my own pipleines.

Robert Burke

On Wed, Jan 3, 2024, 7:52 AM Chamikara Jayalath via dev 
wrote:

> +1 (binding)
>
> Validated Java/Python x-lang jobs.
>
> - Cham
>
> On Tue, Jan 2, 2024 at 7:35 AM Jack McCluskey via dev 
> wrote:
>
>> Happy New Year, everyone!
>>
>> Now that we're through the holidays I just wanted to bump the voting
>> thread so we can keep the RC moving.
>>
>> Thanks,
>>
>> Jack McCluskey
>>
>> On Fri, Dec 29, 2023 at 11:58 AM Johanna Öjeling via dev <
>> dev@beam.apache.org> wrote:
>>
>>> +1 (non-binding).
>>>
>>> Tested Go SDK with Dataflow on own use cases.
>>>
>>> On Fri, Dec 29, 2023 at 2:57 AM Yi Hu via dev 
>>> wrote:
>>>
>>>> +1 (non-binding)
>>>>
>>>> Tested with Beam GCP IOs benchmarking (
>>>> https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/main/it/google-cloud-platform
>>>> )
>>>>
>>>> On Thu, Dec 28, 2023 at 11:36 AM Svetak Sundhar via dev <
>>>> dev@beam.apache.org> wrote:
>>>>
>>>>> +1 (non binding)
>>>>>
>>>>> Tested with Healthcare notebooks.
>>>>>
>>>>>
>>>>> Svetak Sundhar
>>>>>
>>>>>   Data Engineer
>>>>> s vetaksund...@google.com
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Dec 28, 2023 at 3:52 AM Jan Lukavský  wrote:
>>>>>
>>>>>> +1 (binding)
>>>>>>
>>>>>> Tested Java SDK with Flink Runner.
>>>>>>
>>>>>>  Jan
>>>>>> On 12/27/23 14:13, Danny McCormick via dev wrote:
>>>>>>
>>>>>> +1 (non-binding)
>>>>>>
>>>>>> Tested with some example ML notebooks.
>>>>>>
>>>>>> Thanks,
>>>>>> Danny
>>>>>>
>>>>>> On Tue, Dec 26, 2023 at 6:41 PM XQ Hu via dev 
>>>>>> wrote:
>>>>>>
>>>>>>> +1 (non-binding)
>>>>>>>
>>>>>>> Tested with the simple RunInference pipeline:
>>>>>>> https://github.com/google/dataflow-ml-starter/actions/runs/7332832875/job/19967521369
>>>>>>>
>>>>>>> On Tue, Dec 26, 2023 at 3:29 PM Jack McCluskey via dev <
>>>>>>> dev@beam.apache.org> wrote:
>>>>>>>
>>>>>>>> Happy holidays everyone,
>>>>>>>>
>>>>>>>> Please review and vote on the release candidate #2 for the version
>>>>>>>> 2.53.0, as follows:
>>>>>>>>
>>>>>>>> [ ] +1, Approve the release
>>>>>>>> [ ] -1, Do not approve the release (please provide specific
>>>>>>>> comments)
>>>>>>>>
>>>>>>>> Reviewers are encouraged to test their own use cases with the
>>>>>>>> release candidate, and vote +1 if no issues are found. Only PMC member
>>>>>>>> votes will count towards the final vote, but votes from all community
>>>>>>>> members are encouraged and helpful for finding regressions; you can 
>>>>>>>> either
>>>>>>>> test your own use cases [13] or use cases from the validation sheet 
>>>>>>>> [10].
>>>>>>>>
>>>>>>>> The complete staging area is available for your review, which
>>>>>>>> includes:
>>>>>>>> * GitHub Release notes [1],
>>>>>>>> * the official Apache source release to be deployed to
>>>>>>>> dist.apache.org [2], which is signed with the key with fingerprint
>>>>>>>> DF3CBA4F3F4199F4 (D20316F712213422 if automated) [3],
>>>>>>>> * all artifacts to be deployed to the Maven Central Repository [4],
>>>>>>>> * source code tag "v1.2.3-RC3" [5],
>>>>>>>> * website pull request listing the release [6], the blog post [6],
>>>>>>>> and publishing the API reference manual [7].
>>>>>>>> * Python artifacts are deployed along with the source release to
>>>>>>>> the dist.apache.org [2] and PyPI[8].
>>>>&g

Re: How do side inputs relate to stage fusion?

2023-12-15 Thread Robert Burke
That would do it. We got so tunnel visioned on side inputs we missed that!

IIRC the python local runner and Prism both only fuse transforms in
identical environments together. So any environmental diffs will prevent
fusion.

Runners as a rule are usually free to ignore/manage hints as they like.
Transform annotations might be an alternative, but how those are managed
would be more SDK specific.

On Fri, Dec 15, 2023, 5:21 AM Joey Tran  wrote:

> I figured out my issue. I thought side inputs were breaking up my pipeline
> but after experimenting with my transforms I now realize what was actually
> breaking it up was different transform environments that weren't considered
> compatible.
>
> We have a custom resource hint (for specifying whether a transform needs
> access to some software license) that we use with our transforms and that's
> what was preventing the fusion I was expecting. I'm looking into how to
> make these hints mergeable now.
>
> On Thu, Dec 14, 2023 at 7:46 PM Robert Burke  wrote:
>
>> Building on what Robert Bradshaw has said, basically, if these fusion
>> breaks don't exist, the pipeline can live lock, because the side input is
>> unable to finish computing for a given input element's window.
>>
>> I have recently added fusion to the Go Prism runner based on the python
>> side input semantics, and I was surprised that there are basically two
>> rules for fusion. The side input one, and for handling Stateful processing.
>>
>>
>> This code here is the greedy fusion algorithm that Python uses, but less
>> set-based, so it might be easier to follow:
>> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/runners/prism/internal/preprocess.go#L513
>>
>> From the linked code comment:
>>
>> Side Inputs: A transform S consuming a PCollection as a side input can't
>>  be fused with the transform P that produces that PCollection. Further,
>> no transform S+ descended from S, can be fused with transform P.
>>
>> Ideally I'll add visual representations of the graphs in the test suite
>> here, that validates the side input dependency logic:
>>
>>
>> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/runners/prism/internal/preprocess_test.go#L398
>>
>> (Note, that test doesn't validate expected fusion results, Prism is a
>> work in progress).
>>
>>
>> As for the Stateful rule, this is largely an implementation convenience
>> for runners to ensure correct execution.
>> If your pipeline also uses Stateful transforms, or SplittableDoFns, those
>> are usually relegated to the root of a fused stage, and avoid fusing with
>> each other. That can also cause additional stages.
>>
>> If Beam adopted a rigorous notion of Key Preserving for transforms,
>> multiple stateful transforms could be fused in the same stage. But that's a
>> very different discussion.
>>
>> On Thu, Dec 14, 2023, 4:03 PM Joey Tran 
>> wrote:
>>
>>> Thanks for the explanation!
>>>
>>> That matches with my intuition - are there any other rules with side
>>> inputs?
>>>
>>> I might be misunderstanding the actual cause of the fusion breaks in our
>>> pipeline, but we essentially have one part of the graph that produces many
>>> small collections that are used as side inputs in the remaining part of the
>>> graph. In other words, the "main graph" is mostly linear but uses side
>>> inputs from the earlier part of the graph.
>>>
>>>  Since the main graph is mostly linear, I expected few stages, but what
>>> I actually see are a lot of breaks around the side input requiring
>>> transforms.
>>>
>>>
>>> Tangentially, are there any general tips for understanding why a graph
>>> might be fused the way it was?
>>>
>>> On Thu, Dec 14, 2023, 6:10 PM Robert Bradshaw via dev <
>>> dev@beam.apache.org> wrote:
>>>
>>>> That is correct. Side inputs give a view of the "whole" PCollection and
>>>> hence introduce a fusion-producing barrier. For example, suppose one has a
>>>> DoFn that produces two outputs, mainPColl and sidePColl, that are consumed
>>>> (as the main and side input respectively) of DoFnB.
>>>>
>>>>                   mainPColl ----- DoFnB
>>>>                  /                  ^
>>>> inPColl -- DoFnA                    |
>>>>                  \                  |
>>>>                   sidePColl --------/
>>>>
>>>>
>>

Re: How do side inputs relate to stage fusion?

2023-12-14 Thread Robert Burke
Building on what Robert Bradshaw has said, basically, if these fusion
breaks don't exist, the pipeline can live lock, because the side input is
unable to finish computing for a given input element's window.

I have recently added fusion to the Go Prism runner based on the python
side input semantics, and I was surprised that there are basically two
rules for fusion. The side input one, and for handling Stateful processing.


This code here is the greedy fusion algorithm that Python uses, but less
set-based, so it might be easier to follow:
https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/runners/prism/internal/preprocess.go#L513

>From the linked code comment:

Side Inputs: A transform S consuming a PCollection as a side input can't
 be fused with the transform P that produces that PCollection. Further,
no transform S+ descended from S, can be fused with transform P.
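A minimal, runner-agnostic sketch of that rule over a toy graph representation (plain dicts of transform names to PCollection names — not Prism's or the Python SDK's actual data structures): the producer P of a side-input PCollection must not fuse with the consumer S, nor with any transform descended from S.

```python
def forbidden_fusions(produces, main_inputs, side_inputs):
    # produces:    transform -> set of PCollections it outputs
    # main_inputs: transform -> set of PCollections it reads as main inputs
    # side_inputs: transform -> set of PCollections it reads as side inputs
    producer_of = {pc: t for t, pcs in produces.items() for pc in pcs}
    # Build a consumer adjacency over both main and side inputs, so that
    # "descended from S" follows every data edge in the graph.
    consumers = {}
    for t in produces:
        for pc in main_inputs.get(t, set()) | side_inputs.get(t, set()):
            p = producer_of.get(pc)
            if p is not None:
                consumers.setdefault(p, set()).add(t)

    def descendants(t):
        seen, stack = set(), [t]
        while stack:
            for nxt in consumers.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    forbidden = set()
    for s, sides in side_inputs.items():
        for pc in sides:
            p = producer_of.get(pc)
            if p is None:
                continue  # side input produced outside this graph
            forbidden.add((p, s))           # P may not fuse with S...
            for d in descendants(s):
                forbidden.add((p, d))       # ...nor with any descendant of S.
    return forbidden
```

On the DoFnA/DoFnB example from earlier in the thread, this marks (DoFnA, DoFnB) as unfusable, plus (DoFnA, C) for any C downstream of DoFnB.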

Ideally I'll add visual representations of the graphs in the test suite
here, that validates the side input dependency logic:

https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/runners/prism/internal/preprocess_test.go#L398

(Note, that test doesn't validate expected fusion results, Prism is a work
in progress).


As for the Stateful rule, this is largely an implementation convenience for
runners to ensure correct execution.
If your pipeline also uses Stateful transforms, or SplittableDoFns, those
are usually relegated to the root of a fused stage, and avoid fusing with
each other. That can also cause additional stages.

If Beam adopted a rigorous notion of Key Preserving for transforms,
multiple stateful transforms could be fused in the same stage. But that's a
very different discussion.

On Thu, Dec 14, 2023, 4:03 PM Joey Tran  wrote:

> Thanks for the explanation!
>
> That matches with my intuition - are there any other rules with side
> inputs?
>
> I might be misunderstanding the actual cause of the fusion breaks in our
> pipeline, but we essentially have one part of the graph that produces many
> small collections that are used as side inputs in the remaining part of the
> graph. In other words, the "main graph" is mostly linear but uses side
> inputs from the earlier part of the graph.
>
>  Since the main graph is mostly linear, I expected few stages, but what I
> actually see are a lot of breaks around the side input requiring transforms.
>
>
> Tangentially, are there any general tips for understanding why a graph
> might be fused the way it was?
>
> On Thu, Dec 14, 2023, 6:10 PM Robert Bradshaw via dev 
> wrote:
>
>> That is correct. Side inputs give a view of the "whole" PCollection and
>> hence introduce a fusion-producing barrier. For example, suppose one has a
>> DoFn that produces two outputs, mainPColl and sidePColl, that are consumed
>> (as the main and side input respectively) of DoFnB.
>>
>>                   mainPColl ----- DoFnB
>>                  /                  ^
>> inPColl -- DoFnA                    |
>>                  \                  |
>>                   sidePColl --------/
>>
>>
>> Now DoFnB may iterate over the entity of sidePColl for every element of
>> mainPColl. This means that DoFnA and DoFnB cannot be fused, which
>> would require DoFnB to consume the elements as they are produced from
>> DoFnA, but we need DoFnA to run to completion before we know the contents
>> of sidePColl.
>>
>> Similar constraints apply in larger graphs (e.g. there may be many
>> intermediate DoFns and PCollections), but they principally boil down to
>> shapes that look like this.
>>
>> Though this does not introduce a global barrier in streaming, there is
>> still the analogous per window/watermark barrier that prevents fusion for
>> the same reasons.
>>
>>
>>
>>
>> On Thu, Dec 14, 2023 at 3:02 PM Joey Tran 
>> wrote:
>>
>>> Hey all,
>>>
>>> We have a pretty big pipeline and while I was inspecting the stages, I
>>> noticed there is less fusion than I expected. I suspect it has to do with
>>> the heavy use of side inputs in our workflow. In the python sdk, I see that
>>> side inputs are considered when determining whether two stages are fusible.
>>> I have a hard time getting a clear understanding of the logic though. Could
>>> someone clarify / summarize the rules around this?
>>>
>>> Thanks!
>>> Joey
>>>
>>


Re: Beam 2.53.0 Release

2023-11-29 Thread Robert Burke
Thanks Jack!

On Wed, Nov 29, 2023, 10:01 AM Jack McCluskey via dev 
wrote:

> Hey everyone, the next release (2.53.0) branch cut is scheduled for Dec 13,
> 2023, 2 weeks from today, according to the release calendar [1]. I'd like
> to perform this release; I will cut the branch on that date, and cherrypick
> release-blocking fixes afterwards, if any.
>
>
> Please help with the release by:
>
>
> - Making sure that any unresolved release blocking issues have their
> "Milestone" marked as "2.53.0 Release" as soon as possible.
>
> - Reviewing the current release blockers [2] and remove the Milestone if
> they don't meet the criteria at [3]. There are currently 12 release
> blockers.
>
>
> Let me know if you have any comments/objections/questions.
>
>
> Thanks,
>
> Jack
>
>
> [1]
>
> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
>
> [2] https://github.com/apache/beam/milestone/17
>
> [3] https://beam.apache.org/contribute/release-blocking/
>
> --
>
>
> Jack McCluskey
> SWE - DataPLS PLAT/ Dataflow ML
> RDU
> jrmcclus...@google.com
>
>
>


Re: [YAML] Aggregations

2023-10-29 Thread Robert Burke
I came across EdgeDB, and it has a novel syntax moving away from SQL with
its EdgeQL.

https://www.edgedb.com/

E.g. here are two equivalent "nested" queries.


# EdgeQL

select Movie {
  title,
  actors: {
   name
  },
  rating := math::mean(.reviews.score)
} filter "Zendaya" in .actors.name;


# SQL

SELECT
  title,
  Actors.name AS actor_name,
  (SELECT avg(score)
FROM Movie_Reviews
WHERE movie_id = Movie.id) AS rating
FROM
  Movie
  LEFT JOIN Movie_Actors ON
Movie.id = Movie_Actors.movie_id
  LEFT JOIN Person AS Actors ON
Movie_Actors.person_id = Person.id
WHERE
  'Zendaya' IN (
SELECT Person.name
FROM
  Movie_Actors
  INNER JOIN Person
ON Movie_Actors.person_id = Person.id
WHERE
  Movie_Actors.movie_id = Movie.id)


The key observations here are specifics around join kinds and stuff don't
often need to be directly expressed in the query.

I'd need to dig deeper around it (such as do they share... ) but it does make
a nice first impression in its demos.


On Mon, Oct 23, 2023, 7:00 AM XQ Hu via dev  wrote:

> +1 on your proposal.
>
> On Fri, Oct 20, 2023 at 4:59 PM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
>
>> On Fri, Oct 20, 2023 at 11:35 AM Kenneth Knowles  wrote:
>> >
>> > A couple other bits on having an expression language:
>> >
>> >  - You already have Python lambdas at places, right? so that's quite a
>> lot more complex than SQL project/aggregate expressions
>> >  - It really does save a lot of pain for users (at the cost of
>> implementation complexity) when you need to "SUM(col1*col2)" where
>> otherwise you have to Map first. This could be viewed as desirable as well,
>> of course.
>> >
>> > Anyhow I'm pretty much in agreement with all your reasoning as to why
>> *not* to use SQL-like expressions in strings. But it does seem odd when
>> juxtaposed with Python snippets.
>>
>> Well, we say "here's a Python expression" when we're using a Python
>> string. But "SUM(col1*col2)" isn't as transparent. (Agree about the
>> niceties of being able to provide an expression rather than a column.)
>>
>> > On Thu, Oct 19, 2023 at 4:00 PM Robert Bradshaw via dev <
>> dev@beam.apache.org> wrote:
>> >>
>> >> On Thu, Oct 19, 2023 at 12:53 PM Reuven Lax  wrote:
>> >> >
>> >> > Is the schema Group transform (in Java) something along these lines?
>> >>
>> >> Yes, for sure it is. It (and Python's and Typescript's equivalent) are
>> >> linked in the original post. The open question is how to best express
>> >> this in YAML.
>> >>
>> >> > On Wed, Oct 18, 2023 at 1:11 PM Robert Bradshaw via dev <
>> dev@beam.apache.org> wrote:
>> >> >>
>> >> >> Beam Yaml has good support for IOs and mappings, but one key missing
>> >> >> feature for even writing a WordCount is the ability to do
>> Aggregations
>> >> >> [1]. While the traditional Beam primitive is GroupByKey (and
>> >> >> CombineValues), we're eschewing KVs in the notion of more schema'd
>> >> >> data (which has some precedence in our other languages, see the
>> links
>> >> >> below). The key components the user needs to specify are (1) the key
>> >> >> fields on which the grouping will take place, (2) the fields
>> >> >> (expressions?) involved in the aggregation, and (3) what aggregating
>> >> >> fn to use.
>> >> >>
>> >> >> A straw-man example could be something like
>> >> >>
>> >> >> type: Aggregating
>> >> >> config:
>> >> >>   key: [field1, field2]
>> >> >>   aggregating:
>> >> >> total_cost:
>> >> >>   fn: sum
>> >> >>   value: cost
>> >> >> max_cost:
>> >> >>   fn: max
>> >> >>   value: cost
>> >> >>
>> >> >> This would basically correspond to the SQL expression
>> >> >>
>> >> >> "SELECT field1, field2, sum(cost) as total_cost, max(cost) as
>> max_cost
>> >> >> from table GROUP BY field1, field2"
>> >> >>
>> >> >> (though I'm not requiring that we use this as an implementation
>> >> >> strategy). I do not think we need a separate (non aggregating)
>> >> >> Grouping operation, this can be accomplished by having a
>> concat-style
>> >> >> combiner.
>> >> >>
>> >> >> There are still some open questions here, notably around how to
>> >> >> specify the aggregation fns themselves. We could of course provide a
>> >> >> number of built-ins (like SQL does). This gets into the question of
>> >> >> how and where to document this complete set, but some basics should
>> >> >> take us pretty far. Many aggregators, however, are parameterized
>> (e.g.
>> >> >> quantiles); where do we put the parameters? We could go with
>> something
>> >> >> like
>> >> >>
>> >> >> fn:
>> >> >>   type: ApproximateQuantiles
>> >> >>   config:
>> >> >> n: 10
>> >> >>
>> >> >> but others are even configured by functions themselves (e.g.
>> LargestN
>> >> >> that wants a comparator Fn). Maybe we decide not to support these
>> >> >> (yet?)
>> >> >>
>> >> >> One thing I think we should support, however, is referencing custom
>> >> >> CombineFns. We have some precedent for this with our Fns from
>> >> >> 
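The straw-man Aggregating config above can be paper-evaluated with a toy interpreter (names and structure taken from the example config; this is purely illustrative, not an actual Beam YAML implementation):

```python
def aggregate(rows, key_fields, aggregating):
    # rows: list of dicts (schema'd records).
    # key_fields: list of field names to group on.
    # aggregating: dict of output_name -> (fn_name, value_field),
    # mirroring the YAML straw-man's "aggregating" section.
    fns = {"sum": sum, "max": max, "min": min}  # a few built-ins
    groups = {}
    for row in rows:
        k = tuple(row[f] for f in key_fields)
        groups.setdefault(k, []).append(row)
    out = []
    for k, grp in groups.items():
        rec = dict(zip(key_fields, k))
        for name, (fn, field) in aggregating.items():
            rec[name] = fns[fn](r[field] for r in grp)
        out.append(rec)
    return out
```

This corresponds to the "SELECT field1, field2, sum(cost) AS total_cost, max(cost) AS max_cost ... GROUP BY field1, field2" reading, without implying SQL as the implementation strategy.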

Re: Streaming update compatibility

2023-10-27 Thread Robert Burke
On Fri, Oct 27, 2023, 9:09 AM Robert Bradshaw via dev 
wrote:

> On Fri, Oct 27, 2023 at 7:50 AM Kellen Dye via dev 
> wrote:
> >
> > > Auto is hard, because it would involve
> > > querying the runner before pipeline construction, and we may not even
> > > know what the runner is at this point
> >
> > At the point where pipeline construction will start, you should have
> access to the pipeline arguments and be able to determine the runner. What
> seems to be missing is a place to query the runner pre-construction. If
> that query could return metadata about the currently running version of the
> job, then that could be incorporated into graph construction as necessary.
>
> While this is the common case, it is not true in general. For example
> it's possible to cache the pipeline proto and submit it to a separate
> choice of runner later. We have Jobs API implementations that
> forward/proxy the job to other runners, and the Python interactive
> runner is another example where the runner is late-binding (e.g. one
> tries a sample locally, and if all looks good can execute remotely,
> and also in this case the graph that's submitted is often mutated
> before running).
>
> Also, in the spirit of the portability story, the pipeline definition
> itself should be runner-independent.
>
> > That same hook could be a place to for example return the
> currently-running job graph for pre-submission compatibility checks.
>
> I suppose we could add something to the Jobs API to make "looking up a
> previous version of this pipeline" runner-agnostic, though that
> assumes it's available at construction time.


As I pointed out,  we can access a given pipeline via the job management
API. It's already runner agnostic other than Dataflow.

Operationally though, we'd need to provide the option to "dry run" an
update locally, or validate update compatibility against a given pipeline
proto.

And +1 as Kellen says we

> should define (and be able to check) what pipeline compatibility means
> in a via graph-to-graph comparison at the Beam level. I'll defer both
> of these as future work as part of the "make update a portable Beam
> concept" project.
>

Big +1 to that. Hard to know what to check for without defining it. This
would avoid needing to ask a given runner WRT dry run updates.

It's on a longer term plan, but I have intended to add Pipeline Update as a
feature to Prism. As it becomes more fully featured, it becomes a great
test bed to develop the definitions.
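Absent a formal definition, a toy sketch can at least show the shape such a graph-to-graph check might take. Everything below — the function name, the dict-of-specs graph encoding, the particular checks — is hypothetical and only illustrates the kind of rules a shared Beam-level definition could standardize:

```python
# Hypothetical sketch: a runner-agnostic update-compatibility check that
# compares two pipeline graphs, here encoded as {transform id: spec dict}.
# None of these names come from the Beam codebase.

def check_update_compatibility(old_graph, new_graph):
    """Return a list of human-readable incompatibilities (empty = compatible)."""
    issues = []
    for tid, old_spec in old_graph.items():
        new_spec = new_graph.get(tid)
        if new_spec is None:
            # A transform carrying state can't simply vanish on update.
            issues.append(f"transform {tid!r} was removed")
            continue
        if old_spec.get("coder") != new_spec.get("coder"):
            # Persisted state/shuffle data is encoded with the old coder.
            issues.append(f"{tid!r} changed its coder: "
                          f"{old_spec.get('coder')} -> {new_spec.get('coder')}")
    return issues

old = {"read": {"coder": "bytes"}, "count": {"coder": "varint"}}
new = {"read": {"coder": "bytes"}, "count": {"coder": "string"}}
print(check_update_compatibility(old, new))
```

A "dry run" update validation would then just be running such a check against the proto fetched from the job management API, with no runner involvement.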

>


Re: Streaming update compatibility

2023-10-27 Thread Robert Burke
You raise a very good point:


https://github.com/apache/beam/blob/master/model/job-management/src/main/proto/org/apache/beam/model/job_management/v1/beam_job_api.proto#L54

The job management API does allow for the pipeline proto to be returned. So
one could get the live job, so the SDK could make reasonable decisions
before sending to the runner.

Dataflow does have a similar API that can be adapted.

I am a touch concerned about spreading the update compatibility checks
around between SDKs and Runners though. But in some cases it would be
easier for the SDK, e.g. to ensure VersionA of a transform is used vs.
VersionB, based on the existing transforms used in the job being updated.


On Fri, Oct 27, 2023, 7:50 AM Kellen Dye via dev 
wrote:

> > Auto is hard, because it would involve
> > querying the runner before pipeline construction, and we may not even
> > know what the runner is at this point
>
> At the point where pipeline construction will start, you should have
> access to the pipeline arguments and be able to determine the runner. What
> seems to be missing is a place to query the runner pre-construction. If
> that query could return metadata about the currently running version of the
> job, then that could be incorporated into graph construction as necessary.
>
> That same hook could be a place to for example return the
> currently-running job graph for pre-submission compatibility checks.
>
>
>


Re: Streaming update compatibility

2023-10-26 Thread Robert Burke
Regarding 3. I suspect Go wasn't changed because the PR is centering around
implementations of the Expansion Service server, not client callers. The Go
SDK doesn't yet have an expansion service.

On Thu, Oct 26, 2023, 3:59 AM Johanna Öjeling via dev 
wrote:

> Hi,
>
> I like this idea of making it easier to push out improvements, and had a
> look at the PR.
>
> One question to better understand how it works today:
>
>1. The upgrades that the runners do, such as those not visible to the
>    user, can they be initiated at any time, or do they only happen when
>    the user updates the running pipeline, e.g. with new user code?
>
> And, assuming the former, some reflections that came to mind when
> reviewing the changes:
>
>1. Will the update_compatibility_version option be effective both when
>creating and updating a pipeline? It is grouped with the update options in
>the Python SDK, but users may want to configure the compatibility already
>when launching the pipeline.
>2. Would it be possible to revert setting a fixed prior version, i.e.
>(re-)enable upgrades?
>   1. If yes: in practice, would this motivate another option, or
>   passing a value like "auto" or "latest" to update_compatibility_version?
>3. The option is being introduced to the Java and Python SDKs. Should
>this also be applicable to the Go SDK?
>
> Thanks,
> Johanna
>
> On Thu, Oct 26, 2023 at 2:25 AM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
>
>> Dataflow (among other runners) has the ability to "upgrade" running
>> pipelines with new code (e.g. capturing bug fixes, dependency updates,
>> and limited topology changes). Unfortunately some improvements (e.g.
>> new and improved ways of writing to BigQuery, optimized use of side
>> inputs, a change in algorithm, sometimes completely internally and not
>> visible to the user) are not sufficiently backwards compatible which
>> causes us, with the motivation to not break users, to either not make
>> these changes or guard them as a parallel opt-in mode which is a
>> significant drain on both developer productivity and causes new
>> pipelines to run in obsolete modes by default.
>>
>> I created https://github.com/apache/beam/pull/29140 which adds a new
>> pipeline option, update_compatibility_version, that allows the SDK to
>> move forward while letting users with pipelines launched previously to
>> manually request the "old" way of doing things to preserve update
>> compatibility. (We should still attempt backwards compatibility when
>> it makes sense, and the old way would remain in code until such a time
>> it's actually deprecated and removed, but this means we won't be
>> constrained by it, especially when it comes to default settings.)
>>
>> Any objections or other thoughts on this approach?
>>
>> - Robert
>>
>> P.S. Separately I think it'd be valuable to elevate the vague notion
>> of update compatibility to a first-class Beam concept and put it on
>> firm footing, but that's a larger conversation outside the thread of
>> this smaller (and I think still useful in such a future world) change.
>>
>
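To illustrate the mechanics of the proposed option, here is a hedged sketch of how an SDK transform's expansion could branch on update_compatibility_version. The option name comes from the proposal; the 2.52 cutoff, the transform names, and the helper are invented for the example:

```python
# Illustrative only: an SDK transform picking between an old and a new
# expansion based on a pipeline-level compatibility option. The 2.52
# cutoff and the returned names are made up.

def _version_tuple(v):
    # "2.50.0" -> (2, 50, 0), so tuples compare in version order.
    return tuple(int(p) for p in v.split("."))

def expand_bigquery_write(update_compatibility_version=None):
    # New pipelines (option unset) get the improved implementation;
    # pipelines pinned to a pre-2.52 graph keep the legacy, update-
    # compatible expansion until it is actually deprecated and removed.
    if (update_compatibility_version is not None
            and _version_tuple(update_compatibility_version) < (2, 52)):
        return "legacy-bigquery-write"
    return "improved-bigquery-write"

print(expand_bigquery_write())          # new pipeline: improved path
print(expand_bigquery_write("2.50.0"))  # pinned pipeline: legacy path
```

This keeps the default moving forward while giving updated pipelines an explicit opt-out, which is the crux of the proposal.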


Re: [YAML] Aggregations

2023-10-18 Thread Robert Burke
MongoDB has its own concept of aggregation pipelines as well.

https://www.mongodb.com/docs/manual/core/aggregation-pipeline/#std-label-aggregation-pipeline


On Wed, Oct 18, 2023, 6:07 PM Robert Bradshaw via dev 
wrote:

> On Wed, Oct 18, 2023 at 5:06 PM Byron Ellis  wrote:
> >
> > Is it worth taking a look at similar prior art in the space?
>
> +1. Pointers welcome.
>
> > The first one that comes to mind is Transform, but with the dbt labs
> acquisition that spec is a lot harder to find. Rill is pretty similar
> though.
>
> Rill seems to be very SQL-based.
>
> > On Wed, Oct 18, 2023 at 1:12 PM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
> >>
> >> Beam Yaml has good support for IOs and mappings, but one key missing
> >> feature for even writing a WordCount is the ability to do Aggregations
> >> [1]. While the traditional Beam primitive is GroupByKey (and
> >> CombineValues), we're eschewing KVs in favor of more schema'd
> >> data (which has some precedence in our other languages, see the links
> >> below). The key components the user needs to specify are (1) the key
> >> fields on which the grouping will take place, (2) the fields
> >> (expressions?) involved in the aggregation, and (3) what aggregating
> >> fn to use.
> >>
> >> A straw-man example could be something like
> >>
> >> type: Aggregating
> >> config:
> >>   key: [field1, field2]
> >>   aggregating:
> >> total_cost:
> >>   fn: sum
> >>   value: cost
> >> max_cost:
> >>   fn: max
> >>   value: cost
> >>
> >> This would basically correspond to the SQL expression
> >>
> >> "SELECT field1, field2, sum(cost) as total_cost, max(cost) as max_cost
> >> from table GROUP BY field1, field2"
> >>
> >> (though I'm not requiring that we use this as an implementation
> >> strategy). I do not think we need a separate (non aggregating)
> >> Grouping operation, this can be accomplished by having a concat-style
> >> combiner.
> >>
> >> There are still some open questions here, notably around how to
> >> specify the aggregation fns themselves. We could of course provide a
> >> number of built-ins (like SQL does). This gets into the question of
> >> how and where to document this complete set, but some basics should
> >> take us pretty far. Many aggregators, however, are parameterized (e.g.
> >> quantiles); where do we put the parameters? We could go with something
> >> like
> >>
> >> fn:
> >>   type: ApproximateQuantiles
> >>   config:
> >> n: 10
> >>
> >> but others are even configured by functions themselves (e.g. LargestN
> >> that wants a comparator Fn). Maybe we decide not to support these
> >> (yet?)
> >>
> >> One thing I think we should support, however, is referencing custom
> >> CombineFns. We have some precedent for this with our Fns from
> >> MapToFields, where we accept things like inline lambdas and external
> >> references. Again the topic of how to configure them comes up, as
> >> these custom Fns are more likely to be parameterized than Map Fns
> >> (though, to be clear, perhaps it'd be good to allow parameterizatin of
> >> MapFns as well). Maybe we allow
> >>
> >> language: python. # like MapToFields (and here it'd be harder to mix
> >> and match per Fn)
> >> fn:
> >>   type: ???
> >>   # should these be nested as config?
> >>   name: fully.qualified.name
> >>   path: /path/to/defining/file
> >>   args: [...]
> >>   kwargs: {...}
> >>
> >> which would invoke the constructor.
> >>
> >> I'm also open to other ways of naming/structuring these essential
> >> parameters if it makes things more clear.
> >>
> >> - Robert
> >>
> >>
> >> Java:
> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/schemas/transforms/Group.html
> >> Python:
> https://beam.apache.org/documentation/transforms/python/aggregation/groupby
> >> Typescript:
> https://beam.apache.org/releases/typedoc/current/classes/transforms_group_and_combine.GroupBy.html
> >>
> >> [1] One can of course use SqlTransform for this, but I'm leaning
> >> towards offering something more native.
>
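For readers following along, the straw-man "Aggregating" config above can be modeled in plain Python: group rows by the key fields, then apply a named fn to a value field per output column. The built-in fn names (sum, max) mirror the example; everything else is a semantic sketch, not Beam YAML's implementation:

```python
# Stdlib model of the straw-man aggregation semantics. BUILTIN_FNS stands
# in for whatever set of built-ins (like SQL's) is eventually documented.
from collections import defaultdict

BUILTIN_FNS = {"sum": sum, "max": max, "min": min}

def aggregate(rows, key, aggregating):
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in key)].append(row)
    out = []
    for key_vals, members in groups.items():
        result = dict(zip(key, key_vals))          # the grouping fields
        for out_field, spec in aggregating.items():
            values = [m[spec["value"]] for m in members]
            result[out_field] = BUILTIN_FNS[spec["fn"]](values)
        out.append(result)
    return out

rows = [
    {"field1": "a", "field2": 1, "cost": 3.0},
    {"field1": "a", "field2": 1, "cost": 5.0},
    {"field1": "b", "field2": 2, "cost": 7.0},
]
print(aggregate(rows, key=["field1", "field2"],
                aggregating={"total_cost": {"fn": "sum", "value": "cost"},
                             "max_cost": {"fn": "max", "value": "cost"}}))
```

A custom CombineFn would slot in exactly where BUILTIN_FNS is consulted, which is why the configuration question (args/kwargs for a constructor) matters.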


Re: [Question] Bundle finalization callback

2023-10-15 Thread Robert Burke
I would recommend avoiding over fitting to a specific runner, but given the
constraints, I'd say being eager about self checkpointing and ensuring fine
grain splits. This will allow runners to schedule more bundles in parallel
if they are able, and provide independence between them.

Part of the issue is that the downstream transforms will eat into that ack
deadline time as well. 30s is all the time to pull the message, process it
and any children downstream of the Read, and so on.

Dataflow biases towards small bundles during streaming execution, but
setting short Process Continuation suggestions should allow for low latency.

All that said, 30s sounds fairly short for an ack timeout (knowing little
about the specific source you're adding). I know that Google Cloud PubSub
auto-extends ack deadlines as long as the client connection remains open.
This is done automatically by the client itself. That's an alternative
possibility as well if the datasource supports it: manually extending the
ack deadline until the bundle completes normally, and then allowing
finalization to happen. (Balanced with how much state stays in memory and
so on).
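The "manually extend until the bundle completes" alternative can be sketched with stdlib threading. Here extend_deadline and process are stand-ins for whatever the datasource client actually provides; nothing below is a real NATS or Pub/Sub API:

```python
# Sketch: keep extending the source's ack deadline on a background thread
# while the bundle is processing, then stop so the finalization callback
# can ack for real. All names are illustrative.
import threading

def run_with_lease_extension(message_ids, extend_deadline, process,
                             interval_s=10.0):
    done = threading.Event()

    def extender():
        # wait() returns False on timeout, True once done is set.
        while not done.wait(interval_s):
            extend_deadline(message_ids)

    t = threading.Thread(target=extender, daemon=True)
    t.start()
    try:
        return process(message_ids)
    finally:
        done.set()   # stop extending; finalization can now ack normally
        t.join()
```

The interval should be well under the ack deadline (e.g. a third of it) so an extension always lands before expiry, balanced against how much state stays in memory.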


On Sun, Oct 15, 2023, 1:30 PM Johanna Öjeling  wrote:

> Okay I see, thank you for your quick reply! I'll have a look into that
> file.
>
> Do you have an idea of on which interval I could expect the Dataflow
> runner to initiate the finalization? Thinking of the case where I have a
> message ack deadline of e.g. 30s and a continuous stream of messages that
> keeps the ProcessElement active. Then I will want to interrupt processing
> of new messages and self-checkpoint before those 30s have passed, if the
> runner hasn't initiated it within that time frame.
>
> Johanna
>
> On Sun, Oct 15, 2023, 21:13 Robert Burke  wrote:
>
>> Hi! Answers inline.
>>
>> On Sun, Oct 15, 2023, 11:48 AM Johanna Öjeling via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Hi,
>>>
>>> I'm working on a native streaming IO connector for the Go SDK to enable
>>> reads and writes from/to NATS (#29000
>>> <https://github.com/apache/beam/issues/29000>) and would like to better
>>> understand how bundle finalization works.
>>>
>>> For this use case I need to register a callback function which
>>> acknowledges the processed messages.
>>>
>>> More concretely, I wonder:
>>>
>>>- When/with which interval is the callback function invoked? Is it
>>>supposed to happen at every runner and SDF initiated checkpoint?
>>>
>>> A runner that supports it should be calling back to the SDK for
>> finalization after a bundle is completed, regardless of how the bundle
>> terminates successfully. E.g. via splits, or completed data, or process
>> continuation resume/stop.
>>
>>
>>>- Is it correct that the registered callback is called and retried
>>>on a best effort? What is the estimated success rate for this best 
>>> effort?
>>>
>>> I don't believe the callback would be retried if it failed, since
>> technically the point is to call back after the runner has committed the
>> bundle's output successfully. A failure in finalization would require
>> retractions, essentially.
>>
>> It is best effort, since a callback could expire before bundle execution
>> completes.
>>
>>>
>>> When running a WIP example pipeline
>>> <https://github.com/johannaojeling/beam/blob/cfa6babbd0c9b1196c186974165beb7559a226db/sdks/go/examples/natsread/main.go>
>>> on Dataflow the callback function is only ever called once but ignored
>>> after subsequent checkpoints so thereby my questions.
>>>
>>
>> That is very odd. My understanding is that it should be after every
>> complete bundle, if the stage contains a registered callback prior to its
>> expiration time.
>>
>> The SDK code has a lot of time.Now() checks, so depending if the
>> expiration time is set relative to processing time, it could be mistakenly
>> dropping the valid callback.
>>
>> It does look like failed callbacks are re-queued after being called;
>> (exec/plan.go:188), but only if they haven't yet expired. Expired callbacks
>> are never run.
>>
>> I'm not sure what the correct runner side behavior is though.
>>
>>
>>> If anyone could help clarify the above or point me towards some
>>> documentation that would be much appreciated!
>>>
>>> Thanks,
>>> Johanna
>>>
>>>


Re: [Question] Bundle finalization callback

2023-10-15 Thread Robert Burke
Hi! Answers inline.

On Sun, Oct 15, 2023, 11:48 AM Johanna Öjeling via dev 
wrote:

> Hi,
>
> I'm working on a native streaming IO connector for the Go SDK to enable
> reads and writes from/to NATS (#29000
> ) and would like to better
> understand how bundle finalization works.
>
> For this use case I need to register a callback function which
> acknowledges the processed messages.
>
> More concretely, I wonder:
>
>- When/with which interval is the callback function invoked? Is it
>supposed to happen at every runner and SDF initiated checkpoint?
>
> A runner that supports it should be calling back to the SDK for
finalization after a bundle is completed, regardless of how the bundle
terminates successfully. E.g. via splits, or completed data, or process
continuation resume/stop.


>- Is it correct that the registered callback is called and retried on
>a best effort? What is the estimated success rate for this best effort?
>
> I don't believe the callback would be retried if it failed, since
technically the point is to call back after the runner has committed the
bundle's output successfully. A failure in finalization would require
retractions, essentially.

It is best effort, since a callback could expire before bundle execution
completes.

>
> When running a WIP example pipeline
> 
> on Dataflow the callback function is only ever called once but ignored
> after subsequent checkpoints so thereby my questions.
>

That is very odd. My understanding is that it should be after every
complete bundle, if the stage contains a registered callback prior to its
expiration time.

The SDK code has a lot of time.Now() checks, so depending if the expiration
time is set relative to processing time, it could be mistakenly dropping
the valid callback.

It does look like failed callbacks are re-queued after being called;
(exec/plan.go:188), but only if they haven't yet expired. Expired callbacks
are never run.

I'm not sure what the correct runner side behavior is though.


> If anyone could help clarify the above or point me towards some
> documentation that would be much appreciated!
>
> Thanks,
> Johanna
>
>
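Putting the pieces above together (expiration checks, re-queuing of failed callbacks as in exec/plan.go:188, expired callbacks never run), a stdlib model of the SDK-side behavior might look like this. Class and method names are invented, not the Go SDK's actual types:

```python
# Stdlib model of bundle-finalization callback handling: callbacks carry
# an expiration; after each bundle, unexpired callbacks are invoked,
# failed ones are re-queued for a later attempt, expired ones are dropped.
import time

class Finalizer:
    def __init__(self):
        self.callbacks = []  # list of (expiration_unix_s, fn)

    def register(self, fn, ttl_s):
        self.callbacks.append((time.time() + ttl_s, fn))

    def finalize_bundle(self):
        retained = []
        for expiration, fn in self.callbacks:
            if time.time() > expiration:
                continue  # expired callbacks are never run
            try:
                fn()
            except Exception:
                retained.append((expiration, fn))  # retry after next bundle
        self.callbacks = retained
```

The "callback only ever called once" symptom would be consistent with the expiration here being computed against the wrong clock, which is the time.Now() concern mentioned above.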


Re: [PROPOSAL] [Nice-to-have] CI job names and commands that match

2023-10-10 Thread Robert Burke
Conversely, by unifying to Gradle command names, it also teaches how folks
can run these things locally.

Doesn't help entirely with discoverability, or initial scrutability, but it
feels lower impedance than someone needing to look at the action manually
to learn what it's running under the hood (for simple, non compound cases)

On Tue, Oct 10, 2023, 8:49 AM Danny McCormick via dev 
wrote:

> > Just to clarify: I'm not proposing tying them to gradle tasks (I'm fine
> with `go test` for example) or doing this in situations where it is
> unnatural.
>
> > My example probably confused this because I left off the `./gradlew`
> just to save space. I'm proposing naming them after their obvious repro
> command, wherever applicable. Mostly fixing stuff like how the status label
> "Google Cloud Dataflow Runner V2 Java ValidatesRunner Tests (streaming)" is
> literally a less useful way of writing
> ":runners:google-cloud-dataflow-java:validatesRunnerV2Streaming".
>
> > FWIW I think Yi's example demonstrates an anti-pattern (mixing
> hermetic/reliable and non-hermetic/possibly-flaky tests in one test signal)
> but I'm sure there are indeed jobs where this doesn't make sense.
>
> Ah cool - I'm still -1 on this, though less strongly than if it was
> actually tying it to Gradle. Most of these would probably be gradle
> commands anyways as proposed. It is often a good idea to have workflows
> that have multiple consecutive gradle commands (also addressing Robert's
> point here) so there's not a single repro command. A couple examples where
> this might make sense:
>
> 1) A job that runs the same performance test on a remote runner with a
> bunch of different configs. A single gradle task has the downsides of less
> clear observability (which task actually caused the failed build?) and
> resource wasting on failures (early failures keep us from using as many
> resources running multiple possibly expensive jobs).
> 2) A single runner validation suite that runs Java/Python/Go tests for
> that runner. We don't really do this much, but it is a totally reasonable
> (desirable?) way to structure tests.
> 3) An expensive job that has one command to run a small set of tests and
> then another to run a more expensive set only if the initial one passed.
>
> It's also worth mentioning that often the command to run a step is quite
> long, especially for things like perf tests that have lots of flags to pass.
>
> All of these examples *could *be bundled into a single gradle command,
> but they shouldn't have to be. Instead we should have a workflow interface
> that is as independent of implementation as possible IMO and represents an
> abstraction of what we actually want to test (e.g. "Run Dataflow Runner
> Tests", or "Run Java Examples Tests"). This also avoids us taking a
> dependency that could go out of date if we change our commands.
>
> > Mostly fixing stuff like how the status label "Google Cloud Dataflow
> Runner V2 Java ValidatesRunner Tests (streaming)" is literally a less
> useful way of writing
> ":runners:google-cloud-dataflow-java:validatesRunnerV2Streaming".
>
> Last thing I'll add - this is true for you and probably many contributors,
> but is less friendly for new folks who are less familiar with the project
> IMO (especially when the filepath/command is less obvious).
>
> On Tue, Oct 10, 2023 at 12:29 PM Robert Burke  wrote:
>
>> +1 to the general proposal.
>>
>> I'm not bothered if something says a gradle command and in execution,
>> that task ends up running multiple different commands. Arguably, if we're
>> running a gradle task manually to prepare for a subsequent task, that latter
>> task should be adding the former to its dependencies.
>>
>> Also agree that this is a post Jenkins exit task. One migration in an
>> area at a time please.
>>
>> On Tue, Oct 10, 2023, 8:07 AM Kenneth Knowles  wrote:
>>
>>> On Tue, Oct 10, 2023 at 10:21 AM Danny McCormick via dev <
>>> dev@beam.apache.org> wrote:
>>>
>>>> I'm +1 on:
>>>> - standardizing our naming
>>>> - making job names match their commands exactly (though I'd still like
>>>> the `Run` prefix so that you can do things like say "Suite XYZ failed"
>>>> without triggering the automation)
>>>> - removing pre/postcommit from the naming (we actually already run many
>>>> precommits as postcommits as well)
>>>
>>>
>>> Fully agree with your point of keeping "Run" as the magic word and the
>>> way we have it today.
>>>
>>> I'm -0 on:
>>>
>>>> - Doing this immediately - I'd prefer we wait til the Jenkins to Actions
>>>> migration is done and we can do this in bulk versus renaming things as we
>>>> go since we're so close to the finish line and exact parity makes reviews
>>>> easier.

Re: [PROPOSAL] [Nice-to-have] CI job names and commands that match

2023-10-10 Thread Robert Burke
+1 to the general proposal.

I'm not bothered if something says a gradle command and in execution, that
task ends up running multiple different commands. Arguably, if we're
running a gradle task manually to prepare for a subsequent task, that latter
task should be adding the former to its dependencies.

Also agree that this is a post Jenkins exit task. One migration in an area
at a time please.

On Tue, Oct 10, 2023, 8:07 AM Kenneth Knowles  wrote:

> On Tue, Oct 10, 2023 at 10:21 AM Danny McCormick via dev <
> dev@beam.apache.org> wrote:
>
>> I'm +1 on:
>> - standardizing our naming
>> - making job names match their commands exactly (though I'd still like
>> the `Run` prefix so that you can do things like say "Suite XYZ failed"
>> without triggering the automation)
>> - removing pre/postcommit from the naming (we actually already run many
>> precommits as postcommits as well)
>
>
> Fully agree with your point of keeping "Run" as the magic word and the way
> we have it today.
>
> I'm -0 on:
>
>> - Doing this immediately - I'd prefer we wait til the Jenkins to Actions
>> migration is done and we can do this in bulk versus renaming things as we
>> go since we're so close to the finish line and exact parity makes reviews
>> easier.
>
>
> Cool. And indeed this is why I didn't "just do it' (aside from this being
> enough impact to people's daily lives that I wanted to get feedback from
> dev@). If we can fold it in as a last step to the migration, that would
> be a nice-to-have. Otherwise ping back when ready please :-)
>
> On Tue, Oct 10, 2023 at 11:15 AM Yi Hu via dev 
> wrote:
>
>> Thanks for raising this. This generally works, though some jobs run more
>> than one gradle task (e.g. some IO_Direct_PreCommit run both :build (which
>> executes unit tests) and :integrationTest).
>>
>
> On Tue, Oct 10, 2023 at 10:21 AM Danny McCormick via dev <
> dev@beam.apache.org> wrote:
>
>> I'm -1 on:
>> - Tying the naming to gradle - like Yi mentioned some workflows have
>> multiple gradle tasks, some don't have any, and I think that's ok.
>>
>
> Just to clarify: I'm not proposing tying them to gradle tasks (I'm fine
> with `go test` for example) or doing this in situations where it is
> unnatural.
>
> My example probably confused this because I left off the `./gradlew` just
> to save space. I'm proposing naming them after their obvious repro command,
> wherever applicable. Mostly fixing stuff like how the status label "Google
> Cloud Dataflow Runner V2 Java ValidatesRunner Tests (streaming)" is
> literally a less useful way of writing
> ":runners:google-cloud-dataflow-java:validatesRunnerV2Streaming".
>
> FWIW I think Yi's example demonstrates an anti-pattern (mixing
> hermetic/reliable and non-hermetic/possibly-flaky tests in one test signal)
> but I'm sure there are indeed jobs where this doesn't make sense.
>
> Kenn
>
>
>
>>
>> Thanks,
>> Danny
>>
>>
>>>
>>> Another option is to normalize the naming of every job, saying the job
>>> name is X, then workflow name is PreCommit_X or PostCommit_X, and the
>>> phrase is Run X. Currently most PreCommit follow this pattern, but there
>>> are also many outliers. A good start could be to clean up all jobs
>>> to follow the same pattern.
>>>
>>>
>>> On Tue, Oct 10, 2023 at 9:57 AM Kenneth Knowles  wrote:
>>>
 FWIW I aware of the README in
 https://github.com/apache/beam/tree/master/.test-infra/jenkins that
 lists the phrases alongside the jobs. This is just wasted work to maintain
 IMO.

 Kenn

 On Tue, Oct 10, 2023 at 9:46 AM Kenneth Knowles 
 wrote:

> *Proposal:* make all the job names exactly match the GH comment to
> run them and make it also as close as possible to how to reproduce locally
>
> *Example problems*:
>
>  - We have really silly redundant jobs results like 'Chicago Taxi
> Example on Dataflow ("Run Chicago Taxi on Dataflow")' and
> 'Python_Xlang_IO_Dataflow ("Run Python_Xlang_IO_Dataflow PostCommit")'
>
>  - We have jobs that there's no way you could guess the command
> 'Google Cloud Dataflow Runner V2 Java ValidatesRunner Tests (streaming)'
>
>  - (nit) We are weirdly inconsistent about using spaces vs
> underscores. I don't think any of our infrastructure cares about this.
>
> *Extra proposal*: make the job name also the local command, where
> possible
>
> *Example: *
> https://github.com/apache/beam/blob/master/.github/workflows/beam_PostCommit_Java_ValidatesRunner_Dataflow.yml
>
>  - This runs :runners:google-cloud-dataflow-java:validatesRunner
>  - So make the status label
> ":runners:google-cloud-dataflow-java:validatesRunner"
>  - "Run :runners:google-cloud-dataflow-java:validatesRunner" as comment
>
> If I want to run it locally, yes there are GCP things I have to set
> up, but I know the gradle command now.
>
> *Corollary*: remove "postcommit" and "precommit" from names, because
> whether a 

Re: [YAML] Fileio sink parameterization (streaming, sharding, and naming)

2023-10-09 Thread Robert Burke
I'll note that the file "Writes" in the Go SDK are currently an unscalable
antipattern, because of this exact question.

Aside from carefully examining other SDKs, it's not clear how one authors a
reliable, automatically shardable, window- and pane-aware file sink in an
arbitrary SDK simply by referring to common Beam constructs.

Closely examining how other SDKs do it is time consuming and an
antipattern, and doesn't lend itself to educating arbitrary beam end users
on good patterns and why they work, because they tend not to have that sort
of commentary (for all the complexity you mention.)

But it's just as likely I missed a document somewhere. It has been a while
since I last searched for this, let alone have time to do the deep dives
required to produce it.

Robert Burke
Beam Go Busybody


On Mon, Oct 9, 2023, 12:37 PM Robert Bradshaw via dev 
wrote:

> Currently the various file writing configurations take a single parameter,
> path, which indicates where the (sharded) output should be placed. In other
> words, one can write something like
>
>   pipeline:
> ...
> sink:
>   type: WriteToParquet
>   config:
> path: /beam/filesytem/dest
>
> and one gets files like "/beam/filesystem/dest-X-of-N"
>
> Of course, in practice file writing is often much more complicated than
> this (especially when it comes to Streaming). For reference, I've included
> links to our existing offerings in the various SDKs below. I'd like to
> start a discussion about what else should go in the "config" parameter and
> how it should be expressed in YAML.
>
> The primary concern is around naming. This can generally be split into (1)
> the prefix, which must be provided by the user, (2) the sharding
> information, which includes both shard counts (e.g. the -X-of-N suffix) and
> windowing information (for streaming pipelines) which we may want to allow
> the user to customize the formatting of, and (3) a suffix like .json or
> .avro that is useful for both humans and tooling and can often be inferred
> but should allow customization as well.
>
> An interesting case is that of dynamic destinations, where the prefix (or
> other parameters) may themselves be functions of the records themselves. (I
> am excluding the case where the format itself is variable--such cases are
> probably better handled by explicitly partitioning the data and doing
> multiple writes, as this introduces significant complexities and the set of
> possible formats is generally finite and known ahead of time.) I propose
> that we leverage the fact that we have structured data to be able to pull
> out these dynamic parameters. For example, if we have an input data set
> with a string column my_col we could allow something like
>
>   config:
> path: {dynamic: my_col}
>
> which would pull this information out at runtime. (With the MapToFields
> transform, it is very easy to compute/append additional fields to existing
> records.) Generally this field would then be stripped from the written
> data, which would only see the subset of non-dynamically referenced columns
> (though this could be configurable: we could add an attribute like
> {dynamic: my_col, keep: true} or require the set of columns to be actually
> written (or elided) to be enumerated in the config or allow/require the
> actual data to be written to be in a designated field of the "full" input
> records as arranged by a preceding transform). It'd be great to get
> input/impressions from a wide range of people here on what would be the
> most natural. Often just writing out snippets of various alternatives can
> be quite informative (though I'm avoiding putting them here for the moment
> to avoid biasing ideas right off the bat).
>
> For streaming pipelines it is often essential to write data out in a
> time-partitioned manner. The typical way to do this is to add the windowing
> information into the shard specification itself, and a (set of) file(s) is
> written on each window closing. Beam YAML already supports any transform
> being given a "windowing" configuration which will cause a WindowInto
> transform to be applied to its input(s) before application which can sit
> naturally on a sink. We may want to consider if non-windowed writes make
> sense as well (though how this interacts with the watermark and underlying
> implementations are a large open question, so this is a larger change that
> might make sense to defer).
>
> Note that I am explicitly excluding "coders" here. All data in YAML should
> be schema'd, and writers should know how to write this structured data. We
> may want to allow a "schema" field to allow a user to specify the desired
> schema in a manner compatible with the sink format itself (
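One way to picture the {dynamic: my_col} straw man above: partition records by the value of the referenced column, and strip that column from the written rows by default. Purely illustrative — the function name and the keep flag are assumptions, not Beam YAML's implementation:

```python
# Sketch of dynamic destinations: the path config either names a static
# destination or references a column via {"dynamic": col}. Records are
# grouped per destination; the dynamic column is stripped unless kept.
from collections import defaultdict

def partition_by_dynamic_path(records, path_spec, keep=False):
    if not (isinstance(path_spec, dict) and "dynamic" in path_spec):
        return {path_spec: list(records)}   # static path: one destination
    col = path_spec["dynamic"]
    by_dest = defaultdict(list)
    for record in records:
        dest = record[col]
        row = record if keep else {k: v for k, v in record.items() if k != col}
        by_dest[dest].append(row)
    return dict(by_dest)

records = [{"my_col": "/out/a", "v": 1}, {"my_col": "/out/b", "v": 2}]
print(partition_by_dynamic_path(records, {"dynamic": "my_col"}))
```

Because the data is schema'd, the dynamic parameter is just a named field, which is what makes this extraction (and a preceding MapToFields to compute it) cheap.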

Re: Reshuffle PTransform Design Doc

2023-10-05 Thread Robert Burke
Reshuffle/redistribute being a transform has the benefit of allowing
existing runners that aren't updated to be aware of the new urns to rely on
an SDK side implementation, which may be more expensive than what the
runner is able to do with that awareness.

Aka: it gives purpose to the fallback implementations.
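For reference, the SDK-side fallback expansion being discussed boils down to: pair each element with a random key, group by key, and emit the grouped values, with each element's window, timestamp, and pane serialized through the shuffle and restored afterward (as in the Go SDK's exec/reshuffle.go). A stdlib model of just the value path:

```python
# Stdlib model of the reshuffle/redistribute reference expansion. Only the
# values are tracked here; a real SDK also round-trips the per-element
# window/timestamp/pane metadata alongside each value.
import random
from collections import defaultdict

def reshuffle(elements, num_keys=16):
    keyed = [(random.randrange(num_keys), e) for e in elements]  # assign keys
    groups = defaultdict(list)
    for k, e in keyed:          # the grouping is the runner's fusion break
        groups[k].append(e)
    out = []
    for k in groups:            # drop the synthetic keys, emit elements
        out.extend(groups[k])
    return out                  # same multiset of elements, arbitrary order

print(sorted(reshuffle(list(range(10)))))
```

A runner aware of the redistribute URN can replace this whole composite with a cheap native repartition; an unaware runner just executes the expansion above, which is exactly the fallback value noted here.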

On Thu, Oct 5, 2023, 9:03 AM Kenneth Knowles  wrote:

> Another perspective, ignoring runners custom implementations and non-Java
> SDKs could be that the semantics are perfectly well defined: it is a
> composite and its semantics are defined by its implementation in terms of
> primitives. It is just that this expansion is not what we want so we should
> not use it (and also we shouldn't use "whatever the implementation does" as
> a spec for anything we care about).
>
> On Thu, Oct 5, 2023 at 11:56 AM Kenneth Knowles  wrote:
>
>> I totally agree. I am motivated right now by the fact that it is already
>> used all over the place but with no consistent semantics. Maybe it is
>> simpler to focus on just making the minimal change, which would basically
>> be to update the expansion of the Reshuffle in the Java SDK.
>>
>> Kenn
>>
>> On Thu, Oct 5, 2023 at 11:39 AM John Casey 
>> wrote:
>>
>>> Given that this is a hint, I'm not sure redistribute should be a
>>> PTransform as opposed to some other way to hint to a runner.
>>>
>>> I'm not sure of what the syntax of that would be, but a semantic no-op
>>> transform that the runner may or may not do anything with is odd.
>>>
>>> On Thu, Oct 5, 2023 at 11:30 AM Kenneth Knowles  wrote:
>>>
>>>> So a high level suggestion from Robert that I want to highlight as a
>>>> top-post:
>>>>
>>>> Instead of focusing on just fixing the SDKs and runners Reshuffle, this
>>>> could be an opportunity to introduce Redistribute which was proposed in the
>>>> long-ago thread. The semantics are identical but it is more clear that it
>>>> *only* is a hint about redistributing data and there is no expectation
>>>> of a checkpoint.
>>>>
>>>> This new name may also be an opportunity to maintain update
>>>> compatibility (though this may actually be leaving unsafe code in user's
>>>> hands) and/or separate @RequiresStableInput/checkpointing uses of Reshuffle
>>>> from redistribution-only uses of Reshuffle.
>>>>
>>>> Any other thoughts on this one high level bit?
>>>>
>>>> Kenn
>>>>
>>>> On Thu, Oct 5, 2023 at 11:15 AM Kenneth Knowles 
>>>> wrote:
>>>>
>>>>>
>>>>> On Wed, Oct 4, 2023 at 7:45 PM Robert Burke 
>>>>> wrote:
>>>>>
>>>>>> LGTM.
>>>>>>
>>>>>> It looks like the Go SDK already adheres to these semantics as well for
>>>>>> the reference impl (well, reshuffle/redistribute_randomly; _by_key isn't
>>>>>> implemented in the Go SDK, and it only uses the existing unqualified
>>>>>> reshuffle URN [0]).
>>>>>>
>>>>>> The original strategy, and then for every element, the original
>>>>>> Window, TS, and Pane are all serialized, shuffled, and then deserialized
>>>>>> downstream.
>>>>>>
>>>>>>
>>>>>> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/exec/reshuffle.go#L65
>>>>>>
>>>>>>
>>>>>> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/exec/reshuffle.go#L145
>>>>>>
>>>>>> Prism at the moment vacuously implements reshuffle by omitting the
>>>>>> node, and rewriting the inputs and outputs [1], as it's a local runner 
>>>>>> with
>>>>>> single transform per bundle execution, but I was intending to make it a
>>>>>> fusion break regardless.  Ultimately prism's "test" variant will default 
>>>>>> to
>>>>>> executing the SDKs dictated reference implementation for the 
>>>>>> composite(s),
>>>>>> and any "fast" or "prod" variant would simply do the current 
>>>>>> implementation.
>>>>>>
>>>>>
>>>>> Very nice!
>>>>>
>>>>> And of course I should have linked out to the existing reshuffle URN
>>>>> in the proto.
>>>>>
>>>>> Kenn
&

Re: Reshuffle PTransform Design Doc

2023-10-04 Thread Robert Burke
LGTM.

It looks like the Go SDK already adheres to these semantics as well for the 
reference impl (well, reshuffle/redistribute_randomly; _by_key isn't implemented 
in the Go SDK, and it only uses the existing unqualified reshuffle URN [0]).

The original strategy, and then for every element, the original Window, TS, and 
Pane are all serialized, shuffled, and then deserialized downstream.

https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/exec/reshuffle.go#L65

https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/exec/reshuffle.go#L145

Prism at the moment vacuously implements reshuffle by omitting the node, and 
rewriting the inputs and outputs [1], as it's a local runner with single 
transform per bundle execution, but I was intending to make it a fusion break 
regardless. Ultimately prism's "test" variant will default to executing the 
SDK's dictated reference implementation for the composite(s), and any "fast" or 
"prod" variant would simply do the current implementation.

Robert Burke
Beam Go Busybody

[0]: 
https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/runtime/graphx/translate.go#L46C3-L46C50
[1]: 
https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/runners/prism/internal/handlerunner.go#L82
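The reference expansion discussed in this thread — pair each element with an arbitrary key, group by key, then strip the key, carrying the original timestamp/window/pane metadata through unchanged — can be sketched in plain Python. This is a conceptual sketch with a made-up `reshuffle` function, not the SDK's actual code:

```python
import random
from collections import defaultdict

def reshuffle(elements, num_keys=16):
    """Plain-Python sketch of Reshuffle's reference expansion: pair each
    element with an arbitrary key, group by key (the runner's shuffle),
    then strip the key. Each element's metadata (here just a timestamp,
    standing in for window/timestamp/pane) rides through unchanged."""
    keyed = [(random.randrange(num_keys), elem) for elem in elements]
    groups = defaultdict(list)  # the grouping is where the fusion break happens
    for key, elem in keyed:
        groups[key].append(elem)
    return [elem for group in groups.values() for elem in group]

# Values paired with their original timestamps survive the shuffle intact.
data = [("a", 1), ("b", 2), ("c", 3)]
assert sorted(reshuffle(data)) == data
```

Note the sketch only captures the redistribution semantics; whether the grouping implies a checkpoint is exactly the runner-dependent question the design doc addresses.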



On 2023/09/26 15:43:53 Kenneth Knowles wrote:
> Hi everyone,
> 
> Recently there was a bug [1] caused by discrepancies between two of
> Dataflow's reshuffle implementations. I think the reference implementation
> in the Java SDK [2] also does not match. This all led to discussion on the
> bug and the pull request [3] about what the actual semantics should be. I
> got it wrong, maybe multiple times. So I wrote up a very short document to
> finish the discussion:
> 
> https://s.apache.org/beam-reshuffle
> 
> This is also probably among the simplest imaginable use of
> http://s.apache.org/ptransform-design-doc in case you want to see kind of
> how I intended it to be used.
> 
> Kenn
> 
> [1] https://github.com/apache/beam/issues/28219
> [2]
> https://github.com/apache/beam/blob/d52b077ad505c8b50f10ec6a4eb83d385cdaf96a/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Reshuffle.java#L84
> [3] https://github.com/apache/beam/pull/28272
> 


Re: Runner Bundling Strategies

2023-09-26 Thread Robert Burke
Oh neat, Preserving Keys. I didn't think we had/provided a mechanism for
declaring that.

Good doc.

I do know there's no annotation for FinishBundle. It's generally optional
and SDK side only as a concern.

Finalize Bundle is a different mechanism which does have an annotation, and
requires both SDK and runner support to trigger after a bundle has been
committed/checkpointed.


On Tue, Sep 26, 2023, 8:54 AM Kenneth Knowles  wrote:

>
>
> On Mon, Sep 25, 2023 at 1:19 PM Jan Lukavský  wrote:
>
>> Hi Kenn and Reuven,
>>
>> I agree with all these points. The only issue here seems to be that
>> FlinkRunner does not fulfill these constraints. This is a bug that can be
>> fixed, though we need to change some defaults, as 1000 ms default bundle
>> "duration" for lower traffic Pipelines can be too much. We are also
>> probably missing some @ValidatesRunner tests for this. I created [1] and
>> [2] to track this.
>>
>> One question still remains, the bundle vs. element life-cycle is relevant
>> only for cases where processing of element X can affect processing of
>> element Y later in the same bundle. Once this influence is ruled out (i.e.
>> no caching), this information can result in runner optimizations that yield
>> better performance. Should we consider propagating this information from user
>> code to the runner?
>>
> Yes!
>
> This was the explicit goal of the move to annotation-driven DoFn in
> https://s.apache.org/a-new-dofn to make it so that the SDK and runner can
> get good information about what the DoFn requirements are.
>
> When there is no @FinishBundle method, the runner can make additional
> optimizations. This should have been included in the ParDoPayload in the
> proto when we moved to portable pipelines. I cannot remember if there was a
> good reason that we did not do so. Maybe we (incorrectly) thought that this
> was an issue that only the Java SDK harness needed to know about.
>
> Kenn
>
>
>> [1] https://github.com/apache/beam/issues/28649
>>
>> [2] https://github.com/apache/beam/issues/28650
>> On 9/25/23 18:31, Reuven Lax via dev wrote:
>>
>>
>>
>> On Mon, Sep 25, 2023 at 6:19 AM Jan Lukavský  wrote:
>>
>>>
>>> On 9/23/23 18:16, Reuven Lax via dev wrote:
>>>
>>> Two separate things here:
>>>
>>> 1. Yes, a watermark can update in the middle of a bundle.
>>> 2. The records in the bundle themselves will prevent the watermark from
>>> updating as they are still in flight until after finish bundle. Therefore
>>> simply caching the records should always be watermark safe, regardless of
>>> the runner. You will only run into problems if you try and move timestamps
>>> "backwards" - which is why Beam strongly discourages this.
>>>
>>> This is not aligned with  FlinkRunner's implementation. And I actually
>>> think it is not aligned conceptually.  As mentioned, Flink does not have
>>> the concept of bundles at all. It achieves fault tolerance via
>>> checkpointing, essentially checkpoint barrier flowing from sources to
>>> sinks, safely snapshotting state of each operator on the way. Bundles are
>>> implemented as a somewhat arbitrary set of elements between two consecutive
>>> checkpoints (there can be multiple bundles between checkpoints). A bundle
>>> is 'committed' (i.e. persistently stored and guaranteed not to retry) only
>>> after the checkpoint barrier passes over the elements in the bundle (every
>>> bundle is finished at the very latest exactly before a checkpoint). But
>>> watermark propagation and bundle finalization is completely unrelated. This
>>> might be a bug in the runner, but requiring checkpoint for watermark
>>> propagation will introduce insane delays between processing time and
>>> watermarks, every executable stage will delay watermark propagation until a
>>> checkpoint (which is typically the order of seconds). This delay would add
>>> up after each stage.
>>>
>>
>> It's not bundles that hold up processing, rather it is elements, and
>> elements are not considered "processed" until FinishBundle.
>>
>> You are right about Flink. In many cases this is fine - if Flink rolls
>> back to the last checkpoint, the watermark will also roll back, and
>> everything stays consistent. So in general, one does not need to wait for
>> checkpoints for watermark propagation.
>>
>> Where things get a bit weirder with Flink is whenever one has external
>> side effects. In theory, one should wait for checkpoints before letting a
>> Sink flush, otherwise one could end up with incorrect outputs (especially
>> with a sink like TextIO). Flink itself recognizes this, and that's why they
>> provide TwoPhaseCommitSinkFunction, which
>> waits for a checkpoint. In Beam, this is the reason we introduced
>> RequiresStableInput. Of course in practice many Flink users don't do this -
>> in which case they are prioritizing latency over data correctness.
>>
>>>
>>> 
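The bundle lifecycle described above — elements are cached in processing and not considered "processed" until FinishBundle — is what makes batching DoFns watermark-safe. A plain-Python model of that pattern (a hypothetical `BatchingDoFn` class modeled on Beam's DoFn lifecycle, not the SDK's actual API surface):

```python
class BatchingDoFn:
    """Plain-Python model of a DoFn that caches elements in process()
    and flushes in finish_bundle(). This is safe with respect to the
    watermark because the runner does not consider the cached elements
    processed until the bundle finishes."""

    def __init__(self, batch_size=3):
        self.batch_size = batch_size
        self.batch = []
        self.flushed = []  # stands in for emitted output

    def process(self, element):
        self.batch.append(element)
        if len(self.batch) >= self.batch_size:
            self._flush()

    def finish_bundle(self):
        # The runner calls this before committing the bundle; anything
        # still cached must be emitted here or it would be lost.
        self._flush()

    def _flush(self):
        if self.batch:
            self.flushed.append(list(self.batch))
            self.batch.clear()

fn = BatchingDoFn(batch_size=3)
for e in range(5):
    fn.process(e)
fn.finish_bundle()
assert fn.flushed == [[0, 1, 2], [3, 4]]
```

Side effects with external systems are the exception, as the thread notes: flushing to an external sink in finish_bundle still needs RequiresStableInput (or a checkpoint-aware sink) for correctness under retries.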

Re: Runner Bundling Strategies

2023-09-21 Thread Robert Burke
Depends entirely on the use case really.

Currently for the Prism runner I'm working on for the Go SDK is "bundles
are the size of ready data", which will do OK for having lower latency for
downstream transforms.  It will also tell the SDK to split bundles if an
element takes longer than 200milliseconds to process.

Dataflow Batch jobs will generally start with extremely large bundle sizes
and then use channel splitting and sub-element splitting to divide work
further than the initial splits. This is basically the opposite of the
strategy you describe.

Dataflow streaming tends to do hundreds of single elements bundles per
worker to reduce processing latency.

I can't speak to the Flink and Spark strategies.

Robert Burke
Beam Go Busybody

On Thu, Sep 21, 2023, 4:24 PM Joey Tran  wrote:

> Writing a runner and the first strategy for determining bundling size was
> to just start with a bundle size of one and double it until we reach a size
> that we expect to take some target per-bundle runtime (e.g. maybe 10
> minutes). I realize that this isn't the greatest strategy for transforms
> with a high per-element cost. I'm curious what kind of strategies other
> runners take?
>
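The doubling approach described in the question above can be sketched as a toy heuristic. The name `next_bundle_size` and the 10-minute target are illustrative, not any runner's actual code:

```python
def next_bundle_size(current_size, observed_runtime_s, target_runtime_s=600.0):
    """Sketch of the 'start at 1 and double' heuristic: grow the bundle
    until a single bundle takes roughly target_runtime_s to process."""
    if observed_runtime_s < target_runtime_s / 2:
        return current_size * 2  # still cheap: double the bundle
    return current_size          # near the target: hold steady

# Suppose each element takes ~100s; with a 600s target, growth stops at 4.
size = 1
while next_bundle_size(size, size * 100.0) != size:
    size = next_bundle_size(size, size * 100.0)
assert size == 4
```

As the replies point out, the useful strategy varies a lot by runner and mode: large initial bundles plus dynamic splitting for batch, tiny bundles for low-latency streaming.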


Re: User-facing website vs. contributor-facing website

2023-09-21 Thread Robert Burke
TBH I find the wiki to be entirely unfriendly. It's hard to find things in it,
and it isn't discoverable. The syntax is archaic and the UI is wonky. There's
no "flow" to it, no common entry point, etc.

I'd rather the release guide remain in Github as markdown, even if it's not
on the website anymore.

This also keeps contributor documentation where contributors are actually
working.

Also, the lack of review for wiki changes is convenient for tiny typos but
honestly I'd prefer our documentation gets read by at least one other
person when we commit to it.

So I'm +1 to moving it, but -0 if it's to the wiki.


On Thu, Sep 21, 2023, 9:58 AM Chamikara Jayalath via dev <
dev@beam.apache.org> wrote:

> I might be wrong but I think of wiki as a more volatile and a less
> reliable place than the Website (can be updated without a review by any
> committer and we do that quite often). I think things in the
> contribution guide are key to a healthy Beam community so I'd like them to
> be in a more stable place that gets reviewed appropriately when updated.
>
> Thanks,
> Cham
>
> On Thu, Sep 21, 2023 at 9:14 AM Danny McCormick via dev <
> dev@beam.apache.org> wrote:
>
>> +1 on moving the release guide. I'd argue that everything under the
>> `contribute` tag other than the main page (
>> https://beam.apache.org/contribute/) and the link to CONTRIBUTING.md
>>  makes more
>> sense on the wiki (we can keep the section with the sidebar links just
>> redirecting to the wiki). I don't think it makes sense to move anything
>> else, but the contributing section is inherently "dev focused".
>>
>> Thanks,
>> Danny
>>
>> On Thu, Sep 21, 2023 at 11:58 AM Kenneth Knowles  wrote:
>>
>>> Hello!
>>>
>>> I am reviving a discussion that began at
>>> https://lists.apache.org/thread/w4g8xpg4215nlq86hxbd6n3q7jfnylny when
>>> we started our Confluence wiki and has even been revived once before.
>>>
>>> The conclusion of that thread was basically "yes, let us separate the
>>> contributor-facing stuff to a different site". It also was the boot up of
>>> the Confluence wiki but I want to not discuss tech/hosting for this thread.
>>> I want to focus on the issue of having a separate user-facing website vs a
>>> contributor-facing website. Some things like issue priorities are
>>> user-and-dev facing and they require review for changes and should stay on
>>> the user site. I also don't want to get into those more complex cases.
>>>
>>> We are basically in a halfway state today because I didn't have enough
>>> volunteer time to finish everything and I did not wrangle enough help.
>>>
>>> So now I am release manager and encountering the docs more closely
>>> again. The release docs really blend stuff.
>>>
>>>   - The main release guide is on the website.
>>>  - Some steps, though, are GitHub Issues that we push along from release
>>> Milestone to the next one.
>>>  - The actual technical bits to do the steps are sometimes on the
>>> confluence wiki
>>>  - I expect I will also be touching README files in various folders of
>>> the repo
>>>
>>> So I just want to make some more steps, and I wanted to ask the
>>> community for their current thoughts. I think one big step could be to move
>>> the release guide itself to the dev site, which is currently the wiki.
>>>
>>> What do you think? Are there any other areas of the website that you
>>> think could just move to the wiki today?
>>>
>>> Kenn
>>>
>>> p.s. Some time in the past I saw an upper right corner fold (like
>>> https://www.istockphoto.com/illustrations/paper-corner-fold) that took
>>> you to the dev site that looked the same with different color scheme. That
>>> was fun :-)
>>>
>>


Re: [PROPOSAL] Preparing for 2.51.0 Release

2023-09-13 Thread Robert Burke
Thanks Kenn!


On Wed, Sep 13, 2023, 6:20 PM Kenneth Knowles  wrote:

> Hello Beam community!
>
> The next release (2.51.0) branch cut is scheduled for September 20, 2023,
> one week from today, according to the release calendar [1].
>
> I'd like to volunteer to perform this release. My plan is to cut the
> branch on that date, and cherrypick release-blocking fixes afterwards, if
> any.
>
> Please help me make sure the release goes smoothly by:
>
> - Making sure that any unresolved release blocking issues for 2.51.0
> should have their "Milestone" marked as "2.51.0 Release" as soon as
> possible.
>
> - Reviewing the current release blockers [2] and remove the Milestone if
> they don't meet the criteria at [3]. There are currently 12 release
> blockers.
>
> Let me know if you have any comments/objections/questions.
>
> Thanks,
>
> Kenn
>
> [1]
>
> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
> [2] https://github.com/apache/beam/milestone/15
> [3] https://beam.apache.org/contribute/release-blocking/
>


Re: Contribution of Asgarde: Error Handling for Beam?

2023-09-08 Thread Robert Burke
I would say until we *require* Asgarde on a core transform, it shouldn't be
in the main repo.

Incorporating something before there's a need for it is premature
abstraction. We can't do things because they *might* be useful. Let's see
concrete places where they are useful, or we're already having a similar
need solved a different way.

Beam is complicated by itself, and we do encourage multiple ways of solving
problems, but that says to me that having an out of repo ecosystem is the
right path, rather than incorporation.
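For context, the core pattern Asgarde packages up — apply a function per element and route failures to a dead-letter output instead of failing the pipeline — can be sketched without Beam. The function name `map_with_dead_letter` is made up for illustration; Asgarde's actual API differs:

```python
def map_with_dead_letter(fn, elements):
    """Sketch of the dead-letter pattern: apply fn per element; failures
    are captured as (element, error) pairs on a side output rather than
    propagating and failing the whole pipeline."""
    good, dead = [], []
    for elem in elements:
        try:
            good.append(fn(elem))
        except Exception as exc:
            dead.append((elem, str(exc)))
    return good, dead

good, dead = map_with_dead_letter(lambda x: 10 // x, [5, 2, 0, 1])
assert good == [2, 5, 10]
assert dead[0][0] == 0  # the failing element is preserved for inspection
```

In Beam proper this would surface as a transform with a main output and a failures output (via tagged outputs), which is the shape the thread is debating whether to pull into the core SDK.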

On Fri, Sep 8, 2023, 8:14 AM Daniel Collins via dev 
wrote:

> I think there are a lot of interesting and relatively isolated components
> of the project, it might make sense to write per-transform one pagers for
> isolated things like the most useful pieces (just basically copying the
> documentation and justifying the API) instead of doing a one-shot import or
> having it live forever in an external project.
>
> -Daniel
>
> On Fri, Sep 8, 2023 at 11:10 AM Kenneth Knowles  wrote:
>
>> I agree with everyone about "not everything has to be in the Beam repo".
>> I really like the idea of having a clearer "ecosystem" section of the
>> website, which is sort of started at
>> https://beam.apache.org/community/integrations/ but that is not very
>> prominent.
>>
>> Agree with John though. The transforms in Asgarde could potentially be
>> used in Beam. Potentially best accomplished by just adding them as
>> transforms to the core Java SDK?
>>
>> Kenn
>>
>> On Wed, Sep 6, 2023 at 1:46 PM John Casey via dev 
>> wrote:
>>
>>> Agreed on documentation and on keeping it in a separate repo.
>>>
>>> We have a few pretty significant beam extensions (scio and Dataflow
>>> Templates also come to mind) that Beam should highlight, but are separate
>>> repos for their own governance, contributions, and release reasons.
>>>
>>> The difference with Asgarde is that we might want to use it in Beam
>>> itself, which makes it more reasonable to include in the main repo.
>>>
>>> On Tue, Sep 5, 2023 at 8:36 PM Robert Bradshaw via dev <
>>> dev@beam.apache.org> wrote:
>>>
 I think this is a great library. I'm on the fence of whether it makes
 sense to include with Beam proper vs. be a library that builds on top of
 Beam. (Would there be benefits of tighter integration? There is the
 maintenance/loss of governance issue.) I am definitely not on the side that
 the entire Beam ecosystem needs to be distributed/maintained by Beam
 itself.

 Regardless of the direction we go, I think it could make a lot of sense
 to put pointers to it in our documentation.


 On Tue, Sep 5, 2023 at 7:21 AM Danny McCormick via dev <
 dev@beam.apache.org> wrote:

> I think my only concerns here are around the toil we'll be taking on,
> and will we be leaving the asgarde project in a better or worse place.
>
> From a release standpoint, we would need to release it with the same
> cadence as Beam. Adding asgarde into our standard release process seems
> fairly straightforward, though, so I'm not too worried about it - looks
> like it's basically (1) add a commit like this
> ,
> (2) run this workflow
> ,
> and (3) tag/mark the release as released on GitHub.
>
> In terms of bug fixes and improvements, though, I'm a little worried
> that we might be leaving things in a worse state since Mazlum has been the
> only contributor thus far, and he would lose some governance (and possibly
> the ability to commit code on his own). An extra motivated community 
> member
> or two could change the math a bit, but I'm not sure if there are actually
> clear advantages to including it in Apache other than visibility. Would
> adding links to our docs calling Asgarde out as an option accomplish the
> same purpose?
>
> > Let's be careful about whether these tests are included in our
> presubmits. Contrib code with flaky tests has been a major pain point in
> the past.
>
> +1 - I think if we do this I'd vote that it be in a separate repo (
> github.com/apache/beam-asgarde made sense to me).
>
> ---
>
> Overall, I'm probably a slight -1 to adding this to the Apache
> workspace, but +1 to at least adding links from the Beam docs to Asgarde.
>
> Thanks,
> Danny
>
>
>
> On Tue, Sep 5, 2023 at 12:03 AM Reuven Lax via dev <
> dev@beam.apache.org> wrote:
>
>> Let's be careful about whether these tests are included in our
>> presubmits. Contrib code with flaky tests has been a major pain point in
>> the past.
>>
>> On Sat, Sep 2, 2023 at 12:02 PM Austin Bennett 
>> wrote:
>>
>>> Wanting us to not miss this. @Mazlum 

Re: [ANNOUNCE] Beam 2.50.0 Released

2023-08-30 Thread Robert Burke
Of course I missed replacing one of them. Thanks!

On Wed, Aug 30, 2023 at 11:43 AM Robert Burke  wrote:

> The Apache Beam Team is pleased to announce the release of version 2.50.0.
>
> You can download the release here:
>
> https://beam.apache.org/get-started/downloads/
>
> This release includes bug fixes, features, and improvements detailed on the
> Beam Blog: https://beam.apache.org/blog/beam-2.50.0/
> and the Github release page
> https://github.com/apache/beam/releases/tag/v2.50.0
>
> Thanks to everyone who contributed to this release, and we hope you enjoy
> using Beam 2.50.0.
>
> -- Robert Burke, on behalf of the Apache Beam Team.
>


[ANNOUNCE] Beam 2.50.0 Released

2023-08-30 Thread Robert Burke
The Apache Beam Team is pleased to announce the release of version 2.49.0.

You can download the release here:

https://beam.apache.org/get-started/downloads/

This release includes bug fixes, features, and improvements detailed on the
Beam Blog: https://beam.apache.org/blog/beam-2.50.0/
and the Github release page
https://github.com/apache/beam/releases/tag/v2.50.0

Thanks to everyone who contributed to this release, and we hope you enjoy
using Beam 2.50.0.

-- Robert Burke, on behalf of the Apache Beam Team.


[RESULT] [VOTE] Release 2.50.0, release candidate #2

2023-08-29 Thread Robert Burke
I'm happy to announce that we have unanimously approved this release.

There are 7 approving votes, 3 of which are binding:
* Ahmet Altay
* Chamikara Jayalath
* Robert Bradshaw

There are no disapproving votes.

Thanks everyone!


Re: [VOTE] Release 2.50.0, release candidate #2

2023-08-29 Thread Robert Burke
"""
On Tue, Aug 29, 2023, 12:27 PM Robert Bradshaw  wrote:
+1 (binding)

Verified the artifacts and signatures, they all look good. Tried a simple
Python pipeline in a fresh install that worked fine. Thanks for putting
this together.

"""


On Tue, Aug 29, 2023, 3:28 PM Robert Burke  wrote:

> Robert's vote ended up on the thread I made to solicit votes from PMC
> members.
>
> This concludes voting for RC2.
>
> On Tue, Aug 29, 2023, 12:27 PM Robert Bradshaw 
> wrote:
>
>> +1 (binding)
>>
>> Verified the artifacts and signatures, they all look good. Tried a simple
>> Python pipeline in a fresh install that worked fine. Thanks for putting
>> this together.
>>
>> On Tue, Aug 29, 2023 at 10:51 AM Robert Burke  wrote:
>>
>>> I was encouraged to ping y'all for Beam release validation and binding
>>> votes. So far it's looking good, but only 2 PMC votes.
>>>
>>> Cheers,
>>> Robert Burke
>>> 2.50.0 Release Manager
>>>
>>> -- Forwarded message -
>>> From: Robert Burke 
>>> Date: Tue, Aug 29, 2023, 10:44 AM
>>> Subject: Re: [VOTE] Release 2.50.0, release candidate #2
>>> To: 
>>>
>>>
>>> Current status is six +1 votes, two binding. Thanks to everyone so far
>>> who has participated in validating the release!
>>>
>>> Voting for RC2 will remain open until there's further PMC engagement for
>>> binding votes, or we find cause for a RC3.
>>>
>>> Thanks
>>> Robert Burke
>>> Apache Beam 2.50.0 Release Manager
>>>
>>> On 2023/08/28 19:51:17 Chamikara Jayalath via dev wrote:
>>> > +1 (binding)
>>> >
>>> > Validated by running some multi-lang jobs.
>>> >
>>> > Thanks,
>>> > Cham
>>> >
>>> > On Mon, Aug 28, 2023 at 10:40 AM Yi Hu via dev 
>>> wrote:
>>> >
>>> > > +1 (non-binding)
>>> > >
>>> > > Verified Java IO load tests (TextIO, BigQuery, Bigtable) on Dataflow
>>> > > runner (legacy and V2) using
>>> https://github.com/apache/beam/tree/master/it
>>> > >
>>> > > On Mon, Aug 28, 2023 at 1:13 PM Ahmet Altay via dev <
>>> dev@beam.apache.org>
>>> > > wrote:
>>> > >
>>> > >> +1 (binding).
>>> > >>
>>> > >> I validated python quick starts on direct and dataflow runners.
>>> Thank you
>>> > >> for working on the release!
>>> > >>
>>> > >> On Mon, Aug 28, 2023 at 8:48 AM Robert Burke 
>>> wrote:
>>> > >>
>>> > >>> Good morning!
>>> > >>>
>>> > >>> RC2 validation and vote is still open!
>>> > >>>
>>> > >>> On Sun, Aug 27, 2023, 1:28 PM XQ Hu via dev 
>>> wrote:
>>> > >>>
>>> > >>>> +1
>>> > >>>> Ran the simple Dataflow ML GPU batch job using
>>> > >>>> https://github.com/google/dataflow-ml-starter with Python
>>> 2.50.0rc2 to
>>> > >>>> validate the RC works well.
>>> > >>>>
>>> > >>>> On Sat, Aug 26, 2023 at 12:16 AM Valentyn Tymofieiev via dev <
>>> > >>>> dev@beam.apache.org> wrote:
>>> > >>>>
>>> > >>>>> +1
>>> > >>>>>
>>> > >>>>> Verified that the issue detected in RC0 has been resolved.
>>> > >>>>> Successfully ran a Python pipeline on ARM Dataflow workers.
>>> > >>>>>
>>> > >>>>> Noted that Dataflow runner logs became less verbose as the
>>> result of
>>> > >>>>> https://github.com/apache/beam/pull/27788. One line that I
>>> often pay
>>> > >>>>> attention to no longer appears at the default  INFO log level:
>>> > >>>>>
>>> > >>>>> ```
>>> > >>>>>
>>> INFO:apache_beam.runners.dataflow.dataflow_runner:2023-08-26T03:45:35.126Z:
>>> > >>>>> JOB_MESSAGE_DETAILED: All workers have finished the startup
>>> processes and
>>> > >>>>> began to receive work requests.
>>> > >>>>> ```
>>> > >>>>>
>>> > >>>

Re: [VOTE] Release 2.50.0, release candidate #2

2023-08-29 Thread Robert Burke
Robert's vote ended up on the thread I made to solicit votes from PMC
members.

This concludes voting for RC2.

On Tue, Aug 29, 2023, 12:27 PM Robert Bradshaw  wrote:

> +1 (binding)
>
> Verified the artifacts and signatures, they all look good. Tried a simple
> Python pipeline in a fresh install that worked fine. Thanks for putting
> this together.
>
> On Tue, Aug 29, 2023 at 10:51 AM Robert Burke  wrote:
>
>> I was encouraged to ping y'all for Beam release validation and binding
>> votes. So far it's looking good, but only 2 PMC votes.
>>
>> Cheers,
>> Robert Burke
>> 2.50.0 Release Manager
>>
>> -- Forwarded message -
>> From: Robert Burke 
>> Date: Tue, Aug 29, 2023, 10:44 AM
>> Subject: Re: [VOTE] Release 2.50.0, release candidate #2
>> To: 
>>
>>
>> Current status is six +1 votes, two binding. Thanks to everyone so far
>> who has participated in validating the release!
>>
>> Voting for RC2 will remain open until there's further PMC engagement for
>> binding votes, or we find cause for a RC3.
>>
>> Thanks
>> Robert Burke
>> Apache Beam 2.50.0 Release Manager
>>
>> On 2023/08/28 19:51:17 Chamikara Jayalath via dev wrote:
>> > +1 (binding)
>> >
>> > Validated by running some multi-lang jobs.
>> >
>> > Thanks,
>> > Cham
>> >
>> > On Mon, Aug 28, 2023 at 10:40 AM Yi Hu via dev 
>> wrote:
>> >
>> > > +1 (non-binding)
>> > >
>> > > Verified Java IO load tests (TextIO, BigQuery, Bigtable) on Dataflow
>> > > runner (legacy and V2) using
>> https://github.com/apache/beam/tree/master/it
>> > >
>> > > On Mon, Aug 28, 2023 at 1:13 PM Ahmet Altay via dev <
>> dev@beam.apache.org>
>> > > wrote:
>> > >
>> > >> +1 (binding).
>> > >>
>> > >> I validated python quick starts on direct and dataflow runners.
>> Thank you
>> > >> for working on the release!
>> > >>
>> > >> On Mon, Aug 28, 2023 at 8:48 AM Robert Burke 
>> wrote:
>> > >>
>> > >>> Good morning!
>> > >>>
>> > >>> RC2 validation and vote is still open!
>> > >>>
>> > >>> On Sun, Aug 27, 2023, 1:28 PM XQ Hu via dev 
>> wrote:
>> > >>>
>> > >>>> +1
>> > >>>> Ran the simple Dataflow ML GPU batch job using
>> > >>>> https://github.com/google/dataflow-ml-starter with Python
>> 2.50.0rc2 to
>> > >>>> validate the RC works well.
>> > >>>>
>> > >>>> On Sat, Aug 26, 2023 at 12:16 AM Valentyn Tymofieiev via dev <
>> > >>>> dev@beam.apache.org> wrote:
>> > >>>>
>> > >>>>> +1
>> > >>>>>
>> > >>>>> Verified that the issue detected in RC0 has been resolved.
>> > >>>>> Successfully ran a Python pipeline on ARM Dataflow workers.
>> > >>>>>
>> > >>>>> Noted that Dataflow runner logs became less verbose as the result
>> of
>> > >>>>> https://github.com/apache/beam/pull/27788. One line that I often
>> pay
>> > >>>>> attention to no longer appears at the default  INFO log level:
>> > >>>>>
>> > >>>>> ```
>> > >>>>>
>> INFO:apache_beam.runners.dataflow.dataflow_runner:2023-08-26T03:45:35.126Z:
>> > >>>>> JOB_MESSAGE_DETAILED: All workers have finished the startup
>> processes and
>> > >>>>> began to receive work requests.
>> > >>>>> ```
>> > >>>>>
>> > >>>>> Dataflow service can be adjusted to compensate for this (internal
>> > >>>>> change: http://cl/560265419 ).
>> > >>>>>
>> > >>>>> On Fri, Aug 25, 2023 at 3:05 PM Bruno Volpato via dev <
>> > >>>>> dev@beam.apache.org> wrote:
>> > >>>>>
>> > >>>>>> +1 (non-binding).
>> > >>>>>>
>> > >>>>>> Tested with
>> https://github.com/GoogleCloudPlatform/DataflowTemplates
>> > >>>>>> (Java SDK 11, Dataflow runner).
>> > >>>>>>
>> > >>>>>> Thanks Robert!
>>

Re: [VOTE] Release 2.50.0, release candidate #2

2023-08-29 Thread Robert Burke
Current status is six +1 votes, two binding. Thanks to everyone so far who has 
participated in validating the release! 

Voting for RC2 will remain open until there's further PMC engagement for 
binding votes, or we find cause for a RC3.

Thanks
Robert Burke
Apache Beam 2.50.0 Release Manager

On 2023/08/28 19:51:17 Chamikara Jayalath via dev wrote:
> +1 (binding)
> 
> Validated by running some multi-lang jobs.
> 
> Thanks,
> Cham
> 
> On Mon, Aug 28, 2023 at 10:40 AM Yi Hu via dev  wrote:
> 
> > +1 (non-binding)
> >
> > Verified Java IO load tests (TextIO, BigQuery, Bigtable) on Dataflow
> > runner (legacy and V2) using https://github.com/apache/beam/tree/master/it
> >
> > On Mon, Aug 28, 2023 at 1:13 PM Ahmet Altay via dev 
> > wrote:
> >
> >> +1 (binding).
> >>
> >> I validated python quick starts on direct and dataflow runners. Thank you
> >> for working on the release!
> >>
> >> On Mon, Aug 28, 2023 at 8:48 AM Robert Burke  wrote:
> >>
> >>> Good morning!
> >>>
> >>> RC2 validation and vote is still open!
> >>>
> >>> On Sun, Aug 27, 2023, 1:28 PM XQ Hu via dev  wrote:
> >>>
> >>>> +1
> >>>> Ran the simple Dataflow ML GPU batch job using
> >>>> https://github.com/google/dataflow-ml-starter with Python 2.50.0rc2 to
> >>>> validate the RC works well.
> >>>>
> >>>> On Sat, Aug 26, 2023 at 12:16 AM Valentyn Tymofieiev via dev <
> >>>> dev@beam.apache.org> wrote:
> >>>>
> >>>>> +1
> >>>>>
> >>>>> Verified that the issue detected in RC0 has been resolved.
> >>>>> Successfully ran a Python pipeline on ARM Dataflow workers.
> >>>>>
> >>>>> Noted that Dataflow runner logs became less verbose as the result of
> >>>>> https://github.com/apache/beam/pull/27788. One line that I often pay
> >>>>> attention to no longer appears at the default  INFO log level:
> >>>>>
> >>>>> ```
> >>>>> INFO:apache_beam.runners.dataflow.dataflow_runner:2023-08-26T03:45:35.126Z:
> >>>>> JOB_MESSAGE_DETAILED: All workers have finished the startup processes 
> >>>>> and
> >>>>> began to receive work requests.
> >>>>> ```
> >>>>>
> >>>>> Dataflow service can be adjusted to compensate for this (internal
> >>>>> change: http://cl/560265419 ).
> >>>>>
> >>>>> On Fri, Aug 25, 2023 at 3:05 PM Bruno Volpato via dev <
> >>>>> dev@beam.apache.org> wrote:
> >>>>>
> >>>>>> +1 (non-binding).
> >>>>>>
> >>>>>> Tested with https://github.com/GoogleCloudPlatform/DataflowTemplates
> >>>>>> (Java SDK 11, Dataflow runner).
> >>>>>>
> >>>>>> Thanks Robert!
> >>>>>>
> >>>>>> On Thu, Aug 24, 2023 at 7:12 PM Robert Burke 
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Two minor errata from the previous email:
> >>>>>>>
> >>>>>>> The validation spreadsheet link should be:
> >>>>>>>
> >>>>>>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1014811464
> >>>>>>>
> >>>>>>> And the source code tag is: "v2.50.0-RC2"
> >>>>>>>
> >>>>>>> On 2023/08/24 23:09:23 Robert Burke wrote:
> >>>>>>> > Hi everyone,
> >>>>>>> > Please review and vote on the release candidate #2 for the version
> >>>>>>> 2.50.0,
> >>>>>>> > as follows:
> >>>>>>> > [ ] +1, Approve the release
> >>>>>>> > [ ] -1, Do not approve the release (please provide specific
> >>>>>>> comments)
> >>>>>>> >
> >>>>>>> >
> >>>>>>> > Reviewers are encouraged to test their own use cases with the
> >>>>>>> release
> >>>>>>> > candidate, and vote +1 if
> >>>>>>> > no issues are found. Only PMC member votes will count towards the
> >>>>>>> final
> >>>>

Re: [VOTE] Release 2.50.0, release candidate #2

2023-08-28 Thread Robert Burke
Good morning!

RC2 validation and vote is still open!

On Sun, Aug 27, 2023, 1:28 PM XQ Hu via dev  wrote:

> +1
> Ran the simple Dataflow ML GPU batch job using
> https://github.com/google/dataflow-ml-starter with Python 2.50.0rc2 to
> validate the RC works well.
>
> On Sat, Aug 26, 2023 at 12:16 AM Valentyn Tymofieiev via dev <
> dev@beam.apache.org> wrote:
>
>> +1
>>
>> Verified that the issue detected in RC0 has been resolved. Successfully
>> ran a Python pipeline on ARM Dataflow workers.
>>
>> Noted that Dataflow runner logs became less verbose as the result of
>> https://github.com/apache/beam/pull/27788. One line that I often pay
>> attention to no longer appears at the default  INFO log level:
>>
>> ```
>> INFO:apache_beam.runners.dataflow.dataflow_runner:2023-08-26T03:45:35.126Z:
>> JOB_MESSAGE_DETAILED: All workers have finished the startup processes and
>> began to receive work requests.
>> ```
>>
>> Dataflow service can be adjusted to compensate for this (internal change:
>> http://cl/560265419 ).
>>
>> On Fri, Aug 25, 2023 at 3:05 PM Bruno Volpato via dev <
>> dev@beam.apache.org> wrote:
>>
>>> +1 (non-binding).
>>>
>>> Tested with https://github.com/GoogleCloudPlatform/DataflowTemplates
>>> (Java SDK 11, Dataflow runner).
>>>
>>> Thanks Robert!
>>>
>>> On Thu, Aug 24, 2023 at 7:12 PM Robert Burke 
>>> wrote:
>>>
>>>> Two minor errata from the previous email:
>>>>
>>>> The validation spreadsheet link should be:
>>>>
>>>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1014811464
>>>>
>>>> And the source code tag is: "v2.50.0-RC2"
>>>>
>>>> On 2023/08/24 23:09:23 Robert Burke wrote:
>>>> > Hi everyone,
>>>> > Please review and vote on the release candidate #2 for the version
>>>> 2.50.0,
>>>> > as follows:
>>>> > [ ] +1, Approve the release
>>>> > [ ] -1, Do not approve the release (please provide specific comments)
>>>> >
>>>> >
>>>> > Reviewers are encouraged to test their own use cases with the release
>>>> > candidate, and vote +1 if
>>>> > no issues are found. Only PMC member votes will count towards the
>>>> final
>>>> > vote, but votes from all
>>>> > community members are encouraged and helpful for finding regressions;
>>>> you
>>>> > can either test your own
>>>> > use cases or use cases from the validation sheet [10].
>>>> >
>>>> > Issues noted in RC1 vote proposal [13] have now been resolved.
>>>> >
>>>> > The staging area is available for your review, which includes:
>>>> > * GitHub Release notes [1],
>>>> > * the official Apache source release to be deployed to
>>>> dist.apache.org [2],
>>>> > which is signed with the key with fingerprint 02677FF4371A3756 (
>>>> > lostl...@apache.org) or D20316F712213422
>>>> > (GitHub Action automated) [3],
>>>> > * all artifacts to be deployed to the Maven Central Repository [4],
>>>> > * source code tag "v2.50.0-RC2" [5],
>>>> > * website pull request listing the release [6], the blog post [6], and
>>>> > publishing the API reference manual [7].
>>>> > * Java artifacts were built with Gradle 7.5.1 and OpenJDK
>>>> (Temurin)(build
>>>> > 1.8.0_382-b05).
>>>> > * Python artifacts are deployed along with the source release to the
>>>> > dist.apache.org [2] and PyPI[8].
>>>> > * Go artifacts and documentation are available at pkg.go.dev [9]
>>>> > * Validation sheet with a tab for 2.50.0 release to help with
>>>> validation
>>>> > [10].
>>>> > * Docker images published to Docker Hub [11].
>>>> > * PR to run tests against release branch [12].
>>>> >
>>>> > The vote will be open for at least 72 hours. It is adopted by majority
>>>> > approval, with at least 3 PMC affirmative votes.
>>>> >
>>>> > For guidelines on how to try the release in your projects, check out
>>>> our
>>>> > blog post at https://beam.apache.org/blog/validate-beam-release/.
>>>> >
>>>> > Thanks,
>>>> > Robert Burke
>>>> > Apache Beam 2.50.0 Release Manager
>>>> >
>>>> > [1] https://github.com/apache/beam/milestone/14
>>>> > [2] https://dist.apache.org/repos/dist/dev/beam/2.50.0/
>>>> > [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>>> > [4]
>>>> https://repository.apache.org/content/repositories/orgapachebeam-1355/
>>>> > [5] https://github.com/apache/beam/tree/v2.50.0-RC2
>>>> > [6] https://github.com/apache/beam/pull/28055
>>>> > [7] https://github.com/apache/beam-site/pull/648
>>>> > [8] https://pypi.org/project/apache-beam/2.50.0rc2/
>>>> > [9]
>>>> >
>>>> https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.50.0-RC2/go/pkg/beam
>>>> > [10]
>>>> >
>>>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1014811464
>>>> > [11] https://hub.docker.com/search?q=apache%2Fbeam&type=image
>>>> > [12] https://github.com/apache/beam/pull/27962
>>>> > [13] https://lists.apache.org/thread/xgx49zshms7253lfx6d6lsnvwf7tyyfp
>>>> >
>>>>
>>>


Re: [ANNOUNCE] New committer: Ahmed Abualsaud

2023-08-24 Thread Robert Burke
Congratulations Ahmed!!

On Thu, Aug 24, 2023, 4:08 PM Chamikara Jayalath via dev <
dev@beam.apache.org> wrote:

> Congrats Ahmed!!
>
> On Thu, Aug 24, 2023 at 4:06 PM Bruno Volpato via dev 
> wrote:
>
>> Congratulations, Ahmed!
>>
>> Very well deserved!
>>
>>
>> On Thu, Aug 24, 2023 at 6:09 PM XQ Hu via dev 
>> wrote:
>>
>>> Congratulations, Ahmed!
>>>
>>> On Thu, Aug 24, 2023, 5:49 PM Ahmet Altay via dev 
>>> wrote:
>>>
 Hi all,

 Please join me and the rest of the Beam PMC in welcoming a new
 committer: Ahmed Abualsaud (ahmedabuals...@apache.org).

 Ahmed has been part of the Beam community since January 2022, working
 mostly on IO connectors, and has made a large number of contributions to
 make Beam IOs more usable, performant, and reliable. At the same time, Ahmed
 was active on the user list and at the Beam Summit, helping users by sharing
 his knowledge.

 Considering their contributions to the project over this timeframe, the
 Beam PMC trusts Ahmed with the responsibilities of a Beam committer. [1]

 Thank you Ahmed! And we are looking to see more of your contributions!

 Ahmet, on behalf of the Apache Beam PMC

 [1]

 https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer




Re: [VOTE] Release 2.50.0, release candidate #2

2023-08-24 Thread Robert Burke
Two minor errata from the previous email:

The validation spreadsheet link should be: 
https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1014811464

And the source code tag is: "v2.50.0-RC2"

On 2023/08/24 23:09:23 Robert Burke wrote:
> Hi everyone,
> Please review and vote on the release candidate #2 for the version 2.50.0,
> as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
> 
> 
> Reviewers are encouraged to test their own use cases with the release
> candidate, and vote +1 if
> no issues are found. Only PMC member votes will count towards the final
> vote, but votes from all
> community members are encouraged and helpful for finding regressions; you
> can either test your own
> use cases or use cases from the validation sheet [10].
> 
> Issues noted in RC1 vote proposal [13] have now been resolved.
> 
> The staging area is available for your review, which includes:
> * GitHub Release notes [1],
> * the official Apache source release to be deployed to dist.apache.org [2],
> which is signed with the key with fingerprint 02677FF4371A3756 (
> lostl...@apache.org) or D20316F712213422
> (GitHub Action automated) [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v2.50.0-RC2" [5],
> * website pull request listing the release [6], the blog post [6], and
> publishing the API reference manual [7].
> * Java artifacts were built with Gradle 7.5.1 and OpenJDK (Temurin)(build
> 1.8.0_382-b05).
> * Python artifacts are deployed along with the source release to the
> dist.apache.org [2] and PyPI[8].
> * Go artifacts and documentation are available at pkg.go.dev [9]
> * Validation sheet with a tab for 2.50.0 release to help with validation
> [10].
> * Docker images published to Docker Hub [11].
> * PR to run tests against release branch [12].
> 
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
> 
> For guidelines on how to try the release in your projects, check out our
> blog post at https://beam.apache.org/blog/validate-beam-release/.
> 
> Thanks,
> Robert Burke
> Apache Beam 2.50.0 Release Manager
> 
> [1] https://github.com/apache/beam/milestone/14
> [2] https://dist.apache.org/repos/dist/dev/beam/2.50.0/
> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> [4] https://repository.apache.org/content/repositories/orgapachebeam-1355/
> [5] https://github.com/apache/beam/tree/v2.50.0-RC2
> [6] https://github.com/apache/beam/pull/28055
> [7] https://github.com/apache/beam-site/pull/648
> [8] https://pypi.org/project/apache-beam/2.50.0rc2/
> [9]
> https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.50.0-RC2/go/pkg/beam
> [10]
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1014811464
> [11] https://hub.docker.com/search?q=apache%2Fbeam&type=image
> [12] https://github.com/apache/beam/pull/27962
> [13] https://lists.apache.org/thread/xgx49zshms7253lfx6d6lsnvwf7tyyfp
> 


[VOTE] Release 2.50.0, release candidate #2

2023-08-24 Thread Robert Burke
Hi everyone,
Please review and vote on the release candidate #2 for the version 2.50.0,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)


Reviewers are encouraged to test their own use cases with the release
candidate, and vote +1 if
no issues are found. Only PMC member votes will count towards the final
vote, but votes from all
community members are encouraged and helpful for finding regressions; you
can either test your own
use cases or use cases from the validation sheet [10].

Issues noted in RC1 vote proposal [13] have now been resolved.

The staging area is available for your review, which includes:
* GitHub Release notes [1],
* the official Apache source release to be deployed to dist.apache.org [2],
which is signed with the key with fingerprint 02677FF4371A3756 (
lostl...@apache.org) or D20316F712213422
(GitHub Action automated) [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "v2.50.0-RC1" [5],
* website pull request listing the release [6], the blog post [6], and
publishing the API reference manual [7].
* Java artifacts were built with Gradle 7.5.1 and OpenJDK (Temurin)(build
1.8.0_382-b05).
* Python artifacts are deployed along with the source release to the
dist.apache.org [2] and PyPI[8].
* Go artifacts and documentation are available at pkg.go.dev [9]
* Validation sheet with a tab for 2.50.0 release to help with validation
[10].
* Docker images published to Docker Hub [11].
* PR to run tests against release branch [12].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.
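
[Editor's note: for readers new to Apache voting, the rule stated above
(majority approval, with at least three binding PMC +1 votes) can be sketched
as a small Python tally. The voter names and vote values below are invented
for illustration; this is not part of any Beam tooling.]

```python
# Illustrative sketch of the release-vote rule described above: a release
# candidate passes with at least three binding (PMC) +1 votes and more
# binding approvals than rejections. Voter names here are invented.
from dataclasses import dataclass

@dataclass
class Vote:
    voter: str
    value: int     # +1 approve, -1 do not approve
    binding: bool  # True for PMC members; only these decide the outcome

def release_passes(votes):
    binding = [v for v in votes if v.binding]
    approvals = sum(1 for v in binding if v.value > 0)
    rejections = sum(1 for v in binding if v.value < 0)
    return approvals >= 3 and approvals > rejections

votes = [
    Vote("pmc-a", +1, True),
    Vote("pmc-b", +1, True),
    Vote("pmc-c", +1, True),
    Vote("contributor-d", +1, False),  # encouraged, but non-binding
]
print(release_passes(votes))  # True
```
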

For guidelines on how to try the release in your projects, check out our
blog post at https://beam.apache.org/blog/validate-beam-release/.

Thanks,
Robert Burke
Apache Beam 2.50.0 Release Manager

[1] https://github.com/apache/beam/milestone/14
[2] https://dist.apache.org/repos/dist/dev/beam/2.50.0/
[3] https://dist.apache.org/repos/dist/release/beam/KEYS
[4] https://repository.apache.org/content/repositories/orgapachebeam-1355/
[5] https://github.com/apache/beam/tree/v2.50.0-RC2
[6] https://github.com/apache/beam/pull/28055
[7] https://github.com/apache/beam-site/pull/648
[8] https://pypi.org/project/apache-beam/2.50.0rc2/
[9]
https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.50.0-RC2/go/pkg/beam
[10]
https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=
.
..
[11] https://hub.docker.com/search?q=apache%2Fbeam&type=image
[12] https://github.com/apache/beam/pull/27962
[13] https://lists.apache.org/thread/xgx49zshms7253lfx6d6lsnvwf7tyyfp


Re: [VOTE] Release 2.50.0, release candidate #1

2023-08-22 Thread Robert Burke
Due to the severity of this issue, I am closing the vote on RC1 to produce an 
RC2 with the Python issue resolved. Some other minor cherry-picks will be done 
along with this, as well as fixes for some issues with the GitHub Actions.

That said, this doesn't affect the Java or Go SDKs, so please continue 
validation with those using RC1 in the meantime.

Robert Burke
Apache Beam 2.50.0 Release Manager

On 2023/08/22 01:35:45 Valentyn Tymofieiev via dev wrote:
> I tried running a Dataflow Python pipeline on RC1  and got an error:
> 
> Pipeline construction environment and pipeline runtime environment are not
> compatible. If you use a custom container image, check that the Python
> interpreter minor version and the Apache Beam version in your image match
> the versions used at pipeline construction time. Submission environment:
> beam:version:sdk_base:apache/beam_python3.11_sdk:2.50.0rc1. Runtime
> environment: beam:version:sdk_base:apache/beam_python3.11_sdk:2.50.0.
> Worker ID: beamapp-valentyn-08220117-08211817-m76c-harness-v38w
> 
> Opened https://github.com/apache/beam/issues/28084 to track.
> 
> 
> On Mon, Aug 21, 2023 at 10:02 AM Robert Burke  wrote:
> 
> > Hi Beamers,
> >
> > Today I'm working on the aforementioned gaps blocking this RC.
> >
> > However, it's still valuable to validate and vote on the remainder of the
> > RC in order to ensure a timely 2.50.0 release, and to find out whether
> > we'll need an RC2 or not.
> >
> > Robert Burke
> > Apache Beam 2.50.0 Release Manager
> >
> > On 2023/08/18 00:58:00 Robert Burke wrote:
> > > Hi everyone,
> > > Please review and vote on the release candidate #1 for the version
> > 2.50.0,
> > > as follows:
> > > [ ] +1, Approve the release
> > > [ ] -1, Do not approve the release (please provide specific comments)
> > >
> > >
> > > Reviewers are encouraged to test their own use cases with the release
> > > candidate, and vote +1 if
> > > no issues are found. Only PMC member votes will count towards the final
> > > vote, but votes from all
> > > community members are encouraged and helpful for finding regressions; you
> > > can either test your own
> > > use cases or use cases from the validation sheet [10].
> > >
> > > Additional notes about this RC:
> > >
> > > * There were issues in staging Dataflow clones of the portable containers
> > > to Google Container Registry and Google Artifact Registry, so those images
> > > may not yet be available at those locations, which may impact starting
> > > jobs with the RC against Google Cloud Dataflow.
> > >   * This may be worked around by explicitly setting the portable
> > > container to use with the --sdkContainerImage flag for Java, or the
> > > --environment_config flag for Python and Go.
> > > * Due to an issue with my build environment, there were problems producing
> > > two artifacts for this RC.
> > >   * The TypeScript SDK container has not yet been built or pushed. As an
> > > experimental SDK, this is not a release blocker. However, one will
> > > eventually be published. In the meantime, the 2.49.0 container should be
> > > sufficient.
> > >   * The PyDocs are not currently part of the documentation PR update.
> > > This will block the final release of 2.50.0.
> > >   * The current plan is to spend time improving the GitHub Actions for
> > > releases so they can provide these artifacts, instead of performing a
> > > local fix to my environment, to simplify future releases.
> > >
> > >
> > > The staging area is available for your review, which includes:
> > > * GitHub Release notes [1],
> > > * the official Apache source release to be deployed to dist.apache.org
> > [2],
> > > which is signed with the key with fingerprint 02677FF4371A3756 (
> > > lostl...@apache.org)  or D20316F712213422
> > > (GitHub Action automated) [3],
> > > * all artifacts to be deployed to the Maven Central Repository [4],
> > > * source code tag "v2.50.0-RC1" [5],
> > > * website pull request listing the release [6], the blog post [6], and
> > > publishing the API reference manual [7].
> > > * Java artifacts were built with Gradle 7.5.1 and OpenJDK (Temurin)(build
> > > 1.8.0_382-b05).
> > > * Python artifacts are deployed along with the source release to the
> > > dist.apache.org [2] and PyPI[8].
> > > * Go artifacts and

Re: [VOTE] Release 2.50.0, release candidate #1

2023-08-21 Thread Robert Burke
Hi Beamers,

Today I'm working on the aforementioned gaps blocking this RC.

However, it's still valuable to validate and vote on the remainder of the RC in 
order to ensure a timely 2.50.0 release, and to find out whether we'll need an 
RC2 or not.

Robert Burke
Apache Beam 2.50.0 Release Manager

On 2023/08/18 00:58:00 Robert Burke wrote:
> Hi everyone,
> Please review and vote on the release candidate #1 for the version 2.50.0,
> as follows:
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
> 
> 
> Reviewers are encouraged to test their own use cases with the release
> candidate, and vote +1 if
> no issues are found. Only PMC member votes will count towards the final
> vote, but votes from all
> community members are encouraged and helpful for finding regressions; you
> can either test your own
> use cases or use cases from the validation sheet [10].
> 
> Additional notes about this RC:
> 
> * There were issues in staging Dataflow clones of the portable containers to
> Google Container Registry and Google Artifact Registry, so those images
> may not yet be available at those locations, which may impact starting jobs
> with the RC against Google Cloud Dataflow.
>   * This may be worked around by explicitly setting the portable container
> to use with the --sdkContainerImage flag for Java, or the
> --environment_config flag for Python and Go.
> * Due to an issue with my build environment, there were problems producing
> two artifacts for this RC.
>   * The TypeScript SDK container has not yet been built or pushed. As an
> experimental SDK, this is not a release blocker. However, one will
> eventually be published. In the meantime, the 2.49.0 container should be
> sufficient.
>   * The PyDocs are not currently part of the documentation PR update. This
> will block the final release of 2.50.0.
>   * The current plan is to spend time improving the GitHub Actions for
> releases so they can provide these artifacts, instead of performing a local
> fix to my environment, to simplify future releases.
> 
> 
> The staging area is available for your review, which includes:
> * GitHub Release notes [1],
> * the official Apache source release to be deployed to dist.apache.org [2],
> which is signed with the key with fingerprint 02677FF4371A3756 (
> lostl...@apache.org)  or D20316F712213422
> (GitHub Action automated) [[3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "v2.50.0-RC1" [5],
> * website pull request listing the release [6], the blog post [6], and
> publishing the API reference manual [7].
> * Java artifacts were built with Gradle 7.5.1 and OpenJDK (Temurin)(build
> 1.8.0_382-b05).
> * Python artifacts are deployed along with the source release to the
> dist.apache.org [2] and PyPI[8].
> * Go artifacts and documentation are available at pkg.go.dev [9]
> * Validation sheet with a tab for 2.50.0 release to help with validation
> [10].
> * Docker images published to Docker Hub [11].
> * PR to run tests against release branch [12].
> 
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
> 
> For guidelines on how to try the release in your projects, check out our
> blog post at https://beam.apache.org/blog/validate-beam-release/.
> 
> Thanks,
> Robert Burke
> Apache Beam 2.50.0 Release Manager
> 
> [1] https://github.com/apache/beam/milestone/14
> [2] https://dist.apache.org/repos/dist/dev/beam/2.50.0/
> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
> [4] https://repository.apache.org/content/repositories/orgapachebeam-1353/
> [5] https://github.com/apache/beam/tree/v2.50.0-RC1
> [6] https://github.com/apache/beam/pull/28055
> [7] https://github.com/apache/beam-site/pull/647
> [8] https://pypi.org/project/apache-beam/2.50.0rc1/
> [9]
> https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.50.0-RC1/go/pkg/beam
> [10]
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=.
> ..
> [11] https://hub.docker.com/search?q=apache%2Fbeam&type=image
> [12] https://github.com/apache/beam/pull/27962
> 


Re: [PROPOSAL] Preparing for 2.50.0 Release

2023-08-17 Thread Robert Burke
RC1 is sufficiently ready for testing and validation. Please vote and discuss 
in that thread.

https://lists.apache.org/thread/xgx49zshms7253lfx6d6lsnvwf7tyyfp

Robert Burke
2.50.0 Release Manager

On 2023/08/17 02:30:00 Robert Burke wrote:
> Despite my best efforts, python continues to vex me.  RC1 is almost ready,
> just missing the beam site and doc updates PR, and (optionally) the
> typescript container.
> 
> So I'm calling it a night, and will build and send out a partial docs PR in
> the morning.
> Robert Burke
> 2.50.0 Release Manager
> 
> On Wed, Aug 16, 2023, 8:08 AM Robert Burke  wrote:
> 
> > Just a status update: Branch is cut and tagged
> >
> > https://github.com/apache/beam/tree/release-2.50.0
> > https://github.com/apache/beam/tree/v2.50.0-RC1
> >
> > I'm working on the remaining bits to have an RC. The GitHub
> > build-release-artifacts action failed to
> > build and publish the Java artifacts and stage the Docker containers.
> >
> > The former says:
> >
> > Execution failed for task ':sdks:java:io:solr:compileTestJava'.
> > GC overhead limit exceeded
> >
> > The latter is due to a partial application of the Multi-Arch build to the
> > GitHub Actions, which has already been fixed.
> >
> > The Dataflow Legacy Java worker and associated containers have been built
> > and published, and we apologize for the delay this caused. We're discussing
> > how we presently interleave Google internal processes with the release, and
> > how we can streamline things now that Dataflow is transitioning to RunnerV2
> > by default. In future releases, we may build the non-portable Dataflow Java
> > workers after the first RC is tagged and the open side is on its way.
> >
> > The hope is RC1 will be available tonight. Either way, this thread will be
> > updated with the status.
> >
> > Robert Burke
> > Beam 2.50.0 Release Manager
> >
> > On 2023/08/14 21:51:47 Robert Burke wrote:
> > > +1 to what XQ says.
> > >
> > > There will be a voting email thread once I've done the appropriate due
> > > diligence to the branch, and finish with the Dataflow artifacts.
> > >
> > > Generally speaking, the best validation is something you're using
> > already,
> > > to make sure that the new version of Beam works for your usage.
> > >
> > >
> > > On Mon, Aug 14, 2023, 2:41 PM XQ Hu via dev  wrote:
> > >
> > > > Welcome to the Beam community! Our release managers usually follow this
> > > >
> > https://beam.apache.org/contribute/release-guide/#10-vote-and-validate-release-candidate
> > > > to send the votes out and ask for any feedback regarding the release
> > > > candidate. If you could help run any validation on your side and cast
> > your
> > > > vote, it would be greatly appreciated and helpful for the community.
> > > >
> > > > On Mon, Aug 14, 2023 at 12:23 PM Hong  wrote:
> > > >
> > > >> I see, thanks for clarifying, Robert!
> > > >>
> > > >> Is there anything I can help with for validation? Is there a wiki
> > > >> page with the expected validations I can help with?
> > > >>
> > > >> Best
> > > >> Hong
> > > >>
> > > >> On 14 Aug 2023, at 14:34, Robert Burke  wrote:
> > > >>
> > > >> 
> > > >> The release branch was cut. Before the weekend, I was working on
> > getting
> > > >> the non-portable Dataflow Java worker built and available before
> > producing
> > > >> the RC1. The actual building bit doesn't take that long, but there's a
> > > >> bunch of additional validation that goes along with it.
> > > >>
> > >> The current target date for 2.50.0 is September 13th, but ultimately
> > >> it's as soon as we have a validated and voted-on RC.
> > > >>
> > > >> On Mon, Aug 14, 2023, 3:43 AM Hong Liang  wrote:
> > > >>
> > > >>> Thanks for driving this Robert!
> > > >>>
> > >>> It seems the two PRs specified have been merged. I'm a little new to
> > >>> Beam; do we have an expected release date for the 2.50 release?
> > > >>>
> > > >>> Best,
> > > >>> Hong
> > > >>>
> > > >>> On Thu, Aug 10, 2023 at 3:08 AM Robert Burke 
> > > >>> wrote:
> > > >>

[VOTE] Release 2.50.0, release candidate #1

2023-08-17 Thread Robert Burke
Hi everyone,
Please review and vote on the release candidate #1 for the version 2.50.0,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)


Reviewers are encouraged to test their own use cases with the release
candidate, and vote +1 if
no issues are found. Only PMC member votes will count towards the final
vote, but votes from all
community members are encouraged and helpful for finding regressions; you
can either test your own
use cases or use cases from the validation sheet [10].

Additional notes about this RC:

* There were issues in staging Dataflow clones of the portable containers to
Google Container Registry and Google Artifact Registry, so those images
may not yet be available at those locations, which may impact starting jobs
with the RC against Google Cloud Dataflow.
  * This may be worked around by explicitly setting the portable container
to use with the --sdkContainerImage flag for Java, or the
--environment_config flag for Python and Go.
* Due to an issue with my build environment, there were problems producing
two artifacts for this RC.
  * The TypeScript SDK container has not yet been built or pushed. As an
experimental SDK, this is not a release blocker. However, one will
eventually be published. In the meantime, the 2.49.0 container should be
sufficient.
  * The PyDocs are not currently part of the documentation PR update. This
will block the final release of 2.50.0.
  * The current plan is to spend time improving the GitHub Actions for
releases so they can provide these artifacts, instead of performing a local
fix to my environment, to simplify future releases.


The staging area is available for your review, which includes:
* GitHub Release notes [1],
* the official Apache source release to be deployed to dist.apache.org [2],
which is signed with the key with fingerprint 02677FF4371A3756 (
lostl...@apache.org)  or D20316F712213422
(GitHub Action automated) [3],
* all artifacts to be deployed to the Maven Central Repository [4],
* source code tag "v2.50.0-RC1" [5],
* website pull request listing the release [6], the blog post [6], and
publishing the API reference manual [7].
* Java artifacts were built with Gradle 7.5.1 and OpenJDK (Temurin)(build
1.8.0_382-b05).
* Python artifacts are deployed along with the source release to the
dist.apache.org [2] and PyPI[8].
* Go artifacts and documentation are available at pkg.go.dev [9]
* Validation sheet with a tab for 2.50.0 release to help with validation
[10].
* Docker images published to Docker Hub [11].
* PR to run tests against release branch [12].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

For guidelines on how to try the release in your projects, check out our
blog post at https://beam.apache.org/blog/validate-beam-release/.

Thanks,
Robert Burke
Apache Beam 2.50.0 Release Manager

[1] https://github.com/apache/beam/milestone/14
[2] https://dist.apache.org/repos/dist/dev/beam/2.50.0/
[3] https://dist.apache.org/repos/dist/release/beam/KEYS
[4] https://repository.apache.org/content/repositories/orgapachebeam-1353/
[5] https://github.com/apache/beam/tree/v2.50.0-RC1
[6] https://github.com/apache/beam/pull/28055
[7] https://github.com/apache/beam-site/pull/647
[8] https://pypi.org/project/apache-beam/2.50.0rc1/
[9]
https://pkg.go.dev/github.com/apache/beam/sdks/v2@v2.50.0-RC1/go/pkg/beam
[10]
https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=.
..
[11] https://hub.docker.com/search?q=apache%2Fbeam&type=image
[12] https://github.com/apache/beam/pull/27962


Re: [PROPOSAL] Preparing for 2.50.0 Release

2023-08-16 Thread Robert Burke
Despite my best efforts, python continues to vex me.  RC1 is almost ready,
just missing the beam site and doc updates PR, and (optionally) the
typescript container.

So I'm calling it a night, and will build and send out a partial docs PR in
the morning.
Robert Burke
2.50.0 Release Manager

On Wed, Aug 16, 2023, 8:08 AM Robert Burke  wrote:

> Just a status update: Branch is cut and tagged
>
> https://github.com/apache/beam/tree/release-2.50.0
> https://github.com/apache/beam/tree/v2.50.0-RC1
>
> I'm working on the remaining bits to have an RC. The GitHub
> build-release-artifacts action failed to
> build and publish the Java artifacts and stage the Docker containers.
>
> The former says:
>
> Execution failed for task ':sdks:java:io:solr:compileTestJava'.
> GC overhead limit exceeded
>
> The latter is due to a partial application of the Multi-Arch build to the
> GitHub Actions, which has already been fixed.
>
> The Dataflow Legacy Java worker and associated containers have been built
> and published, and we apologize for the delay this caused. We're discussing
> how we presently interleave Google internal processes with the release, and
> how we can streamline things now that Dataflow is transitioning to RunnerV2
> by default. In future releases, we may build the non-portable Dataflow Java
> workers after the first RC is tagged and the open side is on its way.
>
> The hope is RC1 will be available tonight. Either way, this thread will be
> updated with the status.
>
> Robert Burke
> Beam 2.50.0 Release Manager
>
> On 2023/08/14 21:51:47 Robert Burke wrote:
> > +1 to what XQ says.
> >
> > There will be a voting email thread once I've done the appropriate due
> > diligence to the branch, and finish with the Dataflow artifacts.
> >
> > Generally speaking, the best validation is something you're using
> already,
> > to make sure that the new version of Beam works for your usage.
> >
> >
> > On Mon, Aug 14, 2023, 2:41 PM XQ Hu via dev  wrote:
> >
> > > Welcome to the Beam community! Our release managers usually follow this
> > >
> https://beam.apache.org/contribute/release-guide/#10-vote-and-validate-release-candidate
> > > to send the votes out and ask for any feedback regarding the release
> > > candidate. If you could help run any validation on your side and cast
> your
> > > vote, it would be greatly appreciated and helpful for the community.
> > >
> > > On Mon, Aug 14, 2023 at 12:23 PM Hong  wrote:
> > >
> > >> I see, thanks for clarifying, Robert!
> > >>
> > >> Is there anything I can help with for validation? Is there a wiki page
> > >> with the expected validations I can help with?
> > >>
> > >> Best
> > >> Hong
> > >>
> > >> On 14 Aug 2023, at 14:34, Robert Burke  wrote:
> > >>
> > >> 
> > >> The release branch was cut. Before the weekend, I was working on
> getting
> > >> the non-portable Dataflow Java worker built and available before
> producing
> > >> the RC1. The actual building bit doesn't take that long, but there's a
> > >> bunch of additional validation that goes along with it.
> > >>
> > >> The current target date for 2.50.0 is September 13th, but ultimately
> > >> it's as soon as we have a validated and voted-on RC.
> > >>
> > >> On Mon, Aug 14, 2023, 3:43 AM Hong Liang  wrote:
> > >>
> > >>> Thanks for driving this Robert!
> > >>>
> > >>> It seems the two PRs specified have been merged. A little new to
> Beam,
> > >>> do we have an expected release date for the 2.50 release?
> > >>>
> > >>> Best,
> > >>> Hong
> > >>>
> > >>> On Thu, Aug 10, 2023 at 3:08 AM Robert Burke 
> > >>> wrote:
> > >>>
> > >>>> I'm in the process of producing the Cut branch, but due to various
> > >>>> delays on my part, it will not be cut today.
> > >>>>
> > >>>> There are two outstanding PRs blocking the cut,
> > >>>> https://github.com/apache/beam/pull/27947 and
> > >>>> https://github.com/apache/beam/pull/27939, but once those are in,
> I'll
> > >>>> proceed. Remaining new issues will be cherry picked as required.
> > >>>>
> > >>>> Thanks
> > >>>> Robert Burke
> > >>>> Beam 2.50.0 Release Manager
> > >>>>
> > &

Re: [RFC] Bootloader Buffered Logging

2023-08-16 Thread Robert Burke
I've added some comments but generally +1 on this.

A later change might be able to build from this to ensure the various
STDErr and STDOut logs from the SDK harness executions are always plumbed
as described.

But that would take more thought, since other incidental logs from the user's
worker binary might be misconstrued as serious when they were largely
benign noise previously ignored (since they were invisible).

On Wed, Aug 16, 2023, 9:57 AM Jack McCluskey via dev 
wrote:

> Hey everyone,
>
> I've written a small design doc around implementing some buffered logging
> for the Beam boot.go scripts that is available at
> https://s.apache.org/beam-buffered-logging. This should help surface
> errors that occur during worker set-up (like issues with dependency
> installation) that tend to be logged improperly at INFO.
>
> Thanks,
>
> Jack McCluskey
>
> --
>
>
> Jack McCluskey
> SWE - DataPLS PLAT/ Dataflow ML
> RDU
> jrmcclus...@google.com
>
>
>


Re: [PROPOSAL] Preparing for 2.50.0 Release

2023-08-16 Thread Robert Burke
Just a status update: Branch is cut and tagged

https://github.com/apache/beam/tree/release-2.50.0
https://github.com/apache/beam/tree/v2.50.0-RC1

I'm working on the remaining bits to have an RC. The GitHub 
build-release-artifacts action failed to 
build and publish the Java artifacts and stage the Docker containers. 

The former says:

Execution failed for task ':sdks:java:io:solr:compileTestJava'.
GC overhead limit exceeded

The latter is due to a partial application of the multi-arch build to the 
GitHub actions, which has already been fixed.

The Dataflow Legacy Java worker and associated containers have been built and 
published, and we apologize for the delay this caused. We're discussing how we 
presently interleave Google internal processes with the release, and how we can 
streamline things now that Dataflow is transitioning to RunnerV2 by default. In 
future releases, we may build the non-portable Dataflow Java workers after the 
first RC is tagged and the open side is on its way. 

The hope is RC1 will be available tonight. Either way, this thread will be 
updated with the status.

Robert Burke
Beam 2.50.0 Release Manager

On 2023/08/14 21:51:47 Robert Burke wrote:
> +1 to what XQ says.
> 
> There will be a voting email thread once I've done the appropriate due
> diligence to the branch, and finish with the Dataflow artifacts.
> 
> Generally speaking, the best validation is something you're using already,
> to make sure that the new version of Beam works for your usage.
> 
> 
> On Mon, Aug 14, 2023, 2:41 PM XQ Hu via dev  wrote:
> 
> > Welcome to the Beam community! Our release managers usually follow this
> > https://beam.apache.org/contribute/release-guide/#10-vote-and-validate-release-candidate
> > to send the votes out and ask for any feedback regarding the release
> > candidate. If you could help run any validation on your side and cast your
> > vote, it would be greatly appreciated and helpful for the community.
> >
> > On Mon, Aug 14, 2023 at 12:23 PM Hong  wrote:
> >
> >> I see, thanks for clarifying, Robert!
> >>
> >> Is there anything I can help with validation? Is there a wiki page with
> >> the expected validations I can help with?
> >>
> >> Best
> >> Hong
> >>
> >> On 14 Aug 2023, at 14:34, Robert Burke  wrote:
> >>
> >> 
> >> The release branch was cut. Before the weekend, I was working on getting
> >> the non-portable Dataflow Java worker built and available before producing
> >> the RC1. The actual building bit doesn't take that long, but there's a
> >> bunch of additional validation that goes along with it.
> >>
> >> The current target date for 2.50.0 is September 13th, but ultimately it's
> >> as soon as we have a validated and voted on RC.
> >>
> >> On Mon, Aug 14, 2023, 3:43 AM Hong Liang  wrote:
> >>
> >>> Thanks for driving this Robert!
> >>>
> >>> It seems the two PRs specified have been merged. A little new to Beam,
> >>> do we have an expected release date for the 2.50 release?
> >>>
> >>> Best,
> >>> Hong
> >>>
> >>> On Thu, Aug 10, 2023 at 3:08 AM Robert Burke 
> >>> wrote:
> >>>
> >>>> I'm in the process of producing the Cut branch, but due to various
> >>>> delays on my part, it will not be cut today.
> >>>>
> >>>> There are two outstanding PRs blocking the cut,
> >>>> https://github.com/apache/beam/pull/27947 and
> >>>> https://github.com/apache/beam/pull/27939, but once those are in, I'll
> >>>> proceed. Remaining new issues will be cherry picked as required.
> >>>>
> >>>> Thanks
> >>>> Robert Burke
> >>>> Beam 2.50.0 Release Manager
> >>>>
> >>>> On 2023/07/26 15:49:37 Robert Burke wrote:
> >>>> > Hey Beam community,
> >>>> >
> >>>> > The next release (2.50.0) branch cut is scheduled on August 9th, 2023,
> >>>> > according to
> >>>> > the release calendar [1].
> >>>> >
> >>>> > I volunteer to perform this release. My plan is to cut the branch on
> >>>> that
> >>>> > date, and cherrypick release-blocking fixes afterwards, if any.
> >>>> >
> >>>> > Please help me make sure the release goes smoothly by:
> >>>> > - Making sure that any unresolved release blocking issues for 2.50.0
> >>>> should
> >>>> > have their "Milestone" marked as "2.50.0 Release" as soon as possible.
> >>>> > - Reviewing the current release blockers [2] and remove the Milestone
> >>>> if
> >>>> > they don't meet the criteria at [3].
> >>>> >
> >>>> > Let me know if you have any comments/objections/questions.
> >>>> >
> >>>> > Thanks,
> >>>> >
> >>>> > Robert Burke (he/him)
> >>>> > Beam Go Busybody
> >>>> >
> >>>> > [1]
> >>>> >
> >>>> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
> >>>> > [2] https://github.com/apache/beam/milestone/14
> >>>> > [3] https://beam.apache.org/contribute/release-blocking/
> >>>> >
> >>>>
> >>>
> 


Re: [PROPOSAL] Preparing for 2.50.0 Release

2023-08-14 Thread Robert Burke
+1 to what XQ says.

There will be a voting email thread once I've done the appropriate due
diligence to the branch, and finish with the Dataflow artifacts.

Generally speaking, the best validation is something you're using already,
to make sure that the new version of Beam works for your usage.


On Mon, Aug 14, 2023, 2:41 PM XQ Hu via dev  wrote:

> Welcome to the Beam community! Our release managers usually follow this
> https://beam.apache.org/contribute/release-guide/#10-vote-and-validate-release-candidate
> to send the votes out and ask for any feedback regarding the release
> candidate. If you could help run any validation on your side and cast your
> vote, it would be greatly appreciated and helpful for the community.
>
> On Mon, Aug 14, 2023 at 12:23 PM Hong  wrote:
>
>> I see, thanks for clarifying, Robert!
>>
>> Is there anything I can help with validation? Is there a wiki page with
>> the expected validations I can help with?
>>
>> Best
>> Hong
>>
>> On 14 Aug 2023, at 14:34, Robert Burke  wrote:
>>
>> 
>> The release branch was cut. Before the weekend, I was working on getting
>> the non-portable Dataflow Java worker built and available before producing
>> the RC1. The actual building bit doesn't take that long, but there's a
>> bunch of additional validation that goes along with it.
>>
>> The current target date for 2.50.0 is September 13th, but ultimately it's
>> as soon as we have a validated and voted on RC.
>>
>> On Mon, Aug 14, 2023, 3:43 AM Hong Liang  wrote:
>>
>>> Thanks for driving this Robert!
>>>
>>> It seems the two PRs specified have been merged. A little new to Beam,
>>> do we have an expected release date for the 2.50 release?
>>>
>>> Best,
>>> Hong
>>>
>>> On Thu, Aug 10, 2023 at 3:08 AM Robert Burke 
>>> wrote:
>>>
>>>> I'm in the process of producing the Cut branch, but due to various
>>>> delays on my part, it will not be cut today.
>>>>
>>>> There are two outstanding PRs blocking the cut,
>>>> https://github.com/apache/beam/pull/27947 and
>>>> https://github.com/apache/beam/pull/27939, but once those are in, I'll
>>>> proceed. Remaining new issues will be cherry picked as required.
>>>>
>>>> Thanks
>>>> Robert Burke
>>>> Beam 2.50.0 Release Manager
>>>>
>>>> On 2023/07/26 15:49:37 Robert Burke wrote:
>>>> > Hey Beam community,
>>>> >
>>>> > The next release (2.50.0) branch cut is scheduled on August 9th, 2023,
>>>> > according to
>>>> > the release calendar [1].
>>>> >
>>>> > I volunteer to perform this release. My plan is to cut the branch on
>>>> that
>>>> > date, and cherrypick release-blocking fixes afterwards, if any.
>>>> >
>>>> > Please help me make sure the release goes smoothly by:
>>>> > - Making sure that any unresolved release blocking issues for 2.50.0
>>>> should
>>>> > have their "Milestone" marked as "2.50.0 Release" as soon as possible.
>>>> > - Reviewing the current release blockers [2] and remove the Milestone
>>>> if
>>>> > they don't meet the criteria at [3].
>>>> >
>>>> > Let me know if you have any comments/objections/questions.
>>>> >
>>>> > Thanks,
>>>> >
>>>> > Robert Burke (he/him)
>>>> > Beam Go Busybody
>>>> >
>>>> > [1]
>>>> >
>>>> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
>>>> > [2] https://github.com/apache/beam/milestone/14
>>>> > [3] https://beam.apache.org/contribute/release-blocking/
>>>> >
>>>>
>>>


Re: [PROPOSAL] Preparing for 2.50.0 Release

2023-08-14 Thread Robert Burke
The release branch was cut. Before the weekend, I was working on getting
the non-portable Dataflow Java worker built and available before producing
the RC1. The actual building bit doesn't take that long, but there's a
bunch of additional validation that goes along with it.

The current target date for 2.50.0 is September 13th, but ultimately it's
as soon as we have a validated and voted on RC.

On Mon, Aug 14, 2023, 3:43 AM Hong Liang  wrote:

> Thanks for driving this Robert!
>
> It seems the two PRs specified have been merged. A little new to Beam, do
> we have an expected release date for the 2.50 release?
>
> Best,
> Hong
>
> On Thu, Aug 10, 2023 at 3:08 AM Robert Burke  wrote:
>
>> I'm in the process of producing the Cut branch, but due to various delays
>> on my part, it will not be cut today.
>>
>> There are two outstanding PRs blocking the cut,
>> https://github.com/apache/beam/pull/27947 and
>> https://github.com/apache/beam/pull/27939, but once those are in, I'll
>> proceed. Remaining new issues will be cherry picked as required.
>>
>> Thanks
>> Robert Burke
>> Beam 2.50.0 Release Manager
>>
>> On 2023/07/26 15:49:37 Robert Burke wrote:
>> > Hey Beam community,
>> >
>> > The next release (2.50.0) branch cut is scheduled on August 9th, 2023,
>> > according to
>> > the release calendar [1].
>> >
>> > I volunteer to perform this release. My plan is to cut the branch on
>> that
>> > date, and cherrypick release-blocking fixes afterwards, if any.
>> >
>> > Please help me make sure the release goes smoothly by:
>> > - Making sure that any unresolved release blocking issues for 2.50.0
>> should
>> > have their "Milestone" marked as "2.50.0 Release" as soon as possible.
>> > - Reviewing the current release blockers [2] and remove the Milestone if
>> > they don't meet the criteria at [3].
>> >
>> > Let me know if you have any comments/objections/questions.
>> >
>> > Thanks,
>> >
>> > Robert Burke (he/him)
>> > Beam Go Busybody
>> >
>> > [1]
>> >
>> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
>> > [2] https://github.com/apache/beam/milestone/14
>> > [3] https://beam.apache.org/contribute/release-blocking/
>> >
>>
>


Re: [PROPOSAL] Preparing for 2.50.0 Release

2023-08-09 Thread Robert Burke
I'm in the process of producing the Cut branch, but due to various delays on my 
part, it will not be cut today.

There are two outstanding PRs blocking the cut, 
https://github.com/apache/beam/pull/27947 and 
https://github.com/apache/beam/pull/27939, but once those are in, I'll proceed. 
Remaining new issues will be cherry picked as required.

Thanks
Robert Burke
Beam 2.50.0 Release Manager

On 2023/07/26 15:49:37 Robert Burke wrote:
> Hey Beam community,
> 
> The next release (2.50.0) branch cut is scheduled on August 9th, 2023,
> according to
> the release calendar [1].
> 
> I volunteer to perform this release. My plan is to cut the branch on that
> date, and cherrypick release-blocking fixes afterwards, if any.
> 
> Please help me make sure the release goes smoothly by:
> - Making sure that any unresolved release blocking issues for 2.50.0 should
> have their "Milestone" marked as "2.50.0 Release" as soon as possible.
> - Reviewing the current release blockers [2] and remove the Milestone if
> they don't meet the criteria at [3].
> 
> Let me know if you have any comments/objections/questions.
> 
> Thanks,
> 
> Robert Burke (he/him)
> Beam Go Busybody
> 
> [1]
> https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
> [2] https://github.com/apache/beam/milestone/14
> [3] https://beam.apache.org/contribute/release-blocking/
> 


Re: [Discuss] Get rid of OWNERS files

2023-08-08 Thread Robert Burke
Either we keep OWNERS and have the review bot use them, or we remove them
and use the review bot config as the single source of truth.

The bot is less likely to go out of date since it's at least active in how
it behaves. I agree it doesn't necessarily solve the problem of things
getting out of date, but short of inactive folks officially and actively
bowing out of the project, I don't know that there's anything we can do.

IMO folks who aren't active but are still getting emails and review
requests should be incentivised to redirect requests to new owners or at
least active members.


On Tue, Aug 8, 2023, 9:13 AM Alexey Romanenko 
wrote:

> I generally agree with this (initially that was a good intention, imho)
> but what could be an alternative for this? Review bot also may assign
> reviewers that are no longer active on the project.
>
> —
> Alexey
>
>
> On 8 Aug 2023, at 16:55, Danny McCormick via dev 
> wrote:
>
> Hey everyone, I'd like to propose getting rid of OWNERS files from the
> Beam repo. Right now, I don't think they are serving a meaningful purpose:
>
> - Many OWNERS files are outdated and point to people who are no longer
> actively involved in the project (examples: 1
> , 2
> , 3
> ,
> there are many more)
> - Many dependencies don't have owners assigned
> - Many major directories function fine without OWNERS files
> - We lack sufficient documentation of what OWNERS files mean (
> https://s.apache.org/beam-owners is not helpful and I couldn't find other
> resources)
> - We now have the review bot to automatically assign reviewers based on
> areas of ownership. That has proven more likely to stay up to date.
>
> Given all of these, I don't see any obvious usefulness for OWNERS files.
> Please chime in if you disagree (or agree). If there are no objections I'll
> assume silent consensus and remove them next week.
>
> Thanks,
> Danny
>
>
>


[PROPOSAL] Preparing for 2.50.0 Release

2023-07-26 Thread Robert Burke
Hey Beam community,

The next release (2.50.0) branch cut is scheduled on August 9th, 2023,
according to
the release calendar [1].

I volunteer to perform this release. My plan is to cut the branch on that
date, and cherrypick release-blocking fixes afterwards, if any.

Please help me make sure the release goes smoothly by:
- Making sure that any unresolved release blocking issues for 2.50.0 should
have their "Milestone" marked as "2.50.0 Release" as soon as possible.
- Reviewing the current release blockers [2] and remove the Milestone if
they don't meet the criteria at [3].

Let me know if you have any comments/objections/questions.

Thanks,

Robert Burke (he/him)
Beam Go Busybody

[1]
https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com
[2] https://github.com/apache/beam/milestone/14
[3] https://beam.apache.org/contribute/release-blocking/


Re: [DISCUSS] Enable Github Discussions?

2023-07-06 Thread Robert Burke
I'm -1 on GitHub discussions.

If anything, a user can just file a GitHub issue for the same purpose, if
they prefer the GitHub interface over email. In theory, GH Discussions can
be better for very active topics, but honestly I don't think we have that
sort of throughput.

Having paid attention to a few uses of GitHub features by the Go programming
language project, I found the Discussions led to *more* repetitive threads
and points, since folks are emboldened to not pay attention outside of
their local thread in the discussion.

I can't see much of a meaningful difference between them and using the
Apache Slack instance, which can also have threaded micro discussions.

On Thu, Jul 6, 2023, 11:07 AM Robert Bradshaw via dev 
wrote:

> I'm also -1 on introducing another forum, and concur with Alexey that
> mailing lists are a (required) deep part of the culture for apache
> projects.
>
> If there's something qualitatively and significantly different about
> discussions that makes it a better fit, I would consider it. (E.g. IMHO the
> structure/format of stack overflow lends itself much better to scalable
> user support than a mailing list, which is why it makes sense to be there.)
> The statement about "folks to get[ting] unblocked on small/medium
> implementation blocker" is important, and we should definitely encourage
> people to more actively use the existing lists for this purpose rather than
> having out-of-band discussions when possible which will be helpful to the
> larger community. (Not seeing how this is unique to GH Discussions though.)
>
> (I'm also skeptical of "GH Discussions is more discoverable and
> approachable for new users and contributors." I definitely think it makes
> sense to meet users where they are, but while I know many developers that
> don't actively use github (some don't even have an account), I don't
> (personally) don't know any that don't have an email address which is a
> good lower common denominator. But maybe that just dates me...)
>
>
>
>
>
>
>
> On Wed, Jul 5, 2023 at 7:22 AM Jack McCluskey via dev 
> wrote:
>
>> Also going to be -1 on this one, I'm not sure we pick anything up from
>> adding a forum apart from adding another place that needs to be checked.
>>
>> On Tue, Jul 4, 2023 at 4:03 AM Jan Lukavský  wrote:
>>
>>> -1
>>>
>>> Totally agree with Byron and Alexey.
>>>
>>>  Jan
>>> On 7/3/23 21:18, Byron Ellis via dev wrote:
>>>
>>> -1. This just leads to needless fragmentation not to mention being at
>>> the mercy of a specific technology provider.
>>>
>>> On Mon, Jul 3, 2023 at 11:39 AM XQ Hu via dev 
>>> wrote:
>>>
 +1 with GH discussion.
 If Airflow can do this https://github.com/apache/airflow/discussions,
 I think we can do this as well.

 On Mon, Jul 3, 2023 at 9:51 AM Alexey Romanenko <
 aromanenko@gmail.com> wrote:

> -1
> I understand that for some people, who maybe are not very familiar
> with ASF and its “Apache Way” [1], it may sound a bit obsolete but mailing
> lists are one of the key things of every ASF project which Apache Beam is.
> Having user@, dev@ and commits@ lists are required for ASF project to
> maintain the open discussions that are publicly accessible and archived in
> the same way for all ASF projects.
>
> I just wanted to remind a key motto at Apache Software Foundation is:
>   *“If it didn't happen on the mailing list, it didn't happen.”*
>
> —
> Alexey
>
> [1] https://apache.org/theapacheway/index.html
>
> On 1 Jul 2023, at 19:54, Anand Inguva via dev 
> wrote:
>
> +1 for GitHub discussions as well. But I am also little concerned
> about multiple places for discussions. As Danny said, if we have a good
> plan on how to move forward on how/when to archive the current mailing
> list, that would be great.
>
> Thanks,
> Anand
>
> On Sat, Jul 1, 2023, 3:21 AM Damon Douglas 
> wrote:
>
>> I'm very strong +1 for replacing the use of Email with GitHub
>> Discussions. Thank you for bringing this up.
>>
>> On Fri, Jun 30, 2023 at 7:38 AM Danny McCormick via dev <
>> dev@beam.apache.org> wrote:
>>
>>> Thanks for starting this discussion!
>>>
>>> I'm a weak -1 for this proposal. While I think that GH Discussions
>>> can be a good forum, I think most of the things that Discussions do are
>>> covered by some combination of the dev/user lists and GitHub issues, and
>>> the net outcome of this will be creating one more forum to pay attention
>>> to. I know in the past we've had a hard time keeping up with Stack 
>>> overflow
>>> questions for a similar reason. With that said, I'm not opposed to 
>>> trying
>>> it out and experimenting as long as we have (a) clear criteria for
>>> understanding if the change is effective or not (can be subjective), 
>>> (b) a
>>> clear idea of when we'd revisit the discussion, and (c) a 

Re: Rust SDK design docs/notes

2023-06-21 Thread Robert Burke
I threw in some Go specific comments and links.

In particular, I needed to call out that we try to avoid the magic
type-synthesis path for types, and casting pointers to functions after
looking them up in the debug symbol table. For Go at least, both had robustness or
performance issues that may or may not be applicable to Rust.

On Wed, Jun 21, 2023, 3:43 AM Steven van Rossum via dev 
wrote:

> Hi all,
>
> Work continues on a Rust SDK at https://github.com/laysakura/beam and
> design docs/notes are being collected at
> https://github.com/laysakura/beam/wiki/Design-docs if anyone wants to
> leave a comment or get engaged in design.
> It's a bit bare bones right now, but we've got a bunch more topics to
> write about based on some of the discussions we've had at the Beam Summit
> last week.
> I'm currently reviewing the notes at https://s.apache.org/a-new-dofn and
> SDK implementations for a DoFn design and will pour that into a doc soon.
>
> Sho has been adding a number of easy to moderate tasks to the issue
> tracker at https://github.com/laysakura/beam/issues if you're looking to
> get involved.
> I'll make sure to leave a few comments there as well based on my TODOs
> from earlier PRs.
>
> Cheers,
>
> Steve
>


Beam Go now has a v2.48.2 release.

2023-06-08 Thread Robert Burke
Hi Beam Dev List!

This is to report on an issue that occurred with the v2.48.0 Go SDK release
and its resolution. While generally poor form, Ritesh, Jack, and I
independently resolved the issue instead of first mailing the dev list
about it. We decided that fixing the error for the Go SDK was better
for the community than delaying such a fix through a discussion and vote.
We do believe the issue is resolved, and there are now sufficient guard
rails against a recurrence.

However, it's still critical we email the community about it, so here it is.

tl;dr:
Due to an error in tagging, the v2.48.0 release was pointing at the wrong
SDK container, which was still the ".dev" version. We had to
add a new Go SDK specific tag of `sdks/v2.48.2` to resolve the issue and
ensure that tag was on the right RC commit.

This was tracked in https://github.com/apache/beam/issues/27064.

The longer story:

This morning a user filed  an issue [0]. Due to Go's unique package release
strategy, it's not possible to simply "move the tag to a new commit", since
the module proxy and similar would already have distributed the previous
versions of the source.  This property enables robust "supply chain"
security, and avoids mismatches or maliciousness.

The only resolution to a bad tag is to release a patch version, which, for
Go, is as simple as adding an appropriate tag. The Go SDK has its own "tag
series" prefixed with "sdks/", since that folder is where the SDK's go.mod
file lives. We judged the cost of the Go SDK version being slightly out of
sync with the mainline version to be acceptable, given that Beam doesn't
presently do patch releases. No other changes were made, to avoid a full
container build. Adding a patch version tag to a working commit would
unbreak the Go SDK release.

The error occurred because, with 2.48.0, the release manager was using the
new GitHub Action to get the RC tags instead of the manual script. The
action itself worked fine and did that job correctly.

Since the RC_TAG variable is unspecified in the release guide [1], the
Release Manager ended up running `git tag -s "sdks/v2.48.0"`, which adds
the tag to the HEAD commit of the current branch instead of to the commit
associated with the RC tag.

So the Release Manager ended up running the command again, leading to the
same result. A bit of investigation showed that it was possible for tags to
get out of sync in the local branch versus what the GitHub action did.
However, this burned the sdks/v2.48.1 tag.

The sync issue was resolved by a `git fetch --all --tags`, and the RC tag
commit was confirmed with `git rev-list ${RC_TAG} -n 1`, leading to the
second fix attempt with v2.48.2, which has resolved the Go SDK issue.

The Release Guide has been updated [2] to make checking this explicit,
though hopefully this step will become obsolete once it's moved to GitHub
Actions. Until then, we may as well avoid the error.

The 2.48.0 release blog and notes have been updated to note the discrepancy
as well.

Thank you for your understanding and time,
Robert Burke
Beam Go Busybody

[0] https://github.com/apache/beam/issues/27064
[1] https://beam.apache.org/contribute/release-guide/#git-tag
[2] https://github.com/apache/beam/pull/27070


Re: Client-Side Throttling in Apache Beam

2023-05-30 Thread Robert Burke
Great article!

Though it's depressing to see we have a pair of magic counter names to help
modulate scaling behavior.

On Tue, May 30, 2023, 11:42 AM Jack McCluskey via dev 
wrote:

> Hey everyone,
>
> While working on some remote model handler code I hit a point where I
> needed to understand how Beam IOs interpret and action on being throttled
> by an external service. This turned into a few discussions and then a small
> write-up doc (
> https://docs.google.com/document/d/1ePorJGZnLbNCmLD9mR7iFYOdPsyDA1rDnTpYnbdrzSU/edit?usp=sharing)
> to encapsulate the basics of what I learned. If you're familiar with this
> topic feel free to make suggestions on the doc, I'm intending to add this
> to the wiki so there's a resource for how this works in the future!
>
> Thanks,
>
> Jack McCluskey
>
> --
>
>
> Jack McCluskey
> SWE - DataPLS PLAT/ Dataflow ML
> RDU
> jrmcclus...@google.com
>
>
>


Re: [Proposal] Automate Release Signing

2023-05-03 Thread Robert Burke
Kenn, I'll pose the question: why would Apache Infra have a supported
path for artifact signing that apparently violates Apache policy?

On Wed, May 3, 2023, 12:24 PM Kenneth Knowles  wrote:

> To clarify: I am in favor of automating what we can. There may be
> flexibility here in that only the source release needs to be signed in this
> way. But I expect this reduces the utility of the automation, as the
> release manager will still have to have a functioning published GPG key.
> Actually it might be clever for us to add this to the committer onboarding
> steps. You can also automatically sign your git commits with it, if you
> like.
>
> Kenn
>
> On Wed, May 3, 2023 at 12:20 PM Kenneth Knowles  wrote:
>
>> I don't think we can do this. Having the release signed by the actual
>> release manager is by design.
>>
>> https://www.apache.org/legal/release-policy.html#release-signing
>>
>> "All supplied packages MUST be cryptographically signed by the Release
>> Manager with a detached signature"
>>
>> Kenn
>>
>> On Wed, May 3, 2023 at 12:14 PM John Casey via dev 
>> wrote:
>>
>>> +1 to this as well.
>>>
>>> On Wed, May 3, 2023 at 3:10 PM Robert Burke  wrote:
>>>
>>>> +1 to simplifying release processes, since it leads to a more
>>>> consistent experience.
>>>>
>>>> If we continue to reduce release overhead we'll be able to react with
>>>> more agility when CVEs come a knocking.
>>>>
>>>> On Wed, May 3, 2023, 12:08 PM Jack McCluskey via dev <
>>>> dev@beam.apache.org> wrote:
>>>>
>>>>> +1 to automating release signing. As it stands now, this step requires
>>>>> a PMC member to add a new release manager's GPG key which can add time to
>>>>> getting a release started. This also results in the public key used to 
>>>>> sign
>>>>> each release changing from one version to the next, as different release
>>>>> managers have different keys. Making releases easier to perform and
>>>>> providing a standard signing key for each release both seem like wins 
>>>>> here.
>>>>>
>>>>> On Wed, May 3, 2023 at 2:40 PM Danny McCormick via dev <
>>>>> dev@beam.apache.org> wrote:
>>>>>
>>>>>> Hey everyone, I'm currently working on improving our release process
>>>>>> so that it's easier and faster for us to release. As part of this work, 
>>>>>> I'd
>>>>>> like to propose automating our release signing step (the push java
>>>>>> artifacts step of build_release_candidate.sh
>>>>>> <https://beam.apache.org/contribute/release-guide/#run-build_release_candidatesh-to-create-a-release-candidate>)
>>>>>> using GitHub Actions.
>>>>>>
>>>>>> To do this, we can follow the guide here
>>>>>> <https://infra.apache.org/release-signing.html#automated-release-signing>
>>>>>>  and
>>>>>> ask the Infra team to add a signing key that we can use to run the
>>>>>> workflow. Basically, the asks would be:
>>>>>>
>>>>>> 1) Add a signing key (and passphrase) as GH Actions Secrets so that
>>>>>> we can sign the artifacts.
>>>>>> 2) Allowlist a GitHub Action (crazy-max/ghaction-import-gpg) to use
>>>>>> the key to sign artifacts.
>>>>>> 3) Add an Apache token (name and password) as GH Actions Secrets so
>>>>>> that we can upload the signed artifacts to Nexus.
>>>>>>
>>>>>> Please let me know if you have any questions or concerns. If nobody
>>>>>> objects or raises more discussion points, I will assume lazy
>>>>>> consensus
>>>>>> <https://community.apache.org/committers/lazyConsensus.html> after
>>>>>> 72 hours.
>>>>>>
>>>>>> Thanks,
>>>>>> Danny
>>>>>>
>>>>>


Re: [Proposal] Automate Release Signing

2023-05-03 Thread Robert Burke
+1 to simplifying release processes, since it leads to a more consistent
experience.

If we continue to reduce release overhead we'll be able to react with more
agility when CVEs come a knocking.
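For reference, the manual step being automated here is a GPG detached signature over each release artifact. A self-contained sketch using a throwaway keyring and a made-up artifact name (a real release signs the actual artifacts with the release manager's published key):

```shell
set -euo pipefail
# Throwaway keyring so this does not touch a real GPG setup.
export GNUPGHOME=$(mktemp -d)
chmod 700 "$GNUPGHOME"

# Generate a passphrase-less demo key (GnuPG >= 2.1).
gpg --batch --pinentry-mode loopback --passphrase '' \
    --quick-generate-key "Demo Signer <demo@example.invalid>" default default never

# Stand-in for a release artifact.
echo "artifact bytes" > demo-artifact.jar

# Detached, ASCII-armored signature -> demo-artifact.jar.asc
gpg --batch --pinentry-mode loopback --passphrase '' \
    --armor --detach-sign demo-artifact.jar

# Anyone with the public key can check the artifact was not altered.
gpg --verify demo-artifact.jar.asc demo-artifact.jar
```

The detached `.asc` file is what gets published next to each artifact; automating the flow means the key lives in CI secrets instead of on each release manager's machine.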

On Wed, May 3, 2023, 12:08 PM Jack McCluskey via dev 
wrote:

> +1 to automating release signing. As it stands now, this step requires a
> PMC member to add a new release manager's GPG key which can add time to
> getting a release started. This also results in the public key used to sign
> each release changing from one version to the next, as different release
> managers have different keys. Making releases easier to perform and
> providing a standard signing key for each release both seem like wins here.
>
> On Wed, May 3, 2023 at 2:40 PM Danny McCormick via dev <
> dev@beam.apache.org> wrote:
>
>> Hey everyone, I'm currently working on improving our release process so
>> that it's easier and faster for us to release. As part of this work, I'd
>> like to propose automating our release signing step (the push java
>> artifacts step of build_release_candidate.sh
>> )
>> using GitHub Actions.
>>
>> To do this, we can follow the guide here
>>  and
>> ask the Infra team to add a signing key that we can use to run the
>> workflow. Basically, the asks would be:
>>
>> 1) Add a signing key (and passphrase) as GH Actions Secrets so that we
>> can sign the artifacts.
>> 2) Allowlist a GitHub Action (crazy-max/ghaction-import-gpg) to use the
>> key to sign artifacts.
>> 3) Add an Apache token (name and password) as GH Actions Secrets so that
>> we can upload the signed artifacts to Nexus.
>>
>> Please let me know if you have any questions or concerns. If nobody
>> objects or raises more discussion points, I will assume lazy consensus
>>  after 72
>> hours.
>>
>> Thanks,
>> Danny
>>
>


Re: [ANNOUNCE] New committer: Damon Douglas

2023-04-24 Thread Robert Burke
Congratulations Damon!!!

On Mon, Apr 24, 2023, 12:52 PM Kenneth Knowles  wrote:

> Hi all,
>
> Please join me and the rest of the Beam PMC in welcoming a new committer:
> Damon Douglas (damondoug...@apache.org)
>
> Damon has contributed widely: Beam Katas, playground, infrastructure, and
> many IO connectors. Damon does lots of code review in addition to code.
> (yes, you can review code as a non-committer!)
>
> Considering their contributions to the project over this timeframe, the
> Beam PMC trusts Damon with the responsibilities of a Beam committer. [1]
>
> Thank you Damon! And we are looking to see more of your contributions!
>
> Kenn, on behalf of the Apache Beam PMC
>
> [1]
>
> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>


Re: [ANNOUNCE] New committer: Anand Inguva

2023-04-21 Thread Robert Burke
Congratulations Anand!

On Fri, Apr 21, 2023, 10:55 AM Danny McCormick via dev 
wrote:

> Woohoo, congrats Anand! This is very well deserved!
>
> On Fri, Apr 21, 2023 at 1:54 PM Chamikara Jayalath 
> wrote:
>
>> Hi all,
>>
>> Please join me and the rest of the Beam PMC in welcoming a new committer: 
>> Anand
>> Inguva (ananding...@apache.org)
>>
>> Anand has been contributing to Apache Beam for more than a year and
>> authored and reviewed more than 100 PRs. Anand has been a core contributor
>> to Beam Python SDK and drove the efforts to support Python 3.10 and Python
>> 3.11.
>>
>> Considering their contributions to the project over this timeframe, the
>> Beam PMC trusts Anand with the responsibilities of a Beam committer. [1]
>>
>> Thank you Anand! And we are looking to see more of your contributions!
>>
>> Cham, on behalf of the Apache Beam PMC
>>
>> [1]
>> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>
>


Re: [DISCUSS] @Experimental, @Internal, @Stable, etc annotations

2023-04-18 Thread Robert Burke
While this thread is beginning to move off topic, I think any real Beam 3.0
effort largely should start with "what we know we're going to keep", and
what else a refined/simplified surface looks like for each SDK. But I'm
sure there's some known things to cut too.

And that's ignoring any real breaking changes that would be healthy to make
for Beam.

But critically, what are the concrete benefits to pipeline authors in such
a move. Otherwise it's self serving churn.



On Tue, Apr 18, 2023, 9:08 AM Alexey Romanenko 
wrote:

>
> On 17 Apr 2023, at 21:14, Robert Burke  wrote:
>
> +1 on how to iterate without a Beam 3.0
>
> Often that just means, write the new thing, "support both for a
> while",make it clear how to migrate to the new thing, and the next Major
> Version just drops everything that doesn't cut the mustard anymore.
>
>
> Exactly! If we all agree with this process of
> adding/deprecating/removing new/old API then I think we need to add it into
> Beam documentation to make it clear for developers and users (if not yet).
>
> The only issue here is that we don’t do Major releases often (v2.0.0 is
> dated 2017-05-17). I think we don’t even have a public roadmap for that, and
> we almost never discussed “what” Beam 3.x should be and, the most important
> question, “when” it will happen (sorry if I missed that).
>
> —
> Alexey
>
>
>
> On Mon, Apr 17, 2023, 11:54 AM Ahmet Altay via dev 
> wrote:
>
>> It sounds like there is agreement in eliminating the
>> experimental annotation. Should we stop using them in new code? Or should
>> we do a pass to remove those annotations?
>>
>> On Mon, Apr 17, 2023 at 11:24 AM Kenneth Knowles  wrote:
>>
>>>
>>>
>>> On Mon, Apr 17, 2023 at 9:34 AM Kerry Donny-Clark via dev <
>>> dev@beam.apache.org> wrote:
>>>
>>>> +1 to eliminating @Experimental as a Beam level annotation.
>>>> I think the main point is that if no one pays attention to such
>>>> annotations, then they are only noise and deliver negative value.
>>>>
>>>
>>> Yes. Consider these two scenarios
>>>
>>> 1. We change an "experimental" API that is widely used. This causes a
>>> pain for many users. We would probably not do it, and we would catch it in
>>> code review.
>>> 2. We change a non-"experimental" API that is fairly new. This applies
>>> to many APIs, since we rarely remember to annotate new APIs. This causes
>>> just minor pain for just a few users. TBH I would be OK with this. Rigidity
>>> in rejecting such changes just means your first draft is your final draft.
>>> Try that in any other endeavor and see how it works for you :-)
>>>
>>> And it is worse than noise - there are some users who do pay attention
>>> to the annotations and are not using things even though they are super
>>> safe. That was the main reason I started this thread. The rest of my
>>> proposal was just to try to recover some flexibility, but it seems too hard
>>> and no immediate consensus on how/if we could manage it.
>>>
>>> Kenn
>>>
>>> PS I do agree with Kerry's PS and would love to have that discussion.
>>> Perhaps separately, since it will start from square one either way. Every
>>> time someone says "Beam 3.0" we should really be thinking "how can we
>>> iterate". One big breaking version change doesn't work.
>>>
>>
>> +1 - Thinking about "How can we iterate" would allow us to build
>> something users' want in shorter timelines.
>>
>>
>>>
>>>
>>>
>>> Kerry
>>>>
>>>> PS- Kenn says " the point about the culture of stagnation came from my
>>>> recent experiences as code reviewer where there was some idea that we
>>>> couldn't change things even when they were plainly wrong and the change was
>>>> plainly a fix." This seems like a major point that deserves a more focused
>>>> discussion.
>>>>
>>>> On Fri, Apr 14, 2023 at 5:47 PM Chamikara Jayalath via dev <
>>>> dev@beam.apache.org> wrote:
>>>>
>>>>> I think we've been using the Java Experimental tags in two ways.
>>>>>
>>>>> * New APIs
>>>>> * Any APIs that use specific features identified by pre-defined
>>>>> experimental Kind types defined in [1] (for example, I/O connectors APIs
>>>>> that use Beam Schemas).
>>>>>
>>>>> Removing the exper

Re: [DISCUSS] @Experimental, @Internal, @Stable, etc annotations

2023-04-17 Thread Robert Burke
+1 on how to iterate without a Beam 3.0

Often that just means, write the new thing, "support both for a while",make
it clear how to migrate to the new thing, and the next Major Version just
drops everything that doesn't cut the mustard anymore.


On Mon, Apr 17, 2023, 11:54 AM Ahmet Altay via dev 
wrote:

> It sounds like there is agreement in eliminating the
> experimental annotation. Should we stop using them in new code? Or should
> we do a pass to remove those annotations?
>
> On Mon, Apr 17, 2023 at 11:24 AM Kenneth Knowles  wrote:
>
>>
>>
>> On Mon, Apr 17, 2023 at 9:34 AM Kerry Donny-Clark via dev <
>> dev@beam.apache.org> wrote:
>>
>>> +1 to eliminating @Experimental as a Beam level annotation.
>>> I think the main point is that if no one pays attention to such
>>> annotations, then they are only noise and deliver negative value.
>>>
>>
>> Yes. Consider these two scenarios
>>
>> 1. We change an "experimental" API that is widely used. This causes a
>> pain for many users. We would probably not do it, and we would catch it in
>> code review.
>> 2. We change a non-"experimental" API that is fairly new. This applies to
>> many APIs, since we rarely remember to annotate new APIs. This causes just
>> minor pain for just a few users. TBH I would be OK with this. Rigidity in
>> rejecting such changes just means your first draft is your final draft. Try
>> that in any other endeavor and see how it works for you :-)
>>
>> And it is worse than noise - there are some users who do pay attention to
>> the annotations and are not using things even though they are super safe.
>> That was the main reason I started this thread. The rest of my proposal was
>> just to try to recover some flexibility, but it seems too hard and no
>> immediate consensus on how/if we could manage it.
>>
>> Kenn
>>
>> PS I do agree with Kerry's PS and would love to have that discussion.
>> Perhaps separately, since it will start from square one either way. Every
>> time someone says "Beam 3.0" we should really be thinking "how can we
>> iterate". One big breaking version change doesn't work.
>>
>
> +1 - Thinking about "How can we iterate" would allow us to build something
> users want in shorter timelines.
>
>
>>
>>
>>
>> Kerry
>>>
>>> PS- Kenn says " the point about the culture of stagnation came from my
>>> recent experiences as code reviewer where there was some idea that we
>>> couldn't change things even when they were plainly wrong and the change was
>>> plainly a fix." This seems like a major point that deserves a more focused
>>> discussion.
>>>
>>> On Fri, Apr 14, 2023 at 5:47 PM Chamikara Jayalath via dev <
>>> dev@beam.apache.org> wrote:
>>>
 I think we've been using the Java Experimental tags in two ways.

 * New APIs
 * Any APIs that use specific features identified by pre-defined
 experimental Kind types defined in [1] (for example, I/O connectors APIs
 that use Beam Schemas).

 Removing the experimental tag has the effect of finalizing a number of
 APIs we've been reluctant to call stable (for example, Beam Schemas,
 portability, metrics related APIs). These APIs have been around for a long
 time and I don't see them changing so probably this is the right thing to
 do. But I just wanted to call it out.

 Thanks,
 Cham

 [1]
 https://github.com/apache/beam/blob/b9f27f9da2e63b564feecaeb593d7b12783192b0/sdks/java/core/src/main/java/org/apache/beam/sdk/annotations/Experimental.java#L48

 On Fri, Apr 14, 2023 at 1:26 PM Ahmet Altay via dev <
 dev@beam.apache.org> wrote:

>
>
> On Fri, Apr 14, 2023 at 1:15 PM Kenneth Knowles 
> wrote:
>
>>
>> Thanks for the discussion. Many good points. Probably just removing
>> all the annotations is a noop to users, and will solve the "afraid to use
>> experimental features" problem.
>>
>> Regarding stability, the capabilities of Java (and Python is much
>> much worse) make it infeasible to produce quality software with the rule
>> "once it is public it is frozen forever". But on the other hand, there
>> isn't much of a practical alternative. Most projects just make breaking
>> changes at minor releases quite often, in my experience. I don't want to
>> follow that pattern, for sure.
>>
>> Regarding Danny's comment of not seeing this culture - check out any
>> of our more mature IOs, which all have very high cyclomatic complexity 
>> due
>> to never being significantly refactored. Adhering to in-place state
>> compatibility for update instead of focusing on blue/green deployment is
>> also a culprit here. I don't have examples to mind, but the point about 
>> the
>> culture of stagnation came from my recent experiences as code
>> reviewer where there was some idea that we couldn't change things even 
>> when
>> they were plainly wrong and the change was plainly a fix.
>>
>> Often, it 

Re: Jenkins Flakes

2023-04-11 Thread Robert Burke
The coverage issue is specific to the Java builds.

Go and Python have their codecov coverage uploads done in GitHub
Actions instead.

On Tue, Apr 11, 2023, 8:14 AM Moritz Mack  wrote:

> Thanks so much for looking into this!
>
> I’m absolutely +1 for removing Jenkins related friction and the proposed
> changes sound legitimate.
>
>
>
> Also, considering the number of flaky tests in general [1], code coverage
> might not be the pressing issue. Should it be disabled everywhere in favor
> of more reliable / faster builds? Unless devs here are willing to commit to
> taking action, it doesn’t seem to provide much value to record these
> numbers as part of the normal precommit jobs?
>
>
>
> Kind regards,
>
> Moritz
>
>
>
> [1]
> https://github.com/apache/beam/issues?q=is%3Aissue+is%3Aopen+label%3Aflake
>
>
>
> On 11.04.23, 16:24, "Danny McCormick via dev"  wrote:
>
>
>
>
> *;tldr - I want to temporarily reduce the number of builds that we retain
> to reduce pressure on Jenkins*
>
>
>
> Hey everyone, over the past few days our Jenkins runs have been
> particularly flaky across the board, with errors like the following showing
> up all over the place [1]:
>
>
>
> java.nio.file.FileSystemException: 
> /home/jenkins/jenkins-home/jobs/beam_PreCommit_Python_Phrase/builds/3352/changelog.xml:
>  No space left on device [2]
>
>
>
> These errors indicate that we're out of space on the Jenkins master node.
> After some digging (thanks @Yi Hu  @Ahmet Altay
>  and @Bruno Volpato  for
> contributing), we've determined that at least one large contributing issue
> is that some of our builds are eating up too much space. For example, our
> beam_PreCommit_Java_Commit build is taking up 28GB of space by itself (this
> is just one example).
>
>
>
> @Yi Hu  found one change around code coverage that is
> likely heavily contributing to the problem and rolled that back [3]. We can
> continue to find other contributing factors here.
>
>
>
> In the meantime, to get us back to healthy *I propose that we reduce the
> number of builds that we are retaining to 40 for all jobs that are using a
> large amount of storage (>5GB)*. This will hopefully allow us to return
> Jenkins to a normal functioning state, though it will do so at the cost of
> a significant amount of build history (right now, for example,
> beam_PreCommit_Java_Commit is at 400 retained builds). We could restore the
> normal retention limit once the underlying problem is resolved. Given that
> this is irreversible (and not guaranteed to work), I wanted to gather
> feedback before doing this. Personally, I rarely use builds that old, but
> others may feel differently.
>
>
>
> Please let me know if you have any objections or support for this proposal.
>
>
>
> Thanks,
>
> Danny
>
>
>
> [1] Tracking issue: https://github.com/apache/beam/issues/26197
> 
>
> [2] Example run with this error:
> https://ci-beam.apache.org/job/beam_PreCommit_Python_Phrase/3352/console
> 
>
> [3] Rollback PR: https://github.com/apache/beam/pull/26199
> 
>
>


Re: Jenkins Flakes

2023-04-11 Thread Robert Burke
+1

SGTM

Remember, if an issue is being investigated, a committer can always mark a
build to be retained longer in the Jenkins UI. Just be sure to clean it up
once it's resolved though.

(TBH there may also be some old retained builds like that, but I doubt
there's a good way to see which are still relevant.)

On Tue, Apr 11, 2023, 8:03 AM Yi Hu via dev  wrote:

> +1 Thanks Danny for figuring out a solution.
>
> Best,
> Yi
>
> On Tue, Apr 11, 2023 at 10:56 AM Svetak Sundhar via dev <
> dev@beam.apache.org> wrote:
>
>> +1 to the proposal.
>>
>> Regarding the "(and not guaranteed to work)" part, is the resolution that
>> the memory issues may still persist and we restore the normal retention
>> limit (and we look for another fix), or that we never restore back to the
>> normal retention limit?
>>
>>
>> Svetak Sundhar
>>
>>   Technical Solutions Engineer, Data
>> s vetaksund...@google.com
>>
>>
>>
>> On Tue, Apr 11, 2023 at 10:34 AM Jack McCluskey via dev <
>> dev@beam.apache.org> wrote:
>>
>>> +1 for getting Jenkins back into a happier state, getting release
>>> blockers resolved ahead of building an RC has been severely hindered by
>>> Jenkins not picking up tests or running them properly.
>>>
>>> On Tue, Apr 11, 2023 at 10:24 AM Danny McCormick via dev <
>>> dev@beam.apache.org> wrote:
>>>
 *;tldr - I want to temporarily reduce the number of builds that we
 retain to reduce pressure on Jenkins*

 Hey everyone, over the past few days our Jenkins runs have been
 particularly flaky across the board, with errors like the following showing
 up all over the place [1]:

 java.nio.file.FileSystemException: 
 /home/jenkins/jenkins-home/jobs/beam_PreCommit_Python_Phrase/builds/3352/changelog.xml:
  No space left on device [2]


 These errors indicate that we're out of space on the Jenkins master
 node. After some digging (thanks @Yi Hu  @Ahmet Altay
  and @Bruno Volpato  for
 contributing), we've determined that at least one large contributing issue
 is that some of our builds are eating up too much space. For example, our
 beam_PreCommit_Java_Commit build is taking up 28GB of space by itself (this
 is just one example).

 @Yi Hu  found one change around code coverage that
 is likely heavily contributing to the problem and rolled that back [3]. We
 can continue to find other contributing factors here.

 In the meantime, to get us back to healthy *I propose that we reduce
 the number of builds that we are retaining to 40 for all jobs that are
 using a large amount of storage (>5GB)*. This will hopefully allow us
 to return Jenkins to a normal functioning state, though it will do so at
 the cost of a significant amount of build history (right now, for example,
 beam_PreCommit_Java_Commit is at 400 retained builds). We could restore the
 normal retention limit once the underlying problem is resolved. Given that
 this is irreversible (and not guaranteed to work), I wanted to gather
 feedback before doing this. Personally, I rarely use builds that old, but
 others may feel differently.

 Please let me know if you have any objections or support for this
 proposal.

 Thanks,
 Danny

 [1] Tracking issue: https://github.com/apache/beam/issues/26197
 [2] Example run with this error:
 https://ci-beam.apache.org/job/beam_PreCommit_Python_Phrase/3352/console
 [3] Rollback PR: https://github.com/apache/beam/pull/26199

>>>


Re: [DISCUSS] @Experimental, @Internal, @Stable, etc annotations

2023-03-31 Thread Robert Burke
I've been thinking a similar thing for the Go SDK for a while, though Go
doesn't have annotations in the way Java does.

I agree that explicitly Stable/ default Evolving is probably the best way
to go. Even for SDK devs, it's better the default isn't something that
needs to be remembered to be added, and then removed at a later date,
rather than promoting it to "this will continue to work the same way
(modulo bug fixes)".

The organic growth of the SDKs and their various packages doesn't necessarily
make it clear what's intended for end users vs SDK internal use. It would
be a fair amount of work to fix that in any v3, but this would be very
clear work in that direction since a v3 could become a clean up and
reorganization of anything that's deemed stable, while making that clear
structurally for end users.

On Fri, Mar 31, 2023, 2:05 PM Kenneth Knowles  wrote:

> Hi all,
>
> Long ago, we adopted two annotations in Beam to communicate to users:
>
>  - `@Experimental` indicates that an API might change
>  - `@Internal` indicates that an API is not meant for users.
>
> I've seen some real problems with this approach:
>
>  - Users are afraid to use `@Experimental` APIs, because they are worried
> they are not production-ready. But it really just means they might change,
> and has nothing to do with that.
>  - People write new code and do not put `@Experimental` annotations on it,
> even though it really should be able to change for a while, so we can do a
> good job.
>  - I'm seeing a culture of being afraid to change things, even when it
> would be good for users, because our API surface area is far too large and
> not explicitly chosen.
>  - `@Internal` is not that well-known. And now we have many target
> audiences: Beam devs, PTransform devs, tool devs, pipeline authors. Some of
> them probably want to use `@Internal` stuff!
>
> I looked at a couple sibling projects and what they have
>  - Flink:
>  - Spark:
>
> They have many more tags, and some of them seem to have reverse defaults
> to Beam.
>
> Flink:
> https://github.com/apache/flink/tree/master/flink-annotations/src/main/java/org/apache/flink/annotation
>
>  - Experimental
>  - Internal.java
>  - Public
>  - PublicEvolving
>  - VisibleForTesting
>
> Spark:
> https://github.com/apache/spark/tree/master/common/tags/src/main/java/org/apache/spark/annotation
>  and
> https://github.com/apache/spark/tree/master/common/tags/src/main/scala/org/apache/spark/annotation
>
>  - AlphaComponent
>  - DeveloperApi
>  - Evolving
>  - Experimental
>  - Private
>  - Stable
>  - Unstable
>  - Since
>
> I think it would help users to understand Beam with some simple, though
> possibly large-scale changes. My goal would be:
>
>  - new code is changeable/evolving by default (so we don't have to always
> remember to annotate it) but users have confidence they can use it in
> production (because we have good software engineering practices)
>  - Experimental would be reserved for more risky things
>  - after we are confident an API is stable, because it has been the same
> across a couple releases, we mark it
>
> A concrete proposal to achieve this would be:
>
>  - Add a @Stable annotation and use it as appropriate on our primary APIs
>  - [Possibly] add an @Evolving annotation that would also be the default.
>  - Remove most `@Experimental` annotations or change them to `@Evolving`
>  - Communicate about this (somehow). If possible, surface the `@Evolving`
> default in documentation.
>
> The last bit is the hardest.
>
> Kenn
>
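
[For illustration only: a minimal Java sketch of how annotations like the
proposed `@Stable` and `@Evolving` could be declared and checked. The names
follow the proposal above, but the retention policy, targets, and the
reflective check are assumptions of this sketch, not Beam's actual
definitions.]

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Sketch only: retention and target sets here are assumptions. RUNTIME
// retention lets a doc generator or lint check inspect them reflectively.
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.TYPE, ElementType.METHOD, ElementType.FIELD})
@interface Stable {}

@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.TYPE, ElementType.METHOD, ElementType.FIELD})
@interface Evolving {}

// Example usage: a long-stable primary API vs. a newer, still-evolving one.
// Under the proposal, @Evolving would be the implicit default for new code.
@Stable
class StableApi {}

@Evolving
class NewApi {}

public class AnnotationSketch {
  public static void main(String[] args) {
    // Tooling could surface these markers in generated documentation.
    System.out.println(StableApi.class.isAnnotationPresent(Stable.class)); // true
    System.out.println(NewApi.class.isAnnotationPresent(Evolving.class));  // true
  }
}
```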


[Proposal] State and Timer Composites (Go SDK)

2023-03-20 Thread Robert Burke
Hi everyone,

I have a proposal of an approach we could take for the Go SDK WRT State and
Timers, that might be of interest.

No changes are required to the FnAPI, as this is an SDK level proposal.

Proposal w/ example: https://github.com/apache/beam/issues/25894

The short version is to enable users (or beam contributors) to produce
better abstractions around state and timers, instead of requiring direct
use of the primitives. Higher level components would permit easier re-use
of common patterns that would otherwise require manual replication in a
user's DoFn.

This is currently Go SDK specific, as we're finishing off adding
timer support there, and would require a small non-breaking change on how
state is handled to enable.

Please take a look, and let me know what you think (here or on the issue).

Robert Burke
Beam Go Busybody


Re: direct runner OOM issue

2023-03-13 Thread Robert Burke
Which direct runner? They are language specific.

On Mon, Mar 13, 2023, 11:27 AM wilsonny...@gmail.com 
wrote:

> Hi guys,
>
> We are trying to run our pipeline using direct runner and the input
> dataset is a large amount of HDFS files (few hundred of GB data)
>
> We experienced OOM issue crash. Then inside the direct runner document, I
> realized direct runner loads the whole dataset into the memory.
>
> Is there any way we can avoid this OOM issue?
>
> Regards
>
> -
>
> Wilson(Xiaoshuang) Wang
> Sr. Software Engineer
>


Re: [DISCUSS] Provide MultimapUserStateHandler interface in StateRequestHandlers

2023-02-24 Thread Robert Burke
The runners should be able to support Multimap User State portably over the
FnApi already.

https://github.com/apache/beam/blob/master/model/fn-execution/src/main/proto/org/apache/beam/model/fn_execution/v1/beam_fn_api.proto#L937

How that's supported on each SDK is a different matter though.


On Fri, Feb 24, 2023, 12:57 PM Alan Zhang  wrote:

> Appreciate it if anyone can help confirm and share thoughts.
>
> On Wed, Feb 22, 2023 at 11:46 PM Alan Zhang  wrote:
>
>> Hi Beam devs.
>>
>> According to the Fn State API design doc[1], the state type
>> MultimapUserState is intended for supporting MapState/SetState. And the
>> implementation[2] for this state type is ready on the SDK harness side.
>> Each runner will be responsible for integrating it if they want to leverage
>> it.
>>
>> Today Beam uses StateRequestHandlers to define handler interfaces for
>> other state types, e.g. MultimapSideInputHandler for
>> MultimapSideInput, BagUserStateHandler for BagUserState, etc.[3] This is
>> great since each runner can implement these handler interfaces then the Fn
>> state API integration is done.
>>
>> In order to support MapState/SetState, I think we will need to provide
>> a MultimapUserStateHandler interface in StateRequestHandlers and allow the
>> runners to implement it.
>>
>> What do you think?
>>
>> Feel free to correct me if there is any incorrect understanding since I'm
>> new to the Beam world.
>>
>> Btw, I saw Flink Python used MultimapSideInput to support MapState[4] but
>> I think this is not recommended since MultimapUserState is available today.
>> But please correct me if I'm wrong.
>>
>>
>> [1] https://s.apache.org/beam-fn-state-api-and-bundle-processing
>> 
>> [2] https://github.com/apache/beam/pull/15238
>> [3]
>> https://github.com/apache/beam/blob/master/runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/state/StateRequestHandlers.java#L192
>> [4]
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-153%3A+Support+state+access+in+Python+DataStream+API
>> --
>> Thanks,
>> Alan
>>
>
>
> --
> Thanks,
> Alan
>
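
[For concreteness, one possible shape of the handler interface being
proposed, with a trivial in-memory implementation. This is an illustrative
sketch only: Beam's `StateRequestHandlers` does not define this interface as
of this thread, and all names, type parameters, and method signatures here
are assumptions, not the actual Beam API.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical interface a runner could implement to serve
// MultimapUserState requests (names/signatures are assumptions).
interface MultimapUserStateHandler<K, V> {
  Iterable<K> keys();                    // enumerate keys in the multimap state
  Iterable<V> get(K key);                // read the values stored for one key
  void append(K key, Iterable<V> values);
  void remove(K key);
  void clear();
}

// A trivial in-memory implementation, e.g. for runner tests.
class InMemoryMultimapUserStateHandler<K, V>
    implements MultimapUserStateHandler<K, V> {
  private final Map<K, List<V>> state = new HashMap<>();

  public Iterable<K> keys() { return state.keySet(); }

  public Iterable<V> get(K key) {
    return state.getOrDefault(key, Collections.emptyList());
  }

  public void append(K key, Iterable<V> values) {
    List<V> list = state.computeIfAbsent(key, k -> new ArrayList<>());
    for (V v : values) { list.add(v); }
  }

  public void remove(K key) { state.remove(key); }

  public void clear() { state.clear(); }
}

public class MultimapSketch {
  public static void main(String[] args) {
    MultimapUserStateHandler<String, Integer> h =
        new InMemoryMultimapUserStateHandler<>();
    h.append("a", Arrays.asList(1, 2));
    System.out.println(h.get("a")); // prints [1, 2]
  }
}
```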


Re: [Go SDK] Direct Runner Replacement: Prism

2023-02-21 Thread Robert Burke
It's now all in! Thank you for following. Big thanks to Johanna, Jack, and 
Ritesh for doing reviews of the PRs.

My current plan is to not think about it for a bit, and resolve the blocker 
from adding Timers to the SDK. 

After that, the priorities are focused on resolving the Side Input memory leak 
that exists, and completing support for features that exist in the SDK, but 
aren't testable with the existing direct runner. Also moving it to being the 
default runner of the SDK (in either 2.47 or 2.48).

Next feature support includes State and Timers, but critically includes 
Supporting Cross Language transforms from Java and Python.

That in turn, unblocks the other SDKs being able to use it at all.

Appropriately tagged issues will be filed in the next week to allow others to 
contribute without duplicating work. Even at ~8k lines, it's possible
there may be merge conflicts though. :)

Thanks again, and let's keep making Beam easier to use!

Robert Burke
Beam Go Busybody

On 2023/02/20 23:30:06 Robert Burke wrote:
> Johanna managed to take a look at the penultimate PR, so that's now in, and
> finally, here's the last one.
> 
> https://github.com/apache/beam/pull/25568 - Add in the execution stack
> connecting preprocessing and execution. Includes all pipeline execution
> unit tests.
> 
> On Mon, 20 Feb 2023 at 10:26, Robert Burke  wrote:
> 
> > We're in the home stretch! Only 2 PRs to go, and then Prism will be
> > available for use from the Beam Repo, hopefully with the 2.46 cut. Also, it
> > would be available for others to improve and meaningfully extend.
> >
> > https://github.com/apache/beam/pull/25565 - Adds in the primary element
> > manager. This is where watermark handling and bundling decisions are made
> > for stages.
> >
> > Once that PR is in, there's the execution handling which puts everything
> > together, and the entry point, and the suite of pipeline execution tests.
> >
> > Robert Burke
> > Beam Go Busybody
> >
> > On Sun, 19 Feb 2023 at 18:25, Robert Burke  wrote:
> >
> >> I had to scratch the itch this weekend, and now have Prism able to pay
> >> attention to Estimated Output Watermarks, and send elements downstream if
> >> the transform is not blocked
> >> by watermarks somewhere. Woohoo! Is it fully correct?  Probably not, but
> >> it's a start.
> >>
> >> Transforms are still executed one at a time, but this is currently a
> >> matter of implementation, rather than inability. Changing that
> >> implementation will occur with progress tracking and split request
> >> handling, which will be required by the separation harness tests.
> >>
> >> All data is presently cached and stored indefinitely in memory which
> >> prevents indefinite runs of a fully unbounded pipeline. Primary inputs
> >> consume & garbage collect data as the pipeline advances, but since side
> >> inputs are re-used they use their own thing, and the system isn't aware
> >> when it can be safely garbage collected.
> >>
> >> Big Everything PR is at https://github.com/apache/beam/pull/25391
> >>
> >> Next up PRs:
> >>
> >> https://github.com/apache/beam/pull/25556 - Remaining initial job
> >> services.
> >> https://github.com/apache/beam/pull/25557 - Test DoFns for later use in
> >> pipelines.
> >> https://github.com/apache/beam/pull/25558 - Handling graph
> >> transformations for Combine & SDF composites, executing Flattens and GBKs
> >>
> >> On Thu, 16 Feb 2023 at 15:02, Robert Burke  wrote:
> >>
> >>> Next up:
> >>>
> >>> https://github.com/apache/beam/pull/25478 - Large PR for initial
> >>> handling of worker FnAPI surfaces.
> >>> https://github.com/apache/beam/pull/25518 - Tiny PR for handling basic
> >>> windowing strategies.
> >>> https://github.com/apache/beam/pull/25520 - Medium PR for adding the
> >>> graph preprocessor scaffolding
> >>>
> >>>
> >>> On 2023/02/15 05:41:51 Robert Burke via dev wrote:
> >>> > Here are the next two chunks!
> >>> >
> >>> > https://github.com/apache/beam/pull/25476 - Coder / element / bytes
> >>> > handling internally for prism.
> >>> > https://github.com/apache/beam/pull/25478 - Worker fnAPI handling.
> >>> >
> >>> > Took a bit to get a baseline of unit testing in for these, since they
> >>> were
> >>> > covered by whole pipeline runs.
> >>> > Coders in particular, since they currently live in the packa

Re: new contributor messaging: behaviorbot/welcome

2023-02-21 Thread Robert Burke
I agree that the bot is better than nothing at all.

+1 to getting a PR with messaging out for review.

On Tue, Feb 21, 2023, 5:29 PM Robert Bradshaw via dev 
wrote:

> FWIW, I'm generally in favor of such a bot. I think it really boils
> down to a concrete proposal of what the content (and triggers) would
> be.
>
> On Tue, Feb 21, 2023 at 1:36 PM Austin Bennett
>  wrote:
> >
> > It is fantastic if generally able to address welcoming newcomers
> manually [ @Robert Burke ! ] .  Community communication, human connection [
> ex: community > code ] ideal!!  In this particular case, I imagine
> automation does not contradict - nor detract from - the manual/human touch.
> >
> > As shared, the very specific use case I had in mind was to support -->
> https://news.apache.org/foundation/entry/the-asf-launches-firstasfcontribution-campaign
> ...  I wanted to send a message thanking for someone's first PR merge, and
> encourage them to fill out the form ( while that campaign is active.  In
> that case, I did imagine a static [ meaning hardcoded, non-changing ]
> message that prompts them at the moment that they make their real first
> code contribution [ as it gets merged ], since that would be most relevant
> and immediate feedback.
> >
> > If we think it's overkill, no problem either.  If there's an issue with
> choosing a bot vs. a GH Action, I can also spend time creating a custom GH
> Action that accommodates that.  But that might not be worthwhile if the
> discussed use case isn't functionality we even want as part of the project.
> >
> > On Tue, Feb 21, 2023 at 12:28 PM Robert Bradshaw 
> wrote:
> >>
> >> On Tue, Feb 21, 2023 at 10:59 AM Kenneth Knowles 
> wrote:
> >> >
> >> > Agree with Robert here. The human connection is important. Can we
> have a behaviorbot that reminds the reviewer to be extra welcoming up
> front, and then thankful afterwards, instead? :-)
> >>
> >> +1
> >>
> >> > That said, a bot comment would at least state our intention of being
> welcoming and grateful, even if we then do not live up to it perfectly. It
> isn't very different than having it in the PR template or
> https://beam.apache.org/contribute/ or CONTRIBUTING.md which GitHub
> presents to first time contributors. I tend to favor static text that can
> be referred to over dynamic text posted by code in special circumstances.
> But I think hitting this from all angles, for different sorts of people in
> the world, is fine, if the maintenance burden is very low (which it appears
> to be)
> >>
> >> I think the primary value in such a bot is to set expectations/inform
> >> the contributor of something they might not know but is relevant to
> >> their action. Otherwise, I am more in favor of static text somewhere
> >> they're sure to encounter it (and there are benefits to doing it
> >> before they create a PR, e.g. as part of a template, rather than
> >> after).
> >>
> >>
> >> > On Tue, Feb 21, 2023 at 10:01 AM Robert Burke 
> wrote:
> >> >>
> >> >> I can't speak for all committers but I'm always aware when it's
> someone's first time contributing to beam (the First Time Contributor badge
> is instrumental here), and manually thank them and welcome them to Beam.
> >> >>
> >> >> Seems more meaningful for the merging committer to do it rather than
> an automated process.
> >> >>
> >> >> Maybe I just have bad experiences with automated phone trees.
> >> >>
> >> >> On Tue, Feb 21, 2023, 9:02 AM Danny McCormick via dev <
> dev@beam.apache.org> wrote:
> >> >>>
> >> >>> If the merge message is a key part of this then I'm fine using
> behaviorbot (though I think a PMC member would need to install it; I don't
> have the right permission set).
> >> >>>
> >> >>> > I'd also be happy to leverage first-interaction for everything it
> can do, and only use welcome-bot for the things that aren't met elsewhere [
> also happy to eventually remove welcome-bot, ex: after that ASF campaign or
> once a suitable off-the-shelf replacement comes along ]
> >> >>>
> >> >>> I don't think we should do this, there's not really a benefit
> gained if we're still using welcome-bot.
> >> >>>
> >> >>> > @Danny McCormick - any idea whether there is another tool that
> can help with messaging on first-pr-merge that we'd be more happy with [ I
> can search around some if that's the path ]?
> >> >>>
> >> >>>
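
For anyone weighing the off-the-shelf option: a minimal sketch of what the
actions/first-interaction approach could look like as a workflow. Note that it
fires when a contributor's first issue or PR is *opened*, not on first-pr-merge
(the gap discussed above), and the trigger choice and message text here are
placeholder assumptions, not Beam's actual configuration:

```yaml
# Sketch of a welcome workflow using the off-the-shelf
# actions/first-interaction action. The messages below are
# placeholders; the action only runs on a user's first issue/PR.
name: Welcome first-time contributors

on: [pull_request, issues]

jobs:
  welcome:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/first-interaction@v1
        with:
          repo-token: ${{ secrets.GITHUB_TOKEN }}
          issue-message: >
            Thanks for opening your first issue here - welcome to Beam!
          pr-message: >
            Thanks for your first pull request! See
            https://beam.apache.org/contribute/ for tips on the review
            process; a committer will take a look shortly.
```

Messaging on first-pr-*merge* would still need either behaviorbot or a custom
Action triggered on `pull_request` `closed` events with the `merged` flag set.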
