[PROPOSAL] Upgrade vendor grpc

2023-03-29 Thread Yi Hu via dev
Hi all,

I would like to volunteer to upgrade the Beam vendored grpc, as
requested by the GitHub Issue [1]. I checked the project history and found
that we did four upgrades in the last 2 years (1.26->1.36->1.43->1.48), and
the last one was in Aug 2022 [2]. Vulnerabilities have been found in its
dependencies since then (see [1]).

My plan is to follow the release process [3, 4], which involves preparing
for the release, building a candidate, voting and finalizing the release.
The vendored artifact is then targeted for integration in Beam v2.48.0
onwards (cut date May 17, 2023).

Please let me know if you have any comments/objections/questions.

Thanks,

Yi

[1] https://github.com/apache/beam/issues/25746
[2] https://github.com/apache/beam/pull/22628
[3] https://github.com/apache/beam/tree/master/vendor
[4] https://docs.google.com/document/d/1ztEoyGkqq9ie5riQxRtMuBu3vb6BUO91mSMn1PU0pDA/edit#heading=h.vhcuqlttpnog

-- 

Yi Hu, (he/him/his)

Software Engineer


Re: [DESIGN] Beam Triggered side input specification

2023-03-29 Thread Jan Lukavský
> Well yes it was (though as mentioned before, the fact that none of 
these designs were even written into the spec is a problem), though in 
some ways not a great one. The only global synchronization method we had 
was the watermark/end of window, so if the source PCollection was 
triggered by something else we lost that. This creates some unfortunate 
situations (in particular I would not recommend using distributed 
Map-valued side inputs with an early trigger - the behavior is probably 
not what one expects). Part of the problem is that triggers themselves 
are non-deterministic. Something like retractions would make this better 
but not completely. Something better here would be great, but I'm still 
not sure what it would be or if any of our runners could implement it.


Yes, the problem is due to the fact that triggers fire at 
non-deterministic event times, because most of Beam's current triggers 
are either processing-time triggers or data-driven triggers. We could 
obtain the same behavior as for the end-of-window trigger with 
event-time triggers (the EOW trigger and GC trigger are AFAIK the only 
event-time triggers we currently have). These might be useful on their 
own, but would also require somewhat more complicated logic in GBK 
(splitting the window into panes, holding state for each pane 
independently, merging state for accumulating triggers, ...).


  Jan



On 3/28/23 17:26, Reuven Lax via dev wrote:



On Tue, Mar 28, 2023 at 12:39 AM Jan Lukavský  wrote:


On 3/27/23 19:44, Reuven Lax via dev wrote:



On Mon, Mar 27, 2023 at 5:43 AM Jan Lukavský  wrote:

Hi,

I'd like to clarify my understanding. Side inputs generally
perform a left (outer) join: the LHS is the main input, the RHS
is the side input.


Not completely - it's more of what I would call a nested-loop
join. I.e. if the side input changes _nothing happens_ until a
new element arrives on the LHS. This isn't quite the same as a
left-outer join.

+1. This makes sense, my description was a slight simplification.
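
To illustrate the distinction, here is a minimal sketch in plain Python (not Beam API; `SideInputView` and `replay` are hypothetical names made up for this example). It assumes a single worker replaying an interleaved stream of events. A side input update produces no output by itself; output appears only when a main-input element arrives and reads whatever side value is visible at that moment.

```python
class SideInputView:
    """Holds the latest side-input value (hypothetical helper, not Beam API)."""

    def __init__(self):
        self.value = None

    def update(self, value):
        # Updating the side input emits nothing on its own.
        self.value = value


def replay(events):
    """Replay events of the form ("side", value) or ("main", element).

    Output is produced only for "main" events, paired with the side
    value visible when the element arrives.
    """
    view = SideInputView()
    outputs = []
    for kind, payload in events:
        if kind == "side":
            view.update(payload)
        else:
            outputs.append((payload, view.value))
    return outputs


events = [
    ("side", "v1"),
    ("main", "a"),
    ("side", "v2"),  # never observed: no main element arrives while it is current
    ("side", "v3"),
    ("main", "b"),
]
result = replay(events)
print(result)
```

In this toy model the value "v2" never shows up in the output, which is where the behavior differs from a left outer join that would react to changes on the RHS.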


Doing a streaming left join requires watermark synchronization,
thus elements from the main input are buffered while
main_input_timestamp > side_input_watermark. When the side input
watermark reaches max watermark, main inputs do not have to
be buffered because the side input will not change anymore.
This works well for cases when the side input is bounded or
in the case of "slowly changing" patterns (ideally with
perfect watermarks, so no late data present).
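
A rough sketch of this watermark-synchronized buffering, in plain Python rather than Beam API. The class `WatermarkSyncBuffer` and the constant `MAX_WATERMARK` are hypothetical names for this example, and the release rule is an assumption: an element is held while its timestamp is ahead of the side input watermark and released once the watermark has caught up.

```python
import heapq

MAX_WATERMARK = float("inf")  # stand-in for "side input will not change anymore"


class WatermarkSyncBuffer:
    """Toy buffer that holds main-input elements behind a side-input watermark."""

    def __init__(self):
        self.side_watermark = float("-inf")
        self._heap = []  # entries are (timestamp, sequence number, element)
        self._seq = 0    # tie-breaker so elements are never compared directly

    def add(self, timestamp, element):
        """Offer a main-input element; return the elements released right away."""
        if timestamp <= self.side_watermark:
            return [element]
        heapq.heappush(self._heap, (timestamp, self._seq, element))
        self._seq += 1
        return []

    def advance_side_watermark(self, watermark):
        """Advance the side-input watermark; return the elements it releases."""
        self.side_watermark = max(self.side_watermark, watermark)
        released = []
        while self._heap and self._heap[0][0] <= self.side_watermark:
            released.append(heapq.heappop(self._heap)[2])
        return released


buf = WatermarkSyncBuffer()
r1 = buf.add(5, "a")                            # buffered
r2 = buf.add(3, "b")                            # buffered
r3 = buf.advance_side_watermark(4)              # releases "b"
r4 = buf.advance_side_watermark(MAX_WATERMARK)  # releases "a"
r5 = buf.add(100, "c")                          # passes straight through
```

Once the watermark reaches `MAX_WATERMARK`, nothing is buffered anymore, which matches the bounded side input case described above.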


This is true for non-triggered side inputs. Triggered side inputs
have always been different - the main-input elements are buffered
until the first triggered value of the side input is available.

I walked again through the code in
SimplePushBackSideInputDoFnRunner and it looks like this is correct:
the runner actually does not wait for the watermark, but for "ready
windows", which implies what you say. With a suitable trigger
(AfterWatermark.pastEndOfWindow()) this coincides with the
watermark passing the end of the window.
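
The "ready windows" behavior can be modeled roughly as follows. This is a toy in plain Python and not the actual SimplePushBackSideInputDoFnRunner logic; the class `PushBackModel` and its methods are hypothetical. It assumes that the first firing of the side input for a window makes that window ready, and that later firings only replace the visible value.

```python
from collections import defaultdict


class PushBackModel:
    """Toy model: buffer main-input elements until their side-input window is ready."""

    def __init__(self):
        self.ready = {}                       # window -> latest fired side value
        self.pushed_back = defaultdict(list)  # window -> buffered main elements
        self.outputs = []

    def process_element(self, window, element):
        if window in self.ready:
            self.outputs.append((element, self.ready[window]))
        else:
            # Side input has not fired for this window yet: push back.
            self.pushed_back[window].append(element)

    def side_input_fired(self, window, value):
        # Any firing (early or on-time) makes the window ready.
        self.ready[window] = value
        for element in self.pushed_back.pop(window, []):
            self.outputs.append((element, value))


runner = PushBackModel()
runner.process_element("w1", "a")        # buffered
runner.process_element("w1", "b")        # buffered
runner.side_input_fired("w1", "early")   # releases "a" and "b"
runner.process_element("w1", "c")        # processed immediately
runner.side_input_fired("w1", "final")   # replaces the visible value
runner.process_element("w1", "d")
print(runner.outputs)
```

Here "a", "b" and "c" see the early pane and only "d" sees the final one, so the result depends on when the trigger fired relative to element arrival.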


Allowing arbitrary changes in the side input (with arbitrary
triggers) might introduce additional questions - how to
handle late data in the side input? A full implementation would
require retractions. Dropping late data does not feel like a
solution, because then the pipeline would not converge to the
"correct" solution, as the side input might hold an incorrect
value forever. Applying late data as of the processing time at
which the DoFn receives it could make the downstream processing
unstable: restarting the pipeline on errors might change what
is "on time" and what is late, and thus generate inconsistent
results.

BTW, triggered side inputs have always been available. The
problem Kenn is addressing is that nobody has ever written down
the spec! There was a spec in mind when they were implemented,
but the fact that this was not written has always been
problematic (and especially so when creating the portable runner).

Triggered side inputs have always had some
non-deterministic behavior, not just for late data. Side inputs
are cached locally on the reader, so different reading workers
might have different views on what the latest trigger was.
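
A small sketch of this effect in plain Python (not Beam API; `Worker` and the pane names are invented for the example). It assumes each worker keeps its own cached copy of the latest fired pane and refreshes it independently of the others.

```python
class Worker:
    """Toy reading worker with a locally cached side-input value."""

    def __init__(self, name):
        self.name = name
        self.cached = None

    def refresh(self, published):
        # Pick up the most recent pane visible at refresh time.
        self.cached = published[-1]

    def process(self, element):
        return (self.name, element, self.cached)


published = ["pane-0"]
w1, w2 = Worker("w1"), Worker("w2")
w1.refresh(published)
w2.refresh(published)

published.append("pane-1")  # the side input fires again
w1.refresh(published)       # only w1 has refreshed its cache so far

out = [w1.process("x"), w2.process("x")]
print(out)
```

The same main-input element "x" is paired with different panes depending on which worker handles it, even though no late data is involved.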

Makes sense, is this a design decision? I can imagine that waiting
for the side input watermark unconditionally adds latency; on the
other hand, "unexpected" non-deterministic behavior can confuse
users. This type of non-determinism after pipeline failure and
recovery is among the hardest to debug. If we documented (and
possibly slightly reimplemented) the triggered side-input spec,
could we add an (optional) way to make the processing deterministic
via watermark sync?


Well yes it was (though as mentioned before, the fact that 

A Message from the Board to PMC members

2023-03-29 Thread Rich Bowen
Dear Apache Project Management Committee (PMC) members,

The Board wants to take just a moment of your time to communicate a few
things that seem to have been forgotten by a number of PMC members,
across the Foundation, over the past few years.  Please note that this
is being sent to all projects - yours has not been singled out.

The Project Management Committee (PMC) as a whole[1] is tasked with the
oversight, health, and sustainability of the project. The PMC members
are responsible collectively, and individually, for ensuring that the
project operates in a way that is in line with ASF philosophy, and in a
way that serves the developers and users of the project.

The PMC Chair is not the project leader, in any sense. It is the person
who files board reports and makes sure they are delivered on time. It
is the secretary for the project, and the project’s ambassador to the
Board of Directors. The VP title is given as an artifact of US
corporate law, and not because the PMC Chair has any special powers. If
you are treating your PMC Chair as the project lead, or granting them
any other special powers or privileges, you need to be aware that
that’s not the intent of the Chair role. The Chair is a PMC member peer
with a few extra duties.

Every PMC member has an equal voice in deliberations. Each has one
vote. Each has veto power. Every vote weighs the same. It is not only
your right, but it is your obligation, to use that vote for the good of
the project and its users, not to appease the Chair, your employer, or
any other voice in the project. 

Every PMC member can, and should, nominate new committers, and new PMC
members. This is not the sole domain of the PMC Chair. This might be
your most important responsibility to the project, as succession
planning is the path to sustainability.

Every PMC member can, and should, respond when the Board sends email to
your private list. You should not wait for the PMC Chair to respond.
The Board views the entire PMC as responsible for the project, not just
one member.

Every PMC member should be subscribed to the private@ mailing list. If
you are not, then you are neglecting your duty of oversight. If you no
longer wish to be responsible for oversight of the project, you should
resign your PMC seat, not merely drop off of the private@ list and
ignore it. You can determine which PMC members are not subscribed to
your private list by looking at your PMC roster at
https://whimsy.apache.org/roster/committee/  Names with an asterisk (*)
next to them are not subscribed to the list. We encourage you to take a
moment to contact them with this information.

Thank you for your attention to these matters, and thank you for
keeping our projects healthy.

Rich, for The Board of Directors

[1] https://apache.org/foundation/how-it-works.html#pmc-members