On 2/21/24 17:52, Robert Bradshaw via dev wrote:
On Wed, Feb 21, 2024 at 12:48 AM Jan Lukavský <je...@seznam.cz> wrote:
Hi,

I have left a note regarding the proposed splitting of the batch and
streaming expansions of this transform. In general, the need for such a
split raises doubts for me. It signals that either

   a) the transform does something it should not, or

   b) the Beam model is not complete in terms of being "unified"

The problem that is described in the document is that in the batch case
timers are not fired appropriately.
+1. The underlying flaw is that processing time timers are not handled
correctly in batch, but should be (even if it means keeping workers
idle?). We should fix this.

This is actually one of the
motivations that led to the introduction of the @RequiresTimeSortedInput
annotation and, though this was raised as a question years ago, I do not
remember what arguments were used against making timestamp-sorted input
a requirement of the model for batch stateful DoFns. That would enable
the appropriate firing of timers while preserving the batch invariant
that no late data is allowed. IIRC there are runners that do this
sorting by default (at least the sorting; I am not sure about the
timers, but once inputs are sorted, firing timers is simple).
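
For concreteness, here is a minimal sketch (Java SDK; the class, state,
and timer names are all illustrative) of the kind of stateful DoFn this
would help, where an event-time timer can only fire at the right points
in batch once the input arrives sorted by timestamp:

    import org.apache.beam.sdk.state.TimeDomain;
    import org.apache.beam.sdk.state.Timer;
    import org.apache.beam.sdk.state.TimerSpec;
    import org.apache.beam.sdk.state.TimerSpecs;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;
    import org.joda.time.Duration;

    class SortedTimerFn extends DoFn<KV<String, Long>, String> {
      @TimerId("flush")
      private final TimerSpec flushSpec =
          TimerSpecs.timer(TimeDomain.EVENT_TIME);

      // Requesting time-sorted input means the runner delivers elements
      // in timestamp order, so firing this timer between elements in
      // batch matches the streaming behavior.
      @RequiresTimeSortedInput
      @ProcessElement
      public void process(ProcessContext ctx, @TimerId("flush") Timer flush) {
        flush.set(ctx.timestamp().plus(Duration.standardMinutes(1)));
      }

      @OnTimer("flush")
      public void onFlush(OnTimerContext ctx) {
        ctx.output("flushed at " + ctx.timestamp());
      }
    }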

A different question is whether this particular transform should fire
on processing time rather than event time.
Yeah, I was reading all of these as processing time. Throttling by
event time doesn't make much sense.
Yeah, I was not aware that we do not trigger processing-time timers in
the batch case, which is why I was under the wrong impression that the
document was talking about event-time timers, which really makes little
sense. I think it should generally be possible to correctly fire both
event-time and processing-time timers in the batch case.
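
To make the processing-time variant concrete, here is a hypothetical
sketch (Java SDK; the buffering strategy, names, and one-second interval
are my assumptions, not the design doc's) of throttling with a
processing-time timer that buffers elements in state and releases them
when the timer fires:

    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.state.BagState;
    import org.apache.beam.sdk.state.StateSpec;
    import org.apache.beam.sdk.state.StateSpecs;
    import org.apache.beam.sdk.state.TimeDomain;
    import org.apache.beam.sdk.state.Timer;
    import org.apache.beam.sdk.state.TimerSpec;
    import org.apache.beam.sdk.state.TimerSpecs;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.values.KV;
    import org.joda.time.Duration;

    class ThrottleFn extends DoFn<KV<String, String>, String> {
      @StateId("buffer")
      private final StateSpec<BagState<String>> bufferSpec =
          StateSpecs.bag(StringUtf8Coder.of());

      @TimerId("release")
      private final TimerSpec releaseSpec =
          TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

      @ProcessElement
      public void process(
          @Element KV<String, String> element,
          @StateId("buffer") BagState<String> buffer,
          @TimerId("release") Timer release) {
        buffer.add(element.getValue());
        // This is the processing-time timer that, per the discussion
        // above, is not triggered in the batch case today.
        release.offset(Duration.standardSeconds(1)).setRelative();
      }

      @OnTimer("release")
      public void onRelease(
          @StateId("buffer") BagState<String> buffer,
          OutputReceiver<String> out) {
        for (String value : buffer.read()) {
          out.output(value);
        }
        buffer.clear();
      }
    }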

On 2/21/24 03:00, Robert Burke wrote:
Thanks for the design, Damon! And thanks for collaborating with me on
getting a high-level textual description of the key implementation idea
down in writing. I think the solution is pretty elegant.

I do have concerns about how different Runners might handle
ProcessContinuations for the bounded-input case. I know Dataflow
famously has two different execution modes under the hood, but I agree
with the principle that the actual delay after
ProcessContinuation.Resume should largely be in line with the requested
delay, though AFAIK it's by no means guaranteed.
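
For reference, the runner-facing contract in question looks roughly
like this (a sketch only, not the design doc's actual code;
quotaAvailable() and the restriction size are placeholders):

    import org.apache.beam.sdk.io.range.OffsetRange;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;
    import org.joda.time.Duration;

    class ThrottledSdf extends DoFn<String, String> {
      @GetInitialRestriction
      public OffsetRange initialRestriction(@Element String element) {
        return new OffsetRange(0, 10);  // placeholder restriction
      }

      @ProcessElement
      public ProcessContinuation process(
          @Element String element,
          RestrictionTracker<OffsetRange, Long> tracker,
          OutputReceiver<String> out) {
        for (long pos = tracker.currentRestriction().getFrom(); ; pos++) {
          if (!quotaAvailable()) {
            // Ask the runner to re-invoke us after roughly this delay.
            // As noted above, runners may not honor the delay exactly.
            return ProcessContinuation.resume()
                .withResumeDelay(Duration.standardSeconds(5));
          }
          if (!tracker.tryClaim(pos)) {
            return ProcessContinuation.stop();
          }
          out.output(element + ":" + pos);
        }
      }

      // Placeholder for a real quota or rate check.
      private boolean quotaAvailable() {
        return true;
      }
    }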

We should also ensure this is linked from https://s.apache.org/beam-design-docs 
if not already.

Robert Burke
Beam Go Busybody

On 2024/02/20 14:00:00 Damon Douglas wrote:
Hello Everyone,

The following describes a Throttle PTransform that limits element
throughput to minimize overuse of downstream APIs. Thank you for reading
and for your valuable input.

https://s.apache.org/beam-throttle-transform
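
(A purely hypothetical usage sketch to make the intent concrete; the
actual API, names, and parameters are whatever the doc above defines,
not these:)

    // Hypothetically: cap throughput at ~100 elements per second
    // before the downstream API call.
    PCollection<String> throttled =
        requests.apply(Throttle.of(100, Duration.standardSeconds(1)));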

Best,

Damon
