Re: [DESIGN] Timers, Watermark Holds, Loops, Batch and Drain

Kenneth Knowles Fri, 25 Apr 2025 07:44:39 -0700

I love using one WITH_EXTENSIONS bit and then adding a proto. Especially
for full WindowedValue where there is already ~10-12 bytes overhead, it
doesn't make sense to create our own extensible encoding protocol.


 - I think we can actually do this without a breaking change, too. Which
part were you thinking would be a breaking change?
 - Regarding "being a pain" to change, I think adding fields to
WindowedValue and adding fields to the extension *should* be about the same
code difficulty. In both cases the new fields must be optional and have
nulls or defaults. Putting it in a new object makes sense, though, since
it'll be encoded separately with proto, which is way easier and more robust.

Anyhow +100 to this new directly for the idea. There are even more
extensions that I've been thinking about and this makes it very flexible.

Kenn

On Fri, Apr 25, 2025 at 9:17 AM Radek Stankiewicz <[email protected]>
wrote:

> hi Kenn,
> I would add option 3
> Option 3:  Use current way of using tag, add drain in additional proto
> object
>
>    -
>
>    The tag & 0b0111 indicates whether it is using FIRST, ONE_INDEX or
>    TWO_INDICES encoding and future options if needed.
>    -
>
>    tag & 0b10000 ("WITH_EXTENSION" ) indicates there is extension.
>    -
>
>    If extension is set, an additional proto object is passed.
>    -
>
>    Proto provides natural way of versioning and allows adding more fields
>    in the future
>    -
>
>       No extension bit unspecified:
>       -
>
>          Drain mode unspecified
>          -
>
>          Tracing Context is null
>          -
>
>       Extension bit set:
>       -
>
>          Drain mode can be set
>          -
>
>          Tracing context and trace state can be set
>          -
>
>       Extension bit set, some future fields are added:
>       -
>
>          etc.
>
> The Extension object within WindowedValue (WV) shouldn’t be accessed
> directly by the user, ideally when new field is added, e.g. drainMode, WV
> should have added getter getDrainMode() and access directly enum from
> extension
> public WindowedValueExtensions.DrainMode getDrainMode(){
>       return extensions.getDrain();
>     }
>
> To set the value drain mode:
> windowedElem=
> windowedElem.withExtension(windowedElem.getExtensions().toBuilder().setDrain(WindowedValueExtensions.DrainMode.
> *DRAIN_MODE_DRAINING*).build());
> or:
> public WindowedValue<T> withDrainMode(DrainMode dm){
>       return
> this.withExtension(windowedElem.getExtensions().toBuilder().setDrain(dm));
> }
>
> Adding new fields to WindowedValue is a pain as it requires multiple
> changes all over the place, so doing it once with extensions field should
> simplify future fields.
>
> There may be situations where due to performance reasons storing value in
> extension is not optimal - e.g. tracing context as object is parsed array
> of strings but for interoperability it is usually propagated as two strings
> (traceparent, tracestate). In such case:
>
>    -
>
>    WV should have added field traceContext of type Context
>    -
>
>    WV should have added getter getTraceContext() and access field directly
>    -
>
>    WV should have added withTraceContext() method to construct WV with
>    previous fields
>    -
>
>    When sending WV over the wire, the WV coder should encode traceContext
>    into extensions object
>    -
>
>    When receiving WV over the wire, the WV coder should decode extension
>    and construct optimal traceContext if traceparent and tracestate exists in
>    extension.
>
>
> Summary
> Pros:
>
>    - We are no longer limited with Pane bits as a way of extending
>    WindowedValue
>    - There is no overhead if none of the extended fields is set.
>
> Cons:
>
>    - Memory overhead on keeping additional deserialized objects (here,
>    trace context).
>    - Confusion where is the field (wv.getContext()  or
>    wv.getExtensions().getTraceParent())
>    - Breaking change
>
>
>
> On Wed, Apr 2, 2025 at 11:17 PM Kenneth Knowles <[email protected]> wrote:
>
>> Just to organize my thoughts and to refresh myself on how this would
>> integrate with each SDK, I wrote this short follow-up on the API, PaneInfo
>> encoding, and proto changes. I'm especially interested in feedback from
>> people familiar with various SDKs about mistakes I've made in that SDK and
>> the most idiomatic ways to represent this.
>>
>>    https://s.apache.org/beam-drain-mode
>>
>> Kenn
>>
>> On Tue, Mar 18, 2025 at 11:42 AM Kenneth Knowles <[email protected]> wrote:
>>
>>> Thanks for all the detailed insightful comments! I've made some edits to
>>> reflect your insights and also settled on what I think is the most feasible
>>> plan, considering effort and backwards-compatibility.
>>>
>>> The work is actually now quite small (yay!). I've added these at the
>>> bottom of the doc:
>>>
>>>  - Add "is drain" bit for aggregations
>>>  - Add "is drain" bit for timer callbacks
>>>  - Fire processing time timers instantly during drain (likely getting
>>> dropped because their window is expired)
>>>  - Immediately drop already-expired processing time timers (since they
>>> are doomed to be dropped when they fire)
>>>
>>> This is a set of backwards-compatible changes consistent with "option 1"
>>> change proposal in the doc.
>>>
>>> Let me know if you think this is not enough or too much.
>>>
>>> Kenn
>>>
>>> On Wed, Mar 12, 2025 at 6:33 PM Jan Lukavský <[email protected]> wrote:
>>>
>>>> Hi Kenn,
>>>>
>>>> thanks for putting this down on paper. This is great initiative as it
>>>> might touch some core parts of the model we are actually somewhat circling
>>>> around. I left some comments and I'm looking forward to the broad
>>>> discussion this definitely deserves.
>>>>
>>>>  Jan
>>>> On 3/11/25 15:46, Kenneth Knowles wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I've spent some time on the rather hairy details of timers, batch,
>>>> drain, and timer loops lately. We have some inconsistencies in this area
>>>> that regularly bite users.
>>>>
>>>> I've written up what I *think* are the desired semantics. some notes
>>>> about the current status of implementation, and a few proposals for how we
>>>> can improve things.
>>>>
>>>> https://s.apache.org/beam-timers-and-drain
>>>>
>>>> Please take a look! This stuff is hard and I could really use as many
>>>> smart pairs of eyes on this as possible.
>>>>
>>>> Kenn
>>>>
>>>>

Re: [DESIGN] Timers, Watermark Holds, Loops, Batch and Drain

Reply via email to