lostluck commented on issue #23963:
URL: https://github.com/apache/beam/issues/23963#issuecomment-1312299776

   This is an unfortunate oversight.
   
   Right now the current SDK behavior is incorrect (a consequence of being 
heavily tested under Global Window batch, and this behavior added pre robust 
windowing). Basically what's going wrong is that we don't know a good event 
time for the subsequently emitted batch. In general, we can't* from a framework 
perspective, as it's user data specific. 
   
   The work around would be to use emitters of the style func(beam.EventTime,V) 
or func(beam.EventTime, K, V) in both ProcessElement and FinishBundle while 
also requesting the existing element's beam.EventTime. This, in combination 
with storing a reasonable event time for the batch, would be a workaround to 
the windowing issue. The windowing would also need to be downstream of this 
DoFn to ensure windows.
   
   ----
   
   So the SDK would need to prevent this bad behavior going forward (failing on 
strict modes for it, providing a descriptive warning and instructions for 
fixing it as well). The SDK analysis portion would need to allow a slight 
mismatch in emitters, so the StartBundle/FinishBundle emitters are required to 
have an explicit event time, and not have to match the ProcessElement emitter 
exactly. This forces setting explicit event times, allowing for the correct 
windowing behavior.
   
   The event time thing is the big sticking point since if the elements don't 
have an event time, that means any *subsequent* windowing will have issues as 
well. So to permit any existing broken code to continue to function, it would 
need to analyze if there's a down stream issue,
   
   Having a robust GroupIntoBatches transform would be able to avoid this (have 
the GIB transform do the grouping, and handling the event time management), 
since it would then handle this logic for you. This is already on my "sooner", 
rather than "later" list. (I can't be more precise than that).
   
   ----
   
   *Technically CombineFn's have the framework make a specific choice for it's 
equivalent behavior here (what is a Lifted CombineFn, but a DoFn with fancy 
batching after all?), but we simply use the most default strategy for beam.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to