echauchot commented on issue #23576: [SPARK-26655] [SS] Support multiple 
aggregates in append mode
URL: https://github.com/apache/spark/pull/23576#issuecomment-525208412
 
 
   Hi @HeartSaVioR, thanks for your feedback. Regarding late data with Beam, 
indeed, when an element arrives behind the watermark but within the allowed 
lateness, it delays window closing, so an element that arrives in that range 
counts in the output data. If it arrives after the allowed lateness, it is 
dropped. Regarding output mode, most Beam runners (Spark, for example) support 
discarding output, in which elements from different windows are independent 
and previous states are dropped. It seems very similar to Spark append mode.
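
To make the late-data rule above concrete, here is a minimal sketch in plain Python. It is not Beam or Spark code: the fixed-window assignment, the function names, and the exact boundary comparisons are assumptions made for illustration only.

```python
def window_end(event_ts, window_size):
    """End of the fixed window that contains event_ts (hypothetical helper)."""
    return (event_ts // window_size + 1) * window_size


def classify(event_ts, watermark, window_size, allowed_lateness):
    """Decide what happens to an element given the watermark at arrival.

    Simplified model: boundary handling (<= vs <) is an assumption.
    """
    end = window_end(event_ts, window_size)
    if watermark <= end:
        # Window has not been passed by the watermark yet.
        return "on_time"
    if watermark <= end + allowed_lateness:
        # Behind the watermark but within allowed lateness: still counted.
        return "late_but_kept"
    # Beyond allowed lateness: the element is dropped.
    return "dropped"


if __name__ == "__main__":
    for wm in (8, 12, 16):
        print(wm, classify(event_ts=7, watermark=wm,
                           window_size=10, allowed_lateness=5))
```

With a window size of 10 and an allowed lateness of 5, an element with event time 7 belongs to the window ending at 10, so it is kept as long as the watermark has not passed 15.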
   
   Regarding event time, indeed, Beam forbids modifying it, and there is an 
event time per element, the same way you suggest an event time per row in 
Spark. Our answer to the join-join stream watermark question is: we take the 
minimum of the output watermarks of the previous stages. But that is because 
the Beam watermark is based on the minimum event timestamp seen. Also, 
stateful operators do not change the event timestamp; nobody does. That is 
why we defined input and output watermarks to introduce this delay.
   => That is another way of solving the problem, but I agree, it requires a 
good amount of change in the Spark model to introduce input/output watermarks 
and per-operator ones.
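
As a rough illustration of the input/output watermark idea, here is a plain-Python sketch. It is not Beam's implementation: the function names are hypothetical, and modelling the delay as "the output watermark is held back by the oldest timestamp still buffered in state" is an assumption of this sketch.

```python
def input_watermark(upstream_output_watermarks):
    """Input watermark of a stage: minimum of the upstream output watermarks."""
    return min(upstream_output_watermarks)


def output_watermark(input_wm, buffered_timestamps):
    """Output watermark of a stateful stage.

    Assumption: it cannot advance past the oldest element still held
    in state, which is what introduces the delay between input and output.
    """
    return min([input_wm, *buffered_timestamps])


if __name__ == "__main__":
    # A join fed by two stages whose output watermarks are 100 and 80.
    join_in = input_watermark([100, 80])
    # The join still buffers rows with event times 70 and 90.
    join_out = output_watermark(join_in, [70, 90])
    print(join_in, join_out)
```

Event timestamps are never rewritten here; only the per-operator watermarks differ, which is the part that the Spark model would need to gain.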
   
