echauchot commented on issue #23576: [SPARK-26655] [SS] Support multiple aggregates in append mode
URL: https://github.com/apache/spark/pull/23576#issuecomment-525208412

Hi @HeartSaVioR, thanks for your feedback.

Regarding late data in Beam: indeed, when an element arrives behind the watermark but within the allowed lateness, it delays window closing, so an element that arrives in that range counts in the output data. If it arrives after the allowed lateness, it is dropped.

Regarding output mode: most Beam runners (Spark, for example) support discarding output, in which elements from different windows are independent and previous states are dropped. It seems very similar to Spark append mode.

Regarding event time: indeed, Beam forbids modifying it, and there is an event time per element, the same way you suggest an event time per row in Spark.

Also, our answer to the join-join stream watermark is that we take the minimum of the output watermarks of the previous stages. But that is because the Beam watermark is based on the minimum event timestamp seen. Also, stateful operators do not change the event timestamp; nobody does. That is why we defined input and output watermarks to introduce this delay.

=> That is another way of solving the problem, but I agree that it requires a good amount of change in the Spark model to introduce input/output watermarks and per-operator ones.
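
To make the two rules above concrete, here is a minimal sketch in plain Python. It does not use the Beam or Spark APIs; `Window`, `classify` and `downstream_input_watermark` are hypothetical names, timestamps are plain integers, and the exact boundary conditions (strict versus non-strict comparisons) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Window:
    start: int  # inclusive start, in event-time units
    end: int    # exclusive end, in event-time units


def classify(element_ts: int, window: Window, watermark: int,
             allowed_lateness: int) -> str:
    """Decide what happens to an element, given the current watermark.

    Assumed rule: the element is on time while the watermark has not passed
    the end of its window, late but still counted while the watermark is
    within the allowed lateness, and dropped afterwards.
    """
    if not (window.start <= element_ts < window.end):
        raise ValueError("element does not belong to this window")
    if watermark < window.end:
        return "on_time"
    if watermark < window.end + allowed_lateness:
        return "late_accepted"
    return "dropped"


def downstream_input_watermark(upstream_output_watermarks: Iterable[int]) -> int:
    """Input watermark of a stage = minimum output watermark of its inputs."""
    return min(upstream_output_watermarks)


if __name__ == "__main__":
    w = Window(start=0, end=10)
    print(classify(5, w, watermark=8, allowed_lateness=5))
    print(classify(5, w, watermark=12, allowed_lateness=5))
    print(classify(5, w, watermark=15, allowed_lateness=5))
    print(downstream_input_watermark([12, 7, 9]))
```

Taking the minimum is what lets a join wait for its slowest input: the downstream stage never considers a window complete before every upstream stage does.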
