echauchot commented on issue #23576: [SPARK-26655] [SS] Support multiple 
aggregates in append mode
URL: https://github.com/apache/spark/pull/23576#issuecomment-525638792
 
 
   > > Regarding output mode, most Beam runners (Spark, for example) support 
discarding output, in which elements from different windows are independent and 
previous states are dropped.
   > 
   > I'm not sure I understand it correctly. The point of Append mode is that 
output for a specific key (the key need not be windowed, but should include an 
"event time" column) is provided only once in any case (orthogonal to fault 
tolerance; this doesn't mean "exactly-once" here), regardless of allowed 
lateness, with no case of "upsert". If Beam doesn't close the window when the 
watermark passes its end (but has not yet passed the allowed lateness) and 
instead triggers the window and emits its output so far (so output could be 
emitted multiple times), that is not compatible with Spark's Append mode.
   > 
   
   Beam does not trigger output until the watermark passes the end of the 
window + allowed lateness. There is no triggering between the end of the window 
and the end of allowed lateness. The window is closed and its output is emitted 
at the same time.
   
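   
   To make the firing rule above concrete, here is a minimal sketch in plain 
Python. It is not Beam code: `Window`, `should_fire` and `run` are hypothetical 
names, timestamps are plain integers, and "passes" is assumed to mean 
`watermark >= end + allowed_lateness`. It only models the behaviour described 
in this comment, where each window produces output exactly once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    start: int  # inclusive event-time bound (hypothetical integer clock)
    end: int    # exclusive event-time bound


def should_fire(window, watermark, allowed_lateness):
    """True once the watermark has passed end of window + allowed lateness.

    Assumption: "passed" is modelled as >=.
    """
    return watermark >= window.end + allowed_lateness


def run(windows, watermarks, allowed_lateness):
    """Replay watermark advances and return (watermark, window) firings.

    Each window fires at most once: closing and emitting happen together,
    so nothing is emitted between end of window and end of allowed lateness.
    """
    fired = set()
    firings = []
    for wm in watermarks:
        for w in windows:
            if w not in fired and should_fire(w, wm, allowed_lateness):
                fired.add(w)
                firings.append((wm, w))
    return firings
```

   With a window `[0, 10)` and an allowed lateness of 5, advancing the 
watermark through 5, 10, 12, 15, 20 yields a single firing, at 15. That single 
emission per window is the property Append mode relies on.
   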
   > A stream-stream join has to decide which "event time" to take even if we 
change the way event time is stored, as there are two rows being joined. How 
does Beam decide the "event time" for a new record produced from two records? 
With column-based event time (current Spark), it would be hard to choose the 
"min" or "max" of event time, as which column to pick as event time has to be 
decided in the query planning phase.
   
   Ah, I thought we were talking about the watermark. For choosing the event 
timestamp, Beam uses a TimestampCombiner, whose default policy is to set the 
resulting timestamp of the new record to the end of the window.
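   
   As a rough illustration of that policy choice, the sketch below models it in 
plain Python rather than with the Beam API. The policy names mirror Beam's 
`TimestampCombiner` values (`EARLIEST`, `LATEST`, `END_OF_WINDOW`), but the 
`Combiner` enum and `output_timestamp` function are hypothetical. It also 
assumes integer timestamps and that "end of the window" means the window's 
maximum timestamp, i.e. its exclusive end minus one unit.

```python
from enum import Enum


class Combiner(Enum):
    # Names mirror Beam's TimestampCombiner policies; this enum is illustrative only.
    EARLIEST = "earliest"
    LATEST = "latest"
    END_OF_WINDOW = "end_of_window"


def output_timestamp(input_timestamps, window_end, combiner=Combiner.END_OF_WINDOW):
    """Pick the event timestamp of a record combined from several inputs.

    window_end is the exclusive end of the window; END_OF_WINDOW is assumed
    to resolve to the last timestamp inside the window (window_end - 1).
    """
    if combiner is Combiner.EARLIEST:
        return min(input_timestamps)
    if combiner is Combiner.LATEST:
        return max(input_timestamps)
    return window_end - 1
```

   For two joined rows with timestamps 3 and 7 in a window ending at 10, the 
default policy ignores both input timestamps and derives the result from the 
window alone, so there is no "min" versus "max" choice to make per column.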
   
   

