viirya commented on pull request #35521:
URL: https://github.com/apache/spark/pull/35521#issuecomment-1042730217


   > We have to think carefully about "when" we can safely evict the state row. 
Suppose duplicate input rows keep arriving, each with a timestamp. With 
event-time semantics and a watermark gap we can deduplicate a group of input 
rows, but once the state row is evicted we will produce a new output and put 
it into state, which makes the output non-deterministic. How long the events 
are deduplicated depends on the event time of the first event.
   
   This is a good point. Even with a watermark, couldn't we also update the 
event time in the state store to the maximum event time among the duplicate 
rows seen so far? The difference would then be what decides when to evict the 
state row: the TTL or the watermark.
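
To make the question concrete, here is a rough sketch in plain Python. It is a toy model, not Spark's state store or the code in this PR; `DedupState`, `gap`, `refresh_on_duplicate` and `run` are names made up for illustration, and event times are assumed to be non-negative integers. It compares keeping the first event time in state against refreshing it to the maximum event time seen so far, with eviction driven by a watermark in both cases.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, Iterable, List, Tuple

Row = Tuple[Hashable, int]  # (dedup key, event time)


@dataclass
class DedupState:
    """Toy deduplication state; an illustration only, not Spark's implementation."""
    gap: int                    # hypothetical watermark delay, same unit as event time
    refresh_on_duplicate: bool  # True: store the max event time seen for the key
    rows: Dict[Hashable, int] = field(default_factory=dict)  # key -> stored event time
    max_event_time: int = 0

    def process_batch(self, batch: Iterable[Row]) -> List[Row]:
        """Emit rows whose key is not in state, then evict rows behind the watermark."""
        out: List[Row] = []
        for key, ts in batch:
            self.max_event_time = max(self.max_event_time, ts)
            if key not in self.rows:
                self.rows[key] = ts          # first sighting: store and emit
                out.append((key, ts))
            elif self.refresh_on_duplicate:
                self.rows[key] = max(self.rows[key], ts)  # duplicate: refresh stored time
        # Evict after the batch, using a watermark derived from the max event time.
        watermark = self.max_event_time - self.gap
        self.rows = {k: t for k, t in self.rows.items() if t >= watermark}
        return out


def run(refresh: bool) -> List[Row]:
    """Feed duplicates of one key, one row per batch, and collect all emitted rows."""
    state = DedupState(gap=10, refresh_on_duplicate=refresh)
    emitted: List[Row] = []
    for ts in (0, 5, 10, 15, 20, 25):
        emitted.extend(state.process_batch([("a", ts)]))
    return emitted


print(run(refresh=False))  # the key is emitted again after its state row is evicted
print(run(refresh=True))   # the key is emitted once while duplicates keep arriving
```

In this model, keeping the first event time leads to a second output once the original row falls behind the watermark, which is the behaviour described in the quoted comment. Refreshing the stored event time keeps the row alive for as long as duplicates keep arriving within the gap. A TTL-based variant would presumably look the same except that the eviction clock would not come from the watermark.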


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


