HeartSaVioR commented on pull request #35521:
URL: https://github.com/apache/spark/pull/35521#issuecomment-1051550441


   No, they are quite different.
   
   Suppose we have just seen event A at event time 12:00. Is the goal of 
deduplication to remove duplicate events whose event time is "before 12:00"? 
No. The main goal of deduplication is to deduplicate "future events" (and 
"older events as well", since we allow late events).
   
   Ideally, we would deduplicate against "all" older events and all future 
events, but the former requires an infinite watermark gap (or unbounded state), 
and the latter requires unbounded state. The threshold for the former can be 
defined via the watermark gap, but the threshold for the latter should be 
defined via a TTL, not the watermark gap.
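
   To make the distinction concrete, here is a minimal toy sketch in plain 
Python. It is not Spark's implementation and not the code in this PR: the class 
`DedupSketch`, the parameters `late_gap` and `ttl`, and the choice to express 
the TTL in event time are all assumptions made for illustration. Times are 
minutes since midnight.

```python
from dataclasses import dataclass, field


@dataclass
class DedupSketch:
    """Toy model only; every name here is hypothetical."""
    late_gap: int            # watermark gap: bounds how OLD an accepted event may be
    ttl: int                 # bounds how long a seen key is remembered for FUTURE events
    max_event_time: int = 0
    state: dict = field(default_factory=dict)  # key -> expiry, in event time

    def watermark(self) -> int:
        return self.max_event_time - self.late_gap

    def process(self, key: str, event_time: int) -> str:
        # Older events: anything behind the watermark is dropped as too late.
        if event_time < self.watermark():
            return "dropped-late"
        self.max_event_time = max(self.max_event_time, event_time)

        # Future events: a key is forgotten once the watermark passes its expiry.
        wm = self.watermark()
        self.state = {k: exp for k, exp in self.state.items() if exp >= wm}

        if key in self.state:
            return "duplicate"
        self.state[key] = event_time + self.ttl
        return "emitted"


dedup = DedupSketch(late_gap=10, ttl=60)
events = [
    ("A", 720),  # 12:00, first sighting          -> emitted
    ("A", 715),  # late but within watermark gap  -> duplicate
    ("A", 750),  # future event within the TTL    -> duplicate
    ("A", 700),  # behind the watermark (740)     -> dropped-late
    ("B", 800),  # watermark moves to 790, A's state (expiry 780) is evicted
    ("A", 795),  # A was forgotten after its TTL  -> emitted again
]
results = [dedup.process(key, t) for key, t in events]
```

   In this sketch the two bounds are independent: `late_gap` only decides which 
old events are still accepted, while `ttl` only decides how long state is kept 
to catch duplicates that arrive later.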


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


