HeartSaVioR commented on pull request #35521: URL: https://github.com/apache/spark/pull/35521#issuecomment-1051550441
No, they are quite different. Suppose we have just seen event A at event time 12:00. Is the goal of deduplication to remove duplicates of A that have an event time before 12:00? No. The main goal of deduplication is to drop duplicates that arrive as future events (and as older events too, since we allow late events). Ideally we would deduplicate against all older events and all future events, but the former requires an infinite watermark gap (or unbounded state), and the latter requires unbounded state. The threshold for the former can be defined via the watermark gap, but the threshold for the latter should be defined via a TTL, not the watermark gap.
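To make the distinction concrete, here is a minimal sketch in plain Python. It is a toy model and not Spark's implementation or API: the class `TtlDeduplicator`, its fields `watermark_gap` and `ttl`, and the choice to evict state once the watermark passes a key's expiry are all assumptions made for illustration. The watermark gap bounds how far back a late event is accepted, while the TTL bounds how long a key is remembered for future duplicates.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable


@dataclass
class TtlDeduplicator:
    """Toy event-time deduplicator (hypothetical, not a Spark class)."""
    watermark_gap: int  # seconds an event may lag behind the max event time seen
    ttl: int            # seconds a key is remembered after it is first emitted
    max_event_time: int = 0
    expiry_by_key: Dict[Hashable, int] = field(default_factory=dict)

    @property
    def watermark(self) -> int:
        return self.max_event_time - self.watermark_gap

    def process(self, key: Hashable, event_time: int) -> bool:
        """Return True if the event is emitted, False if it is dropped."""
        # Older events: anything behind the watermark is too late and is dropped.
        if event_time < self.watermark:
            return False
        self.max_event_time = max(self.max_event_time, event_time)

        # Future events: forget keys whose TTL has passed (assumed eviction rule),
        # which is what keeps the state bounded.
        wm = self.watermark
        self.expiry_by_key = {
            k: expiry for k, expiry in self.expiry_by_key.items() if expiry > wm
        }

        if key in self.expiry_by_key:
            return False  # duplicate of a key still held in state
        self.expiry_by_key[key] = event_time + self.ttl
        return True


NOON = 12 * 3600  # 12:00 expressed as seconds since midnight
dedup = TtlDeduplicator(watermark_gap=600, ttl=3600)
results = [
    dedup.process("A", NOON),         # first A at 12:00: emitted
    dedup.process("A", NOON + 300),   # future duplicate at 12:05: dropped
    dedup.process("A", NOON - 300),   # late duplicate at 11:55, within the gap: dropped
    dedup.process("B", NOON + 4500),  # 13:15: watermark moves past A's expiry
    dedup.process("A", NOON + 4500),  # A is no longer in state: emitted again
]
print(results)
```

In this sketch the last A is emitted again because its state entry expired, which is the trade-off a TTL makes: bounded state in exchange for no longer catching duplicates that arrive after the TTL.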
