viirya commented on pull request #35521: URL: https://github.com/apache/spark/pull/35521#issuecomment-1042730217
> We have to think carefully about "when" we can evict the state row safely. Suppose input rows that are all duplicates keep arriving with timestamps. With event-time semantics and a watermark gap we can deduplicate a group of input rows, but once the state row is evicted we will produce a new output and put it into state, which makes the output non-deterministic. How long the events are deduplicated depends on the event time of the first event.

This is a good point. Even with a watermark, couldn't we also update the event time in the state store to the maximum event time among the duplicated rows seen so far? The only difference would then be which one decides when to evict the state row: TTL or watermark.
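
To make the question concrete, here is a toy sketch in plain Python. It is not Spark code and does not use any Spark API: the `dedup` function, its `refresh` flag, and the per-event watermark update are all assumptions made for illustration (Spark advances the watermark per micro-batch, and its exact eviction comparison may differ). It only contrasts keeping the first event time in state with refreshing it to the maximum event time seen among duplicates.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, int]  # (key, event_time)


def dedup(events: List[Event], delay: int, refresh: bool) -> List[Event]:
    """Toy watermark-based deduplication (hypothetical, not Spark's implementation).

    events:  (key, event_time) pairs in arrival order.
    delay:   watermark delay threshold.
    refresh: if True, a duplicate bumps the stored event time to the
             maximum seen so far; if False, the first event time is kept.
    Returns the events that were emitted downstream.
    """
    state: Dict[str, int] = {}   # key -> event time stored in state
    emitted: List[Event] = []
    max_seen = None              # largest event time observed so far

    for key, ts in events:
        max_seen = ts if max_seen is None else max(max_seen, ts)
        watermark = max_seen - delay

        # Evict state rows whose stored event time fell behind the watermark.
        for k in [k for k, t in state.items() if t < watermark]:
            del state[k]

        if key in state:
            # Duplicate: drop it, optionally refreshing the stored event time.
            if refresh:
                state[key] = max(state[key], ts)
        else:
            # First sighting (or first after eviction): emit and remember.
            emitted.append((key, ts))
            state[key] = ts

    return emitted


events = [("a", 0), ("a", 5), ("a", 10), ("a", 15), ("a", 20)]
first_time_only = dedup(events, delay=10, refresh=False)
refreshed = dedup(events, delay=10, refresh=True)
print(first_time_only)
print(refreshed)
```

In this toy model, keeping only the first event time lets the state row for `"a"` be evicted while duplicates are still arriving, so a second output is emitted; refreshing the stored event time keeps the row alive as long as duplicates keep coming.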
