HeartSaVioR edited a comment on pull request #35521: URL: https://github.com/apache/spark/pull/35521#issuecomment-1043728126
The semantics of the watermark gap is an allowance for "late events". The semantics of the TTL here is an allowance for "events in the near future". The two guarantees point in opposite directions.

Suppose we have an event E2 timestamped 12:00 as input, and there was an earlier event E1 timestamped 11:50. When E2 is processed, whether E1 is still available depends entirely on how far the watermark has advanced. E1 may already have been evicted, in which case E2 is emitted as output. E1 may also still be retained, depending on the watermark gap and the advance of the watermark. The watermark guarantee does not cover this case.

With TTL and event-time semantics, if the TTL is set to 30 minutes, the watermark semantics guarantee that E1 is available when E2 is processed: the earliest time E1 can be evicted is 12:20 (11:50 + 30 minutes). It is possible that E1 lives longer than the TTL (for a specific single batch) and deduplicates more events than we expect (the watermark guarantee is one-way, as we documented in the SS guide doc) - we could make it strict (by adding comparison logic), or leave the loose guarantee as it is.
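To make the E1/E2 arithmetic concrete, here is a minimal sketch of event-time TTL deduplication. It is a toy model, not Spark's implementation: the class `TtlDedupState`, the `strict` flag, the helper `at`, and the rule that late events are dropped before deduplication are all assumptions made for illustration. The `strict` flag is one possible reading of the "comparison logic" mentioned above.

```python
from datetime import datetime, timedelta


class TtlDedupState:
    """Toy model of event-time TTL deduplication (not Spark's implementation)."""

    def __init__(self, ttl, strict=False):
        self.ttl = ttl
        self.strict = strict
        self.expiry = {}  # key -> event time at which the key becomes evictable

    def process(self, key, event_time, watermark):
        # Evict keys whose expiry time the watermark has reached.
        self.expiry = {k: t for k, t in self.expiry.items() if t > watermark}
        if event_time <= watermark:
            return "late"  # assumed: late events are dropped before deduplication
        expires_at = self.expiry.get(key)
        # Loose mode: any key still in state deduplicates the event.
        # Strict mode (hypothetical): also require the event to fall inside the TTL.
        if expires_at is not None and not (self.strict and event_time >= expires_at):
            return "duplicate"
        self.expiry[key] = event_time + self.ttl
        return "emitted"


def at(hour, minute):
    # Arbitrary calendar date; only the time of day matters here.
    return datetime(2022, 1, 1, hour, minute)


loose = TtlDedupState(ttl=timedelta(minutes=30))
print(loose.process("k", at(11, 50), watermark=at(11, 40)))  # E1
print(loose.process("k", at(12, 0), watermark=at(11, 55)))   # E2
print(loose.process("k", at(12, 25), watermark=at(12, 10)))  # past the TTL, E1 not yet evicted
```

In this sketch E1 stays in state until the watermark reaches 12:20, so E2 is always deduplicated as long as E2 itself is not late. The third call shows the loose side of the guarantee: the watermark is still at 12:10, E1 has not been evicted, and an event at 12:25 is deduplicated even though it falls outside the 30-minute TTL. Constructing the state with `strict=True` would emit that event instead.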
