HeartSaVioR commented on pull request #35521:
URL: https://github.com/apache/spark/pull/35521#issuecomment-1043728126


   The watermark gap is an allowance for "late events". The TTL here is an 
allowance for "events in the near future". The two guarantees point in opposite 
directions.
   
   Suppose the input contains an event E2 timestamped 12:00, and there was an 
earlier event E1 timestamped 11:50. When E2 is processed, whether E1 is still 
in state depends entirely on how far the watermark has advanced. E1 may already 
have been evicted, in which case E2 is emitted as output instead of being 
deduplicated. The watermark guarantee does not guard against that.
   
   With TTL and event-time semantics, if the TTL is set to 30 minutes, the 
watermark semantics guarantee that E1 is still available when E2 is processed. 
The earliest time E1 can be evicted is 12:20 (11:50 + 30 minutes). It is 
possible that E1 lives longer than the TTL (for a specific single batch) and 
deduplicates more events than we expect (the watermark guarantee is one-way, as 
we documented in the SS guide doc). We could make it strict (by adding 
comparison logic), or leave the loose guarantee as it is.
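
   To make the E1/E2 example concrete, here is a toy model of event-time TTL 
deduplication state. It is only a sketch under assumptions, not the code in 
this PR: the class `TtlDedupState`, its `strict` flag, the per-call `watermark` 
argument, and the calendar date are all hypothetical, and eviction is done 
inline rather than at the end of a micro-batch.

```python
from datetime import datetime, timedelta


class TtlDedupState:
    """Toy model of event-time TTL deduplication (hypothetical, not Spark code)."""

    def __init__(self, ttl, strict=False):
        self.ttl = ttl
        self.strict = strict   # assumed reading of the "comparison logic" option
        self.expires_at = {}   # key -> event time at which the entry expires

    def process(self, key, event_time, watermark):
        # Events behind the watermark are late and are dropped.
        if event_time < watermark:
            return "dropped-late"
        # Evict entries whose expiry is at or before the watermark.
        self.expires_at = {
            k: t for k, t in self.expires_at.items() if t > watermark
        }
        expiry = self.expires_at.get(key)
        # Loose mode: any surviving entry deduplicates the event.
        # Strict mode: the event must also fall before the entry's expiry.
        if expiry is not None and (not self.strict or event_time < expiry):
            return "duplicate"
        self.expires_at[key] = event_time + self.ttl
        return "emitted"


def at(hour, minute):
    # The date is arbitrary; only the time of day matters for the example.
    return datetime(2022, 2, 18, hour, minute)


ttl = timedelta(minutes=30)

loose = TtlDedupState(ttl)
print(loose.process("k", at(11, 50), watermark=at(11, 40)))  # emitted (E1)
print(loose.process("k", at(12, 0), watermark=at(12, 0)))    # duplicate (E2)
# The watermark lags, so E1 outlives its TTL and still deduplicates.
print(loose.process("k", at(12, 25), watermark=at(12, 0)))   # duplicate

strict = TtlDedupState(ttl, strict=True)
print(strict.process("k", at(11, 50), watermark=at(11, 40)))  # emitted (E1)
print(strict.process("k", at(12, 25), watermark=at(12, 0)))   # emitted
```

   E2 cannot be processed unless the watermark is at or before 12:00, and E1 
only becomes evictable once the watermark reaches 12:20, so E1 is always 
present for E2. The last loose call shows the one-way nature of the guarantee: 
an entry may linger past its TTL and deduplicate more than expected.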


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


