viirya edited a comment on pull request #35521:
URL: https://github.com/apache/spark/pull/35521#issuecomment-1039933470


   > Suppose we have perfect watermark with no delay allowance, and there is no 
event being out of order, then streaming dedup will do nothing on deduplication 
because effectively it will register the row in the state and evict it 
immediately. This will happen if you use watermark but use it like the semantic 
of "processing time".
   
   How could that happen? IIUC, the watermark predicate should be watermark 
column <= current watermark (max event time seen in the last batch?). When there 
are no out-of-order events, isn't an input row's watermark column always > the 
current watermark (i.e. the watermark predicate is false)? Why would it be 
evicted immediately? Won't it be evicted in the next batch?
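
   To make the question concrete, here is a toy model of the behavior I have in 
mind. It is only a sketch and not Spark code: `run_batches` and its bookkeeping 
are hypothetical, and it assumes the watermark used in a batch is the max event 
time seen in earlier batches minus the delay, with state rows evicted when 
event time <= watermark.

```python
def run_batches(batches, delay=0):
    """Toy model of watermark-based dedup state eviction (not Spark code).

    Assumptions: the watermark for a batch is the max event time seen in all
    earlier batches minus `delay`; rows with event_time <= watermark are
    dropped as late on input and evicted from state at the end of the batch.
    """
    state = {}             # dedup key -> event time of the first occurrence
    max_event_time = None  # max event time seen so far
    log = []
    for batch in batches:
        watermark = None if max_event_time is None else max_event_time - delay
        emitted = []
        for key, event_time in batch:
            if watermark is not None and event_time <= watermark:
                continue  # late row: dropped before reaching the state
            if key not in state:
                state[key] = event_time  # first occurrence: register and emit
                emitted.append(key)
        # Evict state rows that satisfy the watermark predicate.
        evicted = sorted(
            k for k, t in state.items() if watermark is not None and t <= watermark
        )
        for k in evicted:
            del state[k]
        # The watermark only advances for the next batch.
        for _, event_time in batch:
            if max_event_time is None or event_time > max_event_time:
                max_event_time = event_time
        log.append((watermark, emitted, evicted, sorted(state)))
    return log


if __name__ == "__main__":
    in_order = [[("a", 1), ("b", 2)], [("c", 3), ("a", 4)], [("d", 5)]]
    for entry in run_batches(in_order, delay=0):
        print(entry)
```

   Under these assumptions, with zero delay and in-order input, "a" and "b" stay 
in state through the first batch and are only evicted in the second one, so the 
duplicate "a" arriving in the second batch is still filtered out.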
   
   TTL may be a solution here. However, watermarks seem to be more commonly 
used in Structured Streaming operators. Do we have any stateful operators with 
TTL? Or do we need to introduce a state TTL mechanism for this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


