HeartSaVioR edited a comment on pull request #35521:
URL: https://github.com/apache/spark/pull/35521#issuecomment-1039922686


   I agree with the problem description, but I'd like to see a more thoughtful 
solution.
   
   Suppose we have a perfect watermark with no delay allowance, and no events 
arrive out of order. Streaming dedup then deduplicates nothing, because it 
effectively registers each row in the state and evicts it immediately. This 
happens when you use a watermark but treat it with the semantics of 
"processing time".
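   
   To illustrate, here is a toy model in plain Python. It is not Spark code: 
the `WatermarkDedup` class, its per-batch watermark update, and its eviction 
rule are all assumptions made for this sketch.

```python
# Toy model of watermark-driven state eviction for streaming dedup.
# Hypothetical class, NOT a Spark API; the eviction rule is an assumption.

class WatermarkDedup:
    def __init__(self, delay):
        self.delay = delay                  # watermark delay allowance
        self.watermark = float("-inf")      # watermark applied to the next batch
        self.state = {}                     # key -> event time of first occurrence

    def process_batch(self, rows):
        """rows: list of (key, event_time). Returns the rows emitted as first-seen."""
        emitted = []
        for key, event_time in rows:
            if event_time <= self.watermark:
                continue                    # late row, dropped
            if key not in self.state:
                self.state[key] = event_time
                emitted.append((key, event_time))
        if rows:
            # Advance the watermark to (newest event time - delay), never backwards.
            newest = max(t for _, t in rows)
            self.watermark = max(self.watermark, newest - self.delay)
        # Evict every state row at or below the watermark.
        self.state = {k: t for k, t in self.state.items() if t > self.watermark}
        return emitted


def run(delay):
    dedup = WatermarkDedup(delay)
    # The same key arrives three times, in order, one row per batch.
    batches = [[("a", 1)], [("a", 2)], [("a", 3)]]
    return [dedup.process_batch(batch) for batch in batches]


zero_delay = run(0)    # every "a" row is emitted: the state is evicted each batch
with_delay = run(10)   # only the first "a" row is emitted
```

   With `delay=0` the state is empty again at the end of every batch, so each 
later duplicate looks like a new key.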
   
   Personally, for this case, applying a TTL to state rows would make more 
sense to me. If we don't want to require a watermark for this functionality, 
the TTL ends up based on wall/processing time, which may produce 
non-deterministic results, but with a large TTL interval such as 2 hours, 
tolerating that behavior would be acceptable. If we do require a watermark so 
that the TTL works against the event time column, we may even produce 
deterministic results, except for late events.
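   
   A rough sketch of the TTL idea, again as a plain Python toy model. The 
`TtlDedup` class and the `replay` helper are hypothetical, and for simplicity 
the event-time variant feeds each row's event time in as the clock instead of 
a real watermark.

```python
# Toy model of TTL-based dedup state. Hypothetical names, NOT a Spark or Flink API.

class TtlDedup:
    def __init__(self, ttl):
        self.ttl = ttl
        self.state = {}   # key -> clock value when the key was first seen

    def process(self, key, now):
        """Returns True if the row is emitted, False if dropped as a duplicate."""
        # Expire state rows whose TTL has elapsed as of `now`.
        self.state = {k: t for k, t in self.state.items() if now - t < self.ttl}
        if key in self.state:
            return False
        self.state[key] = now
        return True


def replay(ttl, rows):
    """rows: list of (key, clock_value). The clock is whatever the caller picks."""
    dedup = TtlDedup(ttl)
    return [dedup.process(key, now) for key, now in rows]


TTL_MINUTES = 120  # "2 hours", as in the example above

# Event-time clock: the result depends only on the data, so replays agree.
event_time_run = replay(TTL_MINUTES, [("a", 0), ("a", 100)])

# Processing-time clock: the same two rows, but arrival times differ per run.
wall_clock_run_1 = replay(TTL_MINUTES, [("a", 0), ("a", 100)])
wall_clock_run_2 = replay(TTL_MINUTES, [("a", 0), ("a", 500)])  # e.g. after an outage
```

   The two wall-clock runs disagree on whether the second row is a duplicate, 
which is the non-determinism mentioned above. The event-time run gives the 
same answer on every replay.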
   
   Reference:
   
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/select-distinct/
   
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/config/#table-exec-state-ttl
   
   Note that Flink only supports processing time (wall time) semantics for 
state TTL, if I understand correctly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



