HeartSaVioR edited a comment on pull request #35521:
URL: https://github.com/apache/spark/pull/35521#issuecomment-1039946618


   > How it could happen? IIUC, watermark predicate should be watermark column 
<= current watermark (max event time seen in last batch?). When no out of order 
events, isn't a input row's watermark column always > current watermark? (i.e. 
watermark predicate is false)? Why it will be evicted immediately? Won't it be 
evicted in next batch?
   
   Yeah my bad I simplified on explanation too much. That is effectively 
evicted in next batch because we calculate the watermark at the end of 
micro-batch. How it would be helpful if the state row keeps around a single 
batch?
   
   We have to think thoughtfully about "when" we can evict the state row 
safely. Suppose the input rows having all duplications keep coming with 
timestamp. With event time semantic and set of the watermark gap we can 
deduplicate a group of input rows, but once the state row is evicted out we 
will produce a new output and put to state, which makes the output be 
indeterministic. How long it will deduplicate the events depends on the event 
time value of the first event.
   
   We still need to try hard to keep the output of batch query and streaming 
query be same. Letting state grow indefinitely would bring the same output 
(unless they leverage other columns than specified in dropDuplicate), so we 
need to start from there, and find a way how to deal with state growing with 
minimizing on hurting the output, say, tolerable way.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to