HeartSaVioR edited a comment on pull request #35521: URL: https://github.com/apache/spark/pull/35521#issuecomment-1039946618
> How it could happen? IIUC, watermark predicate should be watermark column <= current watermark (max event time seen in last batch?). When no out of order events, isn't a input row's watermark column always > current watermark? (i.e. watermark predicate is false)? Why it will be evicted immediately? Won't it be evicted in next batch? Yeah my bad I simplified on explanation too much. That is effectively evicted in next batch because we calculate the watermark at the end of micro-batch. How it would be helpful if the state row keeps around a single batch? We have to think thoughtfully about "when" we can evict the state row safely. Suppose the input rows having all duplications keep coming with timestamp. With event time semantic and set of the watermark gap we can deduplicate a group of input rows, but once the state row is evicted out we will produce a new output and put to state, which makes the output be indeterministic. How long it will deduplicate the events depends on the event time value of the first event. We still need to try hard to keep the output of batch query and streaming query be same. Letting state grow indefinitely would bring the same output (unless they leverage other columns than specified in dropDuplicate), so we need to start from there, and find a way how to deal with state growing with minimizing on hurting the output, say, tolerable way. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
