HeartSaVioR commented on PR #48297: URL: https://github.com/apache/spark/pull/48297#issuecomment-2408305067
The overall direction for the watermark is to advance it as fast as we consider safe without breaking the simplicity of the current watermark model (there may be a trade-off). I may not have put the design discussion into the JIRA ticket, but I received a suggestion internally when I designed support for multiple stateful operators: why not advance the watermark based on the state watermark, e.g. based on completed windows for window aggregation? This technically delays the advance of the watermark by one batch "per operator", because of how we calculate and propagate the watermark (at planning time rather than within the microbatch). So we rejected it and tolerate some tricky situations like this one. That said, the current behavior is by design.

If you look at the feedback from @andrzejzera, who reported the correctness issue, he even said the behavior is hard to follow intuitively because we produce output later than is theoretically possible. https://lists.apache.org/thread/ysxmtqc1kycthnk0wjmts9sztkt1ofp2

So delaying output even further does not sound like an option to me.
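
To make the "one batch per operator" delay concrete, here is a toy model of the rejected idea. It is a sketch under stated assumptions, not Spark's actual code: the function name `planned_watermarks`, the starting watermark of 0, and the sample values are all hypothetical. It assumes every operator's input watermark is fixed at planning time, and that an operator's output watermark only becomes visible to the next operator when the following batch is planned.

```python
def planned_watermarks(source_watermarks, num_operators):
    """Toy model: table[b][i] is the watermark operator i uses in batch b.

    Assumption (hypothetical, not Spark internals): each operator's output
    watermark equals the input watermark it was planned with, and the
    downstream operator only sees it when the next batch is planned.
    """
    # Output watermarks of each operator at the end of the previous batch.
    prev_output = [0] * num_operators
    table = []
    for source_wm in source_watermarks:
        # Planning: all input watermarks are fixed before the batch runs.
        # Operator 0 reads the source; operator i reads operator i-1's
        # output from the previous batch.
        inputs = [source_wm] + prev_output[:-1]
        table.append(inputs)
        # Execution: each operator completes windows up to its input watermark.
        prev_output = inputs
    return table


table = planned_watermarks([10, 20, 30, 40], num_operators=3)
for batch_id, row in enumerate(table):
    print(batch_id, row)
```

In this model the third operator only reaches watermark 10 in the third batch, two batches after the source did, which is the kind of compounding delay described above.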
