gengliangwang commented on PR #55636: URL: https://github.com/apache/spark/pull/55636#issuecomment-4357417610
@viirya re: the streaming `_commit_timestamp` contract concern: you're right. I verified against `statefulOperators.scala:643-650` that the eviction predicate is `LessThanOrEqual` (i.e., a group at event time T is evicted once watermark >= T), so equal-timestamp rows arriving in a later micro-batch *would* be dropped as late under the previous "non-decreasing" wording.

I tightened the contract in dee5e84: for streaming reads with post-processing, `Changelog.java` now requires that **all rows of a single commit appear in the same micro-batch** (i.e., micro-batch boundaries align with commit boundaries). This is the natural atomic-commit emission pattern of real CDC connectors (Delta versions, Iceberg snapshots), and it makes the zero-delay watermark sound: all rows of a commit are observed within one micro-batch, before the watermark advances to that commit's timestamp.

I considered a non-zero delay as a belt-and-suspenders guard, but it would only paper over connectors that violate the atomic-microbatch rule (a contract violation anyway), and it would delay *every* user's output by the chosen interval. Happy to revisit if the explicit contract turns out to be too restrictive in practice.
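
For illustration only, here is a toy Python model of the failure mode (not Spark code). It assumes a zero-delay watermark that advances to the maximum timestamp seen once a micro-batch finishes, and treats a row as late when its timestamp is `<=` the current watermark, mirroring the `LessThanOrEqual` predicate. The `simulate` function and the row layout are hypothetical names made up for this sketch.

```python
from typing import List, Optional, Tuple

Row = Tuple[int, str]  # (commit timestamp, payload); hypothetical layout


def simulate(batches: List[List[Row]]) -> Tuple[List[Row], List[Row]]:
    """Toy model of a zero-delay watermark; returns (accepted, dropped) rows."""
    watermark: Optional[int] = None  # no watermark until a batch has completed
    accepted: List[Row] = []
    dropped: List[Row] = []
    for batch in batches:
        for ts, payload in batch:
            # Assumed lateness rule: late when ts <= watermark (LessThanOrEqual).
            if watermark is not None and ts <= watermark:
                dropped.append((ts, payload))
            else:
                accepted.append((ts, payload))
        # Zero delay: after the batch, the watermark moves to the max timestamp seen.
        if batch:
            batch_max = max(ts for ts, _ in batch)
            watermark = batch_max if watermark is None else max(watermark, batch_max)
    return accepted, dropped


# Commit 100 is split across two micro-batches: its second row arrives late.
split = [[(100, "a")], [(100, "b"), (200, "c")]]
# Micro-batch boundaries align with commit boundaries: nothing is dropped.
aligned = [[(100, "a"), (100, "b")], [(200, "c")]]

split_result = simulate(split)
aligned_result = simulate(aligned)
```

In the split case the row `(100, "b")` is dropped because the watermark already sits at 100 when it arrives; in the aligned case every row is accepted, which is the property the tightened contract relies on.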
