gengliangwang commented on PR #55636: URL: https://github.com/apache/spark/pull/55636#issuecomment-4357438354
@viirya re-reading your concern more carefully: the atomic-microbatch contract from dee5e84 only covers the *same-commit-split* case. The case you also called out, *different commits with the same `_commit_timestamp` arriving in different micro-batches*, is still broken under that contract alone, since Spark's late-event filter also uses `LessThanOrEqual` (the same `WatermarkSupport.watermarkExpression` at `statefulOperators.scala:633-651` is shared between eviction and late-event filtering). So a v2 row at ts=T arriving in batch 2, after batch 1 advanced the watermark to T, would be silently dropped as late.

I tightened the contract again in ffa0646 to add a second requirement: **distinct `_commit_version` values must have distinct `_commit_timestamp` values** when streaming post-processing is enabled. That rules out the different-commit collision. Atomic-commit CDC connectors that derive `_commit_timestamp` from wall-clock time at commit time (Delta, Iceberg) naturally satisfy this, and the contract fails fast at the connector boundary if a connector violates it (rather than silently producing wrong results).

I still didn't go with a non-zero watermark delay because it would impose latency on every user even when the connector respects the contract; the explicit two-requirement contract makes the failure mode discoverable instead. Happy to revisit if real connectors turn out to need the slack.
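For anyone following along, here is a rough, Spark-free sketch of the failure mode and of the two-requirement check. It is a simplified model, not the code in this PR: `filter_late` and `check_contract` are hypothetical names, rows are reduced to `(commit_version, commit_timestamp)` pairs, and the watermark is assumed to be the maximum timestamp of the previous batch (zero delay).

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical stand-in for a CDC row: (commit_version, commit_timestamp).
Row = Tuple[int, int]


def filter_late(rows: Iterable[Row], watermark: int) -> List[Row]:
    """Model of an inclusive late-event filter: a row is late when ts <= watermark."""
    return [(version, ts) for (version, ts) in rows if not ts <= watermark]


def check_contract(batches: Iterable[Iterable[Row]]) -> None:
    """Fail fast if either requirement of the (modelled) contract is violated.

    1. All rows of one commit version land in a single micro-batch.
    2. Distinct commit versions carry distinct commit timestamps.
    """
    batch_by_version: Dict[int, int] = {}
    version_by_ts: Dict[int, int] = {}
    for index, batch in enumerate(batches):
        for version, ts in batch:
            if batch_by_version.setdefault(version, index) != index:
                raise ValueError(f"commit {version} is split across micro-batches")
            if version_by_ts.setdefault(ts, version) != version:
                raise ValueError(
                    f"commits {version_by_ts[ts]} and {version} share timestamp {ts}"
                )


# Batch 1 holds commit v1 at ts=100; with zero delay the watermark becomes 100.
batch1 = [(1, 100)]
watermark = max(ts for _, ts in batch1)

# Batch 2 holds a different commit (v2) that also has ts=100.
batch2 = [(2, 100)]
surviving = filter_late(batch2, watermark)  # v2's row is treated as late

# The second requirement turns that silent drop into an explicit error.
try:
    check_contract([batch1, batch2])
    violation = None
except ValueError as err:
    violation = str(err)
```

In this model `surviving` ends up empty, which is the silent drop described above, while `check_contract` raises on the same input so the problem surfaces at the connector boundary instead.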
