Re: [PR] [SPARK-56686][SQL] Support streaming row-level CDC post-processing [spark]

via GitHub Thu, 30 Apr 2026 19:19:40 -0700


gengliangwang commented on PR #55636:
URL: https://github.com/apache/spark/pull/55636#issuecomment-4357438354


   @viirya re-reading your concern more carefully: the atomic-microbatch 
contract from dee5e84 only covers the *same-commit-split* case. The case you 
also called out — *different commits with the same `_commit_timestamp` arriving 
in different micro-batches* — is still broken under that contract alone, since 
Spark's late-event filter also uses `LessThanOrEqual` (the same 
`WatermarkSupport.watermarkExpression` at `statefulOperators.scala:633-651` is 
shared between eviction and late-event filtering). So a v2-row at ts=T arriving 
in batch 2 after batch 1 advanced the watermark to T would be silently dropped 
as late.
   
   Tightened the contract again in ffa0646 to add a second requirement: 
**distinct `_commit_version` values must have distinct `_commit_timestamp` 
values** when streaming post-processing is enabled. That rules out the 
different-commit collision. Atomic-commit CDC connectors that derive 
`_commit_timestamp` from wall-clock time at commit time (Delta, Iceberg) 
naturally satisfy this — and the contract fails fast at the connector boundary 
if a connector violates it (rather than silently producing wrong results).
   
   I still didn't go with a non-zero watermark delay because it would impose 
latency on every user even when the connector respects the contract; the 
explicit two-requirement contract makes the failure mode discoverable instead. 
Happy to revisit if real connectors turn out to need the slack.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-56686][SQL] Support streaming row-level CDC post-processing [spark]

Reply via email to