gengliangwang commented on PR #55636:
URL: https://github.com/apache/spark/pull/55636#issuecomment-4357417610

   @viirya re: the streaming `_commit_timestamp` contract concern: you're 
right. I verified against `statefulOperators.scala:643-650` that the eviction 
predicate is `LessThanOrEqual` (i.e., a group at event time T is evicted once 
the watermark is >= T), so equal-timestamp rows arriving in a later micro-batch 
*would* be dropped as late under the previous "non-decreasing" wording.
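
   To illustrate the failure mode, here is a minimal Python sketch. It is a simplified model, not Spark's implementation: `run_zero_delay` and the `(timestamp, payload)` row shape are hypothetical names, and the only behavior taken from the discussion is the `<=` comparison against a zero-delay watermark that advances between micro-batches.

```python
from typing import List, Optional, Tuple

Row = Tuple[int, str]  # (commit timestamp, payload); hypothetical row shape


def run_zero_delay(batches: List[List[Row]]) -> Tuple[List[Row], List[Row]]:
    """Simplified model: filter late rows, then advance the watermark per batch."""
    watermark: Optional[int] = None  # no watermark before the first batch
    accepted: List[Row] = []
    dropped: List[Row] = []
    for batch in batches:
        for ts, payload in batch:
            # Same comparison as the eviction predicate: ts <= watermark is late.
            if watermark is not None and ts <= watermark:
                dropped.append((ts, payload))
            else:
                accepted.append((ts, payload))
        if batch:
            # Zero delay: the watermark becomes the max timestamp seen so far.
            batch_max = max(ts for ts, _ in batch)
            watermark = batch_max if watermark is None else max(watermark, batch_max)
    return accepted, dropped


# Commit 100 split across two micro-batches (allowed by "non-decreasing").
accepted, dropped = run_zero_delay([[(100, "a")], [(100, "b")]])
print("accepted:", accepted)
print("dropped:", dropped)
```

   In this model the second row of commit 100 lands in `dropped`, because the watermark already equals 100 when the second micro-batch is filtered.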
   
   I tightened the contract in dee5e84: `Changelog.java` now requires that **all 
rows of a single commit appear in the same micro-batch** (i.e. micro-batch 
boundaries align with commit boundaries) for streaming reads with 
post-processing. This is the natural atomic-commit emission pattern of real CDC 
connectors (Delta versions, Iceberg snapshots), and it makes the zero-delay 
watermark sound: all rows of a commit are observed in one micro-batch, before 
the watermark can advance to that commit's timestamp.
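
   As a rough illustration of what the tightened contract asks of a connector, the sketch below flags commits whose rows span more than one micro-batch. It is not part of the PR: `split_commits` is a hypothetical helper, and it assumes each commit is identified by a distinct commit timestamp.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, str]  # (commit timestamp, payload); hypothetical row shape


def split_commits(batches: List[List[Row]]) -> List[int]:
    """Return commit timestamps whose rows appear in more than one micro-batch."""
    first_seen: Dict[int, int] = {}  # commit timestamp -> index of first batch
    violations: List[int] = []
    for index, batch in enumerate(batches):
        for ts, _ in batch:
            # Remember the first batch a commit showed up in.
            seen_in = first_seen.setdefault(ts, index)
            if seen_in != index and ts not in violations:
                violations.append(ts)
    return violations


# Aligned with commit boundaries: no violations.
print(split_commits([[(100, "a"), (100, "b")], [(200, "c")]]))
# Commit 100 straddles two micro-batches: reported as a violation.
print(split_commits([[(100, "a")], [(100, "b"), (200, "c")]]))
```

   A micro-batch may still carry several commits in this sketch; only splitting one commit across batches counts as a violation.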
   
   I considered a non-zero delay as a belt-and-suspenders guard, but it would 
only paper over connectors that violate the atomic micro-batch rule (a contract 
violation anyway), and would delay *every* user's output by the chosen 
interval. Happy to revisit if the explicit contract turns out to be too 
restrictive in practice.
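
   The cost of a non-zero delay can be sketched with a toy model. Assumptions: one commit per micro-batch, the watermark is the maximum timestamp seen minus the delay, and a group at event time T is finalized once the watermark is >= T. `finalizing_batch` is a hypothetical name, and the timestamps are made-up values.

```python
from typing import List, Optional


def finalizing_batch(commit_ts: List[int], target: int, delay: int) -> Optional[int]:
    """Index of the first batch after which watermark >= target, else None."""
    max_seen: Optional[int] = None
    for index, ts in enumerate(commit_ts):
        max_seen = ts if max_seen is None else max(max_seen, ts)
        watermark = max_seen - delay  # delay = 0 models the zero-delay watermark
        if watermark >= target:
            return index
    return None


commits = [100, 200, 300]  # one commit per micro-batch
print(finalizing_batch(commits, target=100, delay=0))
print(finalizing_batch(commits, target=100, delay=150))
```

   With zero delay the group at T = 100 is finalized by the batch that carries it; with a delay of 150 it waits until a commit at 250 or later has been seen, which is the per-user latency cost described above.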


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

