HyukjinKwon commented on PR #56721: URL: https://github.com/apache/spark/pull/56721#issuecomment-4786772644
⚠️ **Hold merge: this fix is not robust; I'm pushing a correction.**

A full Maven (Scala 2.13, JDK 21) integration run surfaced that the `(encoding = avro)` variant of this test still fails with the **new** wait:

```
snapshotStartBatchId with transformWithState (with changelog checkpointing) (encoding = avro) *** FAILED ***
  The code passed to eventually never returned normally. Attempted 608 times over 1.0 minutes. Last failure message: snapshotUploaded was false Snapshot (version 2) for partition 1 was not uploaded in time.
```

The root cause is deeper than my first patch assumed: the background maintenance thread snapshots the **current** version, so a version-2 `.zip` is only ever created while version 2 is the current version (between the 2nd and 3rd batches). Waiting for it at the **end** (when version 5 is current) can never make it appear, so when maintenance didn't happen to snapshot v2 during processing, the wait just times out (a different flake, not a fix).

Correct fix: wait for the version-2 snapshot **right after version 2 is committed** (while it is still current), which deterministically ensures that maintenance creates it before more batches advance the version.

I'll push that and re-validate **all** variants (unsaferow + avro) with multiple runs before this is merged. (My earlier validation only multi-ran the `unsaferow` variant; that was the gap.)
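
To make the ordering argument concrete, here is a minimal toy model in Python. It is not Spark code: `ToyStateStore`, `run_maintenance`, `wait_for_snapshot`, and the two scenario functions are hypothetical names invented for illustration. The only behavior it assumes is the one described above, namely that maintenance snapshots whatever version is current at the time it runs.

```python
class ToyStateStore:
    """Toy stand-in for a versioned state store (hypothetical, not Spark's API)."""

    def __init__(self):
        self.current_version = 0
        self.snapshots = set()  # versions that have an uploaded snapshot

    def commit_batch(self):
        # Each committed batch advances the current version by one.
        self.current_version += 1

    def run_maintenance(self):
        # Assumption from the comment above: maintenance snapshots only
        # the version that is current at the moment it runs.
        if self.current_version > 0:
            self.snapshots.add(self.current_version)


def wait_for_snapshot(store, version, max_ticks=100):
    """Poll for `version`, letting one maintenance pass run per tick."""
    for _ in range(max_ticks):
        store.run_maintenance()
        if version in store.snapshots:
            return True
    return False  # timed out


def wait_at_end(num_batches=5, target=2):
    # Flaky shape: maintenance happened not to run during processing,
    # and the wait only starts once the last version is current.
    store = ToyStateStore()
    for _ in range(num_batches):
        store.commit_batch()
    return wait_for_snapshot(store, target)


def wait_right_after_commit(num_batches=5, target=2):
    # Corrected shape: wait while the target version is still current.
    store = ToyStateStore()
    uploaded = False
    for _ in range(num_batches):
        store.commit_batch()
        if store.current_version == target:
            uploaded = wait_for_snapshot(store, target)
    return uploaded
```

In this model `wait_at_end()` returns `False` no matter how long it polls, because every maintenance pass snapshots version 5, while `wait_right_after_commit()` returns `True` on the first pass. The real test depends on the actual maintenance thread and timing, so this only illustrates the ordering, not the eventual patch.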
