HyukjinKwon commented on PR #56721: URL: https://github.com/apache/spark/pull/56721#issuecomment-4786346873
**Self-review (code-review skill, high effort) — posted as a comment, not an approval.** Scope: 1 file, +21/-2 (test only). Reviewed for correctness, removed-behavior, cross-file, simplification, altitude, conventions. ### Correctness ✅ (no bugs found) - **Version is right:** the wait polls for snapshot version `2` (`2(_.*)?\.zip`), which is exactly what the reader needs — `snapshotStartBatchId = 1` ⇒ snapshot version `1 + 1 = 2`, and matches the original failure path (`.../state/0/1/2.zip`). - **Partitions are right:** it waits on partitions `1` and `4`, the same partitions the test reads below (`snapshotPartitionId` 1 and 4); `operatorId = 0` ⇒ `state/0`. `SHUFFLE_PARTITIONS = 5` so those partitions exist. - **Null-safe:** `Option(partitionDir.listFiles())` handles the dir-not-yet-created case (returns `Seq.empty`, keeps polling) instead of NPE. - **Regex is anchored:** `matches` is whole-string, so `2(_.*)?\.zip` matches `2.zip` and `2_<uid>.zip` (v1 and checkpoint-v2 naming) but not `12.zip`/`20.zip` — no false positives across the suite subclasses. - **No lost coverage vs the removed `Thread.sleep`:** changelogs are uploaded synchronously at commit, so only the snapshot `.zip` needed the wait; the new wait covers exactly that, deterministically. Waiting immediately before `StopStream` also guarantees the file is present when state freezes (strictly better than the sleep). - Validated by running the test **10×** in one sbt session (all green, linked in the description). ### Minor (maintainability) - The snapshot version `2` is hardcoded to track the hardcoded `snapshotStartBatchId = 1`. If a future edit changes `snapshotStartBatchId`, this wait silently goes stale (times out / waits on the wrong file). Low risk since both are test-local constants a few lines apart, but a one-line comment tying them together wouldn't hurt. ### Optional follow-up (altitude, out of scope here) - A deeper alternative would be to make the `snapshotStartBatchId` reader tolerant of a not-yet-uploaded snapshot at the product level, rather than each test waiting. That's a behavior change beyond a test deflake; flagging only as a possible future direction. No blocking concerns. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
