HyukjinKwon commented on PR #56721:
URL: https://github.com/apache/spark/pull/56721#issuecomment-4786346873

   **Self-review (code-review skill, high effort) — posted as a comment, not an 
approval.**
   
   Scope: 1 file, +21/-2 (test only). Reviewed for correctness, 
removed-behavior, cross-file, simplification, altitude, conventions.
   
   ### Correctness ✅ (no bugs found)
   - **Version is right:** the wait polls for snapshot version `2` 
(`2(_.*)?\.zip`), which is exactly what the reader needs — 
`snapshotStartBatchId = 1` ⇒ snapshot version `1 + 1 = 2`, and matches the 
original failure path (`.../state/0/1/2.zip`).
   - **Partitions are right:** it waits on partitions `1` and `4`, the same 
partitions the test reads below (`snapshotPartitionId` 1 and 4); `operatorId = 
0` ⇒ `state/0`. `SHUFFLE_PARTITIONS = 5` so those partitions exist.
   - **Null-safe:** `Option(partitionDir.listFiles())` handles the 
dir-not-yet-created case (returns `Seq.empty`, keeps polling) instead of NPE.
   - **Regex is anchored:** `matches` is whole-string, so `2(_.*)?\.zip` 
matches `2.zip` and `2_<uid>.zip` (v1 and checkpoint-v2 naming) but not 
`12.zip`/`20.zip` — no false positives across the suite subclasses.
   - **No lost coverage vs the removed `Thread.sleep`:** changelogs are 
uploaded synchronously at commit, so only the snapshot `.zip` needed the wait; 
the new wait covers exactly that, deterministically. Waiting immediately before 
`StopStream` also guarantees the file is present when state freezes (strictly 
better than the sleep).
   - Validated by running the test **10×** in one sbt session (all green, 
linked in the description).
   
   ### Minor (maintainability)
   - The snapshot version `2` is hardcoded to track the hardcoded 
`snapshotStartBatchId = 1`. If a future edit changes `snapshotStartBatchId`, 
this wait silently goes stale (times out / waits on the wrong file). Low risk 
since both are test-local constants a few lines apart, but a one-line comment 
tying them together wouldn't hurt.
   
   ### Optional follow-up (altitude, out of scope here)
   - A deeper alternative would be to make the `snapshotStartBatchId` reader 
tolerant of a not-yet-uploaded snapshot at the product level, rather than each 
test waiting. That's a behavior change beyond a test deflake; flagging only as 
a possible future direction.
   
   No blocking concerns.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to