Juliusz Sompolski created SPARK-56773:
-----------------------------------------

             Summary: Add more knobs to INJECT_SHUFFLE_FETCH_FAILURES
                 Key: SPARK-56773
                 URL: https://issues.apache.org/jira/browse/SPARK-56773
             Project: Spark
          Issue Type: Test
          Components: Spark Core
    Affects Versions: 4.2.0
            Reporter: Juliusz Sompolski


Spark's existing test-only `INJECT_SHUFFLE_FETCH_FAILURES` flag corrupts every
map task of stage attempt 0, but only **leaf** shuffle map stages (scans)
succeed on attempt 0. Non-leaf shuffle map stages hit a fetch failure against
their corrupted upstream on attempt 0, succeed on a later attempt, and are then
never corrupted themselves under the existing flag. That makes it impossible to
write unit tests for the full range of stage-retry shapes that production hits
(e.g. metric stability under "lost shuffle on recompute", rollback behaviour
for indeterminate shuffle producers, SLAM semantics across retries).

This issue proposes extending the DAGScheduler injection infrastructure with
three orthogonal knobs:

1. **`INJECT_SHUFFLE_FETCH_FAILURES`** (existing master switch, semantics 
changed). Always corrupts the partition-0 task of the first SUCCESSFUL attempt 
of every shuffle map stage. Non-leaf stages get hit too, so their first-success 
outputs are invalidated and a downstream-induced re-run exercises the same 
simulated non-determinism a real lost-shuffle on recompute would produce.

2. **`INJECT_SHUFFLE_FETCH_FAILURES_DOWNSTREAM_DELAY`** (int, default 1, new).
Defers the producer's mapper-0 corruption until N task-success events have been
observed in stages that directly consume the shuffle, where N is the configured
value. Because the DAGScheduler event loop processes completion events
serially, this guarantees that N consumer tasks have fully completed BEFORE the
FetchFailed cascade kicks in. This is the realistic "consumer made progress
before lost shuffle on recompute" shape, rather than the unrealistic
"everything fails before anything succeeds" shape produced by corrupting at
producer registration time. Set to 0 to corrupt inline at registration.

3. **`INJECT_SHUFFLE_FORCE_CHECKSUM_MISMATCH_ON_RECOMPUTE`** (boolean, default
false, new). After a downstream FetchFailed forces the producer's partition 0
to be recomputed, registers that recomputed MapStatus as if it had a checksum
mismatch with the original. This drives `rollbackSucceedingStages`, which
clears the downstream ShuffleMapStage's outputs and forces a full retry. It
mirrors the production path where a non-deterministic producer's recomputed
shuffle output has a different checksum from the original. ResultStage
downstreams are aborted (OSS Spark does not support rolling them back,
SPARK-25342).
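
The deferral described for knob 2 can be pictured with a small toy model. This is only an illustrative sketch in Python, not the DAGScheduler code: the class `DelayedCorruption` and its method names are hypothetical. It assumes that events are handled one at a time and that corruption triggers once the configured number of consumer task successes has been observed.

```python
class DelayedCorruption:
    """Toy model (hypothetical) of the downstream-delay knob for one shuffle."""

    def __init__(self, delay=1):
        self.delay = delay              # N: consumer successes to wait for
        self.consumer_successes = 0     # successes observed so far
        self.corrupted = False          # has mapper 0's output been corrupted?

    def on_producer_registered(self):
        # With delay == 0 the corruption happens inline at registration.
        if self.delay == 0:
            self.corrupted = True

    def on_consumer_task_success(self):
        # Events are assumed to arrive serially, so a plain counter is enough.
        if self.corrupted:
            return
        self.consumer_successes += 1
        if self.consumer_successes >= self.delay:
            self.corrupted = True


# Usage: with delay=2, two consumer tasks complete before corruption triggers.
model = DelayedCorruption(delay=2)
model.on_producer_registered()
model.on_consumer_task_success()
corrupted_after_one = model.corrupted
model.on_consumer_task_success()
corrupted_after_two = model.corrupted
```

In this model the first consumer success leaves the shuffle intact and the second one triggers the corruption, which is the "consumer made progress first" shape the knob is meant to produce.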

All three knobs are gated on `Utils.isTesting` and documented in `Tests.scala`.
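
The effect of the third knob on downstream stages, as described above, can be summarised in a toy decision function. This is a hedged sketch in Python rather than Spark code: `rollback_plan` and the action strings are hypothetical names, and the function only encodes the outcomes stated in this issue (ShuffleMapStage downstreams are cleared and retried, ResultStage downstreams are aborted).

```python
def rollback_plan(force_checksum_mismatch, downstream_stages):
    """Toy model (hypothetical): map each downstream stage to an action.

    downstream_stages maps a stage name to its kind, either
    "ShuffleMapStage" or "ResultStage".
    """
    plan = {}
    for name, kind in downstream_stages.items():
        if not force_checksum_mismatch:
            # No forced mismatch: the recomputed output is accepted as-is.
            plan[name] = "keep"
        elif kind == "ShuffleMapStage":
            # Rollback clears the stage's outputs and forces a full retry.
            plan[name] = "clear_outputs_and_retry"
        elif kind == "ResultStage":
            # Result stages cannot be rolled back, so they are aborted.
            plan[name] = "abort"
        else:
            raise ValueError(f"unknown stage kind: {kind}")
    return plan


stages = {"join": "ShuffleMapStage", "collect": "ResultStage"}
with_mismatch = rollback_plan(True, stages)
without_mismatch = rollback_plan(False, stages)
```

With the knob enabled, the sketch marks the `join` stage for a full retry and the `collect` stage for abort; with it disabled, both stages are left alone.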



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
