Juliusz Sompolski created SPARK-56773:
-----------------------------------------
Summary: Add more knobs to INJECT_SHUFFLE_FETCH_FAILURES
Key: SPARK-56773
URL: https://issues.apache.org/jira/browse/SPARK-56773
Project: Spark
Issue Type: Test
Components: Spark Core
Affects Versions: 4.2.0
Reporter: Juliusz Sompolski
Spark's existing test-only `INJECT_SHUFFLE_FETCH_FAILURES` flag corrupts every
map task of stage attempt 0, but only **leaf** shuffle map stages (scans)
succeed on attempt 0. Non-leaf shuffle map stages fail-fetch from their
corrupted upstream on attempt 0, succeed on a later attempt, and are then never
themselves corrupted under the existing flag. That makes it impossible to write
unit tests for the full range of stage-retry shapes that production hits (e.g.
metric stability under "lost shuffle on recompute", rollback behaviour for
indeterminate shuffle producers, SLAM semantics across retries).
This proposes extending the DAGScheduler injection infrastructure with three
orthogonal knobs:
1. **`INJECT_SHUFFLE_FETCH_FAILURES`** (existing master switch, semantics
changed). Always corrupts the partition-0 task of the first SUCCESSFUL attempt
of every shuffle map stage. Non-leaf stages get hit too, so their first-success
outputs are invalidated and a downstream-induced re-run exercises the same
simulated non-determinism a real lost-shuffle on recompute would produce.
2. **`INJECT_SHUFFLE_FETCH_FAILURES_DOWNSTREAM_DELAY`** (int, default 1, new).
Defers the producer's mapper-0 corruption until N task-success events have been
observed in stages that directly consume the shuffle. Because the DAGScheduler
event loop processes completion events serially, this guarantees N consumer
tasks fully completed BEFORE the FetchFailed cascade kicks in - the realistic
"consumer made progress before lost shuffle on recompute" shape, rather than
the unrealistic "everything fails before anything succeeds" of corrupting at
producer registration time. Set to 0 to corrupt inline at registration.
3. **`INJECT_SHUFFLE_FORCE_CHECKSUM_MISMATCH_ON_RECOMPUTE`** (boolean, default
false, new). After a downstream FetchFailed forces the producer's partition 0
to be recomputed, register that recomputed MapStatus as if it had a checksum
mismatch with the original. This drives `rollbackSucceedingStages`, which
clears the downstream ShuffleMapStage's outputs and forces a full retry.
Mirrors the production path where a recomputed shuffle output has a different
checksum from a non-deterministic producer. ResultStage downstreams are aborted
(OSS Spark does not support rolling them back, SPARK-25342).
All three knobs are gated on `Utils.isTesting` and documented in `Tests.scala`.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]