[ 
https://issues.apache.org/jira/browse/SPARK-56773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-56773:
-----------------------------------
    Labels: pull-request-available  (was: )

> Add more knobs to INJECT_SHUFFLE_FETCH_FAILURES
> -----------------------------------------------
>
>                 Key: SPARK-56773
>                 URL: https://issues.apache.org/jira/browse/SPARK-56773
>             Project: Spark
>          Issue Type: Test
>          Components: Spark Core
>    Affects Versions: 4.2.0
>            Reporter: Juliusz Sompolski
>            Priority: Major
>              Labels: pull-request-available
>
> Spark's existing test-only `INJECT_SHUFFLE_FETCH_FAILURES` flag corrupts 
> every map task of stage attempt 0, but only **leaf** shuffle map stages 
> (scans) succeed on attempt 0. Non-leaf shuffle map stages fail-fetch from 
> their corrupted upstream on attempt 0, succeed on a later attempt, and are 
> then never themselves corrupted under the existing flag. That makes it 
> impossible to write unit tests for the full range of stage-retry shapes that 
> production hits (e.g. metric stability under "lost shuffle on recompute", 
> rollback behaviour for indeterminate shuffle producers, SLAM semantics across 
> retries).
> This proposes extending the DAGScheduler injection infrastructure with three 
> orthogonal knobs:
> 1. **`INJECT_SHUFFLE_FETCH_FAILURES`** (existing master switch, semantics 
> changed). Always corrupts the partition-0 task of the first SUCCESSFUL 
> attempt of every shuffle map stage. Non-leaf stages get hit too, so their 
> first-success outputs are invalidated and a downstream-induced re-run 
> exercises the same simulated non-determinism a real lost-shuffle on recompute 
> would produce.
> 2. **`INJECT_SHUFFLE_FETCH_FAILURES_DOWNSTREAM_DELAY`** (int, default 1, 
> new). Defers the producer's mapper-0 corruption until N task-success events 
> have been observed in stages that directly consume the shuffle. Because the 
> DAGScheduler event loop processes completion events serially, this guarantees 
> N consumer tasks fully completed BEFORE the FetchFailed cascade kicks in - 
> the realistic "consumer made progress before lost shuffle on recompute" 
> shape, rather than the unrealistic "everything fails before anything 
> succeeds" of corrupting at producer registration time. Set to 0 to corrupt 
> inline at registration.
> 3. **`INJECT_SHUFFLE_FORCE_CHECKSUM_MISMATCH_ON_RECOMPUTE`** (boolean, 
> default false, new). After a downstream FetchFailed forces the producer's 
> partition 0 to be recomputed, register that recomputed MapStatus as if it had 
> a checksum mismatch with the original. This drives 
> `rollbackSucceedingStages`, which clears the downstream ShuffleMapStage's 
> outputs and forces a full retry. Mirrors the production path where a 
> recomputed shuffle output has a different checksum from a non-deterministic 
> producer. ResultStage downstreams are aborted (OSS Spark does not support 
> rolling them back, SPARK-25342).
> All three knobs are gated on `Utils.isTesting` and documented in 
> `Tests.scala`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to