Enrico Minack created SPARK-52507:
-------------------------------------

             Summary: Quick fallback to fallback storage on fetch failure
                 Key: SPARK-52507
                 URL: https://issues.apache.org/jira/browse/SPARK-52507
             Project: Spark
          Issue Type: Sub-task
          Components: k8s, Kubernetes
    Affects Versions: 4.1.0
            Reporter: Enrico Minack


Using the fallback storage with storage decommissioning on Kubernetes can run 
into a situation where some tasks try to read from an executor that has just 
been decommissioned. The driver already has the updated location of the 
migrated shuffle data, but the task still uses the outdated location.
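
For context, the setup described above can be sketched with the existing
decommissioning settings. This is a minimal illustration, not part of the
issue: the bucket path is a placeholder and the helper name to_submit_args
is hypothetical. Any settings introduced by the related sub-tasks are not
shown.

```python
# Existing Spark settings that enable decommissioning with a fallback storage.
# The path is a placeholder chosen for illustration only.
FALLBACK_STORAGE_CONF = {
    "spark.decommission.enabled": "true",
    "spark.storage.decommission.enabled": "true",
    "spark.storage.decommission.shuffleBlocks.enabled": "true",
    "spark.storage.decommission.fallbackStorage.path":
        "s3a://example-bucket/spark-fallback/",
}


def to_submit_args(conf):
    """Render a config dict as a list of spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(FALLBACK_STORAGE_CONF)))
```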

Given that the fallback storage is enabled and shuffle data is always migrated 
to the fallback storage only (SPARK-52506), it is very likely that a fetch 
failure can be recovered from the fallback storage. The task then does not 
need to raise a fetch failure and wait for the task or stage to be restarted 
just to get hold of the updated shuffle data location.
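
The idea can be sketched as follows. This is a simplified model, not Spark's
actual implementation: FetchFailed, fetch_shuffle_block and both callables are
hypothetical names, and a plain dict stands in for the fallback storage.

```python
class FetchFailed(Exception):
    """Stands in for the fetch failure that triggers a task or stage retry."""


def fetch_shuffle_block(block_id, fetch_from_executor, read_from_fallback):
    """Try the (possibly outdated) executor location, then the fallback storage.

    Assumptions: fetch_from_executor raises ConnectionRefusedError when the
    executor is gone, and read_from_fallback raises FileNotFoundError when
    the block was never migrated.
    """
    try:
        return fetch_from_executor(block_id)
    except ConnectionRefusedError as refused:
        try:
            # Quick fallback: read the migrated block directly instead of
            # failing the task to learn the new location from the driver.
            return read_from_fallback(block_id)
        except FileNotFoundError:
            raise FetchFailed(
                f"{block_id} not on executor or in fallback storage"
            ) from refused


# Toy stand-ins for a decommissioned executor and the fallback storage.
FALLBACK = {"shuffle_0_1_2": b"migrated bytes"}


def decommissioned_executor(block_id):
    raise ConnectionRefusedError("executor has been decommissioned")


def fallback_reader(block_id):
    try:
        return FALLBACK[block_id]
    except KeyError:
        raise FileNotFoundError(block_id)


if __name__ == "__main__":
    print(fetch_shuffle_block("shuffle_0_1_2", decommissioned_executor, fallback_reader))
```

Only when the block is missing from the fallback storage as well does the
sketch fall back to the regular fetch-failure path.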

This requires

1. connections to decommissioned executors to fail quickly (connection refused 
rather than connection timeout), see SPARK-52505
2. storage migration only migrates to the fallback storage, see SPARK-52506
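
The difference between the two failure modes in requirement 1 can be
illustrated with plain sockets. This is a generic sketch unrelated to Spark's
network layer; the function name probe and the one-second timeout are
assumptions made for the example.

```python
import socket


def probe(host, port, timeout_s=1.0):
    """Classify a TCP connection attempt as connected, refused or timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "connected"
    except ConnectionRefusedError:
        # Nothing listens on the port: the peer rejects the attempt at once,
        # so the caller can move on to the fallback storage immediately.
        return "refused"
    except (socket.timeout, TimeoutError):
        # No answer at all: the caller waits for the full timeout first.
        return "timeout"


if __name__ == "__main__":
    # Reserve a free local port, then close it so nothing is listening.
    s = socket.socket()
    s.bind(("127.0.0.1", 0))
    free_port = s.getsockname()[1]
    s.close()
    print(probe("127.0.0.1", free_port))
```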



