Enrico Minack created SPARK-52507:
-------------------------------------
Summary: Quick fallback to fallback storage on fetch failure
Key: SPARK-52507
URL: https://issues.apache.org/jira/browse/SPARK-52507
Project: Spark
Issue Type: Sub-task
Components: k8s, Kubernetes
Affects Versions: 4.1.0
Reporter: Enrico Minack
Using the fallback storage with storage decommissioning on Kubernetes can run
into the situation where some tasks try to read from an executor that has just
been decommissioned. The driver has updated location information of the
migrated shuffle data, but the task uses the outdated location.
Given that the fallback storage is enabled and shuffle data is always migrated
to the fallback storage only (SPARK-52506), it is very likely that a fetch
failure can be recovered from the fallback storage. The task then does not need
to fail with a fetch failure and wait for the task or stage to be restarted
just to get hold of the updated shuffle data location.
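
The intended read path can be illustrated with a minimal sketch. This is not
Spark code: fetch_block, FetchFailed and the two reader callables are
hypothetical names chosen for illustration, and Python is used only for
brevity. The sketch assumes that a decommissioned executor refuses the
connection and that the fallback storage raises FileNotFoundError for a block
it does not hold.

```python
from typing import Callable, Optional


class FetchFailed(Exception):
    """Hypothetical error: the block could not be read from any location."""


def fetch_block(
    block_id: str,
    fetch_from_executor: Callable[[str], bytes],
    fetch_from_fallback: Optional[Callable[[str], bytes]],
) -> bytes:
    """Read a shuffle block, trying the fallback storage before giving up."""
    try:
        # The location may be outdated: the executor could be gone already.
        return fetch_from_executor(block_id)
    except ConnectionError as executor_error:
        # ConnectionRefusedError is a subclass of ConnectionError.
        if fetch_from_fallback is None:
            # No fallback storage configured: surface the fetch failure.
            raise FetchFailed(block_id) from executor_error
        try:
            # Quick fallback: read the migrated block without a task restart.
            return fetch_from_fallback(block_id)
        except FileNotFoundError as fallback_error:
            # The block was not migrated either: only now report the failure.
            raise FetchFailed(block_id) from fallback_error


# Usage with stand-ins for a dead executor and an in-memory fallback storage.
fallback_storage = {"shuffle_0_1_2": b"migrated"}


def dead_executor(block_id: str) -> bytes:
    raise ConnectionRefusedError("executor decommissioned")


def read_fallback(block_id: str) -> bytes:
    if block_id not in fallback_storage:
        raise FileNotFoundError(block_id)
    return fallback_storage[block_id]


data = fetch_block("shuffle_0_1_2", dead_executor, read_fallback)
```

In this sketch a fetch failure is only raised after both the executor and the
fallback storage have been tried, which mirrors the idea described above.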
This requires:
1. connections to decommissioned executors to fail quickly (connection refused
rather than connection timeout), see SPARK-52505
2. storage migration to migrate to the fallback storage only, see SPARK-52506
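
Why the first requirement matters can be shown with a small, generic socket
sketch. It is not taken from Spark; probe is a hypothetical helper. A refused
connection is reported immediately, whereas an unanswered connection attempt
blocks for the full timeout before the caller can fall back.

```python
import socket


def probe(host: str, port: int, timeout_s: float = 5.0) -> str:
    """Classify a TCP connection attempt as connected, refused or timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "connected"
    except ConnectionRefusedError:
        # The peer host is reachable but nothing listens: fails right away.
        return "refused"
    except socket.timeout:
        # Nobody answered: the caller waited the full timeout_s seconds.
        return "timeout"
```

A reader that falls back only after "timeout" pays the full timeout for every
block, which is why a quick "refused" is the preferable failure mode here.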