[
https://issues.apache.org/jira/browse/SPARK-52507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-52507:
-----------------------------------
Labels: pull-request-available (was: )
> Quick fallback to fallback storage on fetch failure
> ---------------------------------------------------
>
> Key: SPARK-52507
> URL: https://issues.apache.org/jira/browse/SPARK-52507
> Project: Spark
> Issue Type: Sub-task
> Components: Kubernetes
> Affects Versions: 4.1.0
> Reporter: Enrico Minack
> Priority: Major
> Labels: pull-request-available
>
> Using the fallback storage with storage decommissioning on Kubernetes can run
> into the situation where some tasks try to read from an executor that has
> just been decommissioned. The driver already has the updated location of the
> migrated shuffle data, but the task still uses the outdated location.
> Given that the fallback storage is enabled and shuffle data is always
> migrated to the fallback storage only (SPARK-52506), it is very likely that a
> fetch failure can be recovered from the fallback storage. The task does not
> need to go through a fetch failure and a restart of the task or stage to get
> hold of the updated shuffle data location.
> This benefits from
> 1. connections to decommissioned executors failing quickly (connection
> refused rather than connection timeout), see SPARK-52505
> 2. storage migration only migrating to the fallback storage, see SPARK-52506
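
The idea described above can be sketched roughly as follows. This is only an
illustration of the "try the executor, then read from the fallback storage"
pattern, not the actual Spark change: BlockSource, fetchWithFallback and the
in-memory "fallback storage" map are hypothetical names made up for this
example and are not Spark APIs.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these names are Spark APIs.
class QuickFallbackSketch {

    /** Hypothetical source of shuffle blocks (an executor or the fallback storage). */
    interface BlockSource {
        byte[] fetch(String blockId) throws IOException;
    }

    /**
     * Try the (possibly outdated) executor location first. On an I/O failure,
     * read the block from the fallback storage instead of failing the task.
     */
    static byte[] fetchWithFallback(String blockId, BlockSource executor, BlockSource fallback)
            throws IOException {
        try {
            return executor.fetch(blockId);
        } catch (IOException primaryFailure) {
            try {
                return fallback.fetch(blockId);
            } catch (IOException fallbackFailure) {
                // Both reads failed: rethrow the original failure so that the
                // caller's regular fetch-failure handling can take over.
                primaryFailure.addSuppressed(fallbackFailure);
                throw primaryFailure;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the fallback storage holding migrated shuffle data.
        Map<String, byte[]> migrated = new HashMap<>();
        migrated.put("shuffle_0_1_2", new byte[] {1, 2, 3});

        // A decommissioned executor that refuses connections immediately.
        BlockSource decommissioned = id -> {
            throw new ConnectException("Connection refused");
        };
        BlockSource fallbackStorage = id -> {
            byte[] data = migrated.get(id);
            if (data == null) {
                throw new IOException("block not found: " + id);
            }
            return data;
        };

        byte[] block = fetchWithFallback("shuffle_0_1_2", decommissioned, fallbackStorage);
        System.out.println("recovered " + block.length + " bytes from fallback storage");
    }
}
```

In this sketch the fallback is cheap only when the first attempt fails fast,
which is why a refused connection (SPARK-52505) matters more than the fallback
read itself. If the fallback read fails as well, the original failure is
rethrown, so the usual fetch-failure path is still available.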