Enrico Minack created SPARK-52505:
-------------------------------------
Summary: Connection timeout on Kubernetes to decommissioned executor
Key: SPARK-52505
URL: https://issues.apache.org/jira/browse/SPARK-52505
Project: Spark
Issue Type: Sub-task
Components: k8s, Kubernetes
Affects Versions: 4.1.0
Reporter: Enrico Minack
When running Spark on Kubernetes with storage decommissioning enabled, a task
that has to read from another executor frequently runs into the following
situation: by the time the task tries to establish a connection, that other
executor has already been decommissioned, its data migrated, and the executor
terminated. The connection attempt to the terminated executor then times out,
and the default timeout is 2 minutes. The task blocks for that entire time.
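For reference, a setup like the one described might be configured roughly as
follows. This is only a sketch: the report does not list the settings used, so
the selection of keys is an assumption, as is the mapping of the 2 minute
default to spark.network.timeout (whose documented default is 120s). The helper
to_submit_args is hypothetical and only renders the settings as spark-submit
flags.

```python
# Assumed settings for a job with storage decommissioning enabled.
# The report does not name the exact keys; these are standard Spark options.
DECOMMISSION_CONF = {
    "spark.decommission.enabled": "true",
    "spark.storage.decommission.enabled": "true",
    "spark.storage.decommission.shuffleBlocks.enabled": "true",
    "spark.storage.decommission.rddBlocks.enabled": "true",
    # Assumption: the 2 minute default mentioned in the report comes from here.
    "spark.network.timeout": "120s",
}


def to_submit_args(conf):
    """Hypothetical helper: turn a settings dict into spark-submit arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(DECOMMISSION_CONF)))
```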
Reducing the timeout helps only to a limited extent, because connections are
created one at a time: several such timeouts occur in sequence before the task
finally hits a fetch failure. This delays task execution significantly and
hurts the performance of the Spark job.
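The effect of sequential timeouts can be illustrated with a simple model,
assuming each failed connection attempt blocks for the full timeout and the
attempts do not overlap. The number of failed attempts (3) and the reduced
timeout (30 seconds) are hypothetical values chosen for illustration; the
report does not give concrete numbers.

```python
def blocked_seconds(failed_connections, timeout_seconds):
    """Time a task spends blocked when failed connection attempts run
    one after another and each waits for the full timeout."""
    return failed_connections * timeout_seconds


if __name__ == "__main__":
    # 120 s is the 2 minute default from the report; 30 s is a hypothetical
    # reduced timeout; 3 failed attempts is a hypothetical count.
    for timeout in (120, 30):
        print(f"timeout={timeout}s -> blocked {blocked_seconds(3, timeout)}s")
```

Even with the shorter timeout, the blocked time in this model still grows
linearly with the number of terminated executors the task tries to reach.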