Enrico Minack created SPARK-52505:
-------------------------------------

             Summary: Connection timeout on Kubernetes to decommissioned executor
                 Key: SPARK-52505
                 URL: https://issues.apache.org/jira/browse/SPARK-52505
             Project: Spark
          Issue Type: Sub-task
          Components: k8s, Kubernetes
    Affects Versions: 4.1.0
            Reporter: Enrico Minack


Running Spark on Kubernetes with storage decommissioning frequently leads to 
the following situation: a task is started that has to read from another 
executor. By the time the task tries to establish a connection, that other 
executor has been decommissioned, its data migrated, and the executor 
terminated. The connection attempt times out, and because the default 
connection timeout is 2 minutes, the task blocks for that amount of time.

Reducing the timeout helps only to a limited extent, because connections are 
created one at a time. Several such timeouts occur in sequence before the task 
hits a fetch failure. This delays task execution significantly and hurts the 
performance of the Spark job.
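
To make the arithmetic concrete, here is a rough Python sketch of how 
sequential connection timeouts add up. It is only a model, not Spark code: 
the function name and the count of three timeouts before the fetch failure 
are assumptions chosen for illustration. The 120 second value is the 
2 minute default from the description above.

```python
def blocked_seconds(timeouts_before_failure: int, connect_timeout_s: float) -> float:
    """Time a task spends waiting when connection attempts run one after another.

    Hypothetical model: each attempt to reach a terminated executor blocks for
    the full connect timeout, and the next attempt starts only after the
    previous one has timed out.
    """
    if timeouts_before_failure < 0 or connect_timeout_s < 0:
        raise ValueError("arguments must be non-negative")
    return timeouts_before_failure * connect_timeout_s


# Assumption for illustration: three timeouts occur before the fetch failure.
default_wait = blocked_seconds(3, 120.0)  # default 2 minute timeout
reduced_wait = blocked_seconds(3, 30.0)   # timeout reduced to 30 seconds

print(f"default timeout: task blocked for {default_wait:.0f} s")
print(f"reduced timeout: task blocked for {reduced_wait:.0f} s")
```

Under these assumed numbers, lowering the timeout shortens the wait but does 
not remove it, since the delay still grows linearly with the number of 
sequential timeouts.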


