Volodymyr Kot created SPARK-44389:
-------------------------------------
Summary: ExecutorDeadException when using decommissioning without
external shuffle service
Key: SPARK-44389
URL: https://issues.apache.org/jira/browse/SPARK-44389
Project: Spark
Issue Type: Question
Components: Spark Core
Affects Versions: 3.4.0
Reporter: Volodymyr Kot
Hey, we are trying to use executor decommissioning without an external shuffle
service. We are trying to understand:
# How often should we expect to see ExecutorDeadException? How is information
about changes to block locations propagated?
# Should the task be re-submitted if we hit this exception during decommissioning?
Current behavior that we observe:
# Executor 1 is decommissioned
# Driver successfully removes executor 1's block manager
[here|https://github.com/apache/spark/blob/87a5442f7ed96b11051d8a9333476d080054e5a0/core/src/main/scala/org/apache/spark/storage/BlockManagerMaster.scala#L44]
# A task is started on executor 2
# We hit `ExecutorDeadException` on executor 2 when trying to fetch blocks
from executor 1
[here|https://github.com/apache/spark/blob/87a5442f7ed96b11051d8a9333476d080054e5a0/core/src/main/scala/org/apache/spark/network/netty/NettyBlockTransferService.scala#L139-L140]
# Task on executor 2 fails
# Stage fails
# Stage is re-submitted and succeeds
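For context, a minimal configuration matching the setup above might look like this in spark-defaults.conf (an assumption on our side, using the Spark 3.1+ decommissioning keys; our exact cluster settings differ):

```
# Executor decommissioning without external shuffle service:
# migrate shuffle blocks between executors instead of relying on
# a node-local shuffle service.
spark.decommission.enabled                         true
spark.storage.decommission.enabled                 true
spark.storage.decommission.shuffleBlocks.enabled   true
spark.shuffle.service.enabled                      false
```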
As far as we understand, this happens because executor 2 has a stale [map status
cache|https://github.com/apache/spark/blob/87a5442f7ed96b11051d8a9333476d080054e5a0/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L1235-L1236]
Is that expected behavior? Shouldn't the task be retried in that case instead
of the whole stage failing and being retried? This makes Spark job execution
take longer, especially if there are many decommission events.
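To make the question concrete, here is a toy model (not Spark's actual DAGScheduler code; all names below are illustrative) of the retry semantics we are asking about: a fetch failure invalidates the parent stage's map outputs and resubmits the stage, while an ordinary task exception is retried at the task level:

```scala
// Toy model of failure handling, sketched from our reading of the
// scheduler behavior. Type and object names are our own assumptions.
sealed trait TaskFailure
case class FetchFailed(deadExecutorId: String) extends TaskFailure
case class ExceptionFailure(message: String) extends TaskFailure

sealed trait SchedulerAction
case object ResubmitStage extends SchedulerAction
case object RetryTask extends SchedulerAction

def handleFailure(failure: TaskFailure): SchedulerAction = failure match {
  // A fetch failure means the map outputs on that executor are gone,
  // so the parent (map) stage is rerun, not just the failed task.
  case FetchFailed(_)      => ResubmitStage
  // Ordinary task exceptions are retried on another executor,
  // up to spark.task.maxFailures.
  case ExceptionFailure(_) => RetryTask
}
```

In our scenario the ExecutorDeadException surfaces as a fetch failure, so it takes the ResubmitStage path; our question is whether the RetryTask path (after refreshing the stale map status cache) would be more appropriate.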
--
This message was sent by Atlassian Jira
(v8.20.10#820010)