Yuchen Feng created SPARK-31373:
-----------------------------------

             Summary: Cluster tried to fetch blocks from blacklisted node of 
previous stage
                 Key: SPARK-31373
                 URL: https://issues.apache.org/jira/browse/SPARK-31373
             Project: Spark
          Issue Type: Bug
          Components: Block Manager
    Affects Versions: 2.4.2
            Reporter: Yuchen Feng


We enabled blacklisting in our Spark application, but recently we have seen a 
strange issue.

Our code looks like this:
 rdd.repartition(...).mapPartitions(...).groupByKey(...).map().collect()

In the mapPartitions stage, some executors hit the exception "Can't connect 
to host xxxxxx: Connection reset by peer" and the tasks on them failed, so all 
executors on that node were blacklisted, as was the node itself. These 
executors had completed some tasks before they were blacklisted.

Then, in the next stage (groupByKey(...).map()), the application failed with a 
fetch failure: an IndexOutOfBoundsException was thrown when a healthy executor 
tried to fetch a block from one of the blacklisted executors above.

It happened multiple times.
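
For reference, the shape of the job described above can be sketched roughly 
as follows. This is only an illustration of the pipeline and of enabling the 
blacklist. It is not our actual code and is not expected to reproduce the 
failure on its own, since that also requires a node to start failing during 
the mapPartitions stage. The object name, input data, partition counts and 
the functions passed to mapPartitions and map are all hypothetical; only 
spark.blacklist.enabled and the RDD operators come from the report.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch of the job shape in this report (Spark 2.4.x RDD API).
object BlacklistFetchSketch {
  def main(args: Array[String]): Unit = {
    // The master URL is assumed to be supplied by spark-submit.
    val conf = new SparkConf()
      .setAppName("blacklist-fetch-sketch")
      .set("spark.blacklist.enabled", "true") // blacklisting enabled, as in the report

    val sc = new SparkContext(conf)
    try {
      // Placeholder input; the real job reads its own data.
      val rdd = sc.parallelize(1 to 1000)

      val result = rdd
        .repartition(8)                                    // shuffle before the map stage
        .mapPartitions(iter => iter.map(x => (x % 10, x))) // stage where executors failed
        .groupByKey(4)                                     // next stage fetches shuffle blocks
        .map { case (key, values) => (key, values.size) }
        .collect()

      println(result.sortBy(_._1).mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

The point of the sketch is that the groupByKey stage has to fetch shuffle 
blocks written by the mapPartitions stage, including blocks written by 
executors that were blacklisted after completing some tasks.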



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
