[ 
https://issues.apache.org/jira/browse/SPARK-31373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuchen Feng updated SPARK-31373:
--------------------------------
    Description: 
We enabled blacklisting on our Spark application, but recently we have seen a 
weird issue.

Our code looks like
  {{rdd.repartition(...).mapPartitions(...).groupByKey(...).map().collect()}}
In the mapPartitions stage, some executors hit the exception "Can't connect to 
host xxxxxx: Connection reset by peer" and the tasks on them failed, so all 
executors on that node were blacklisted, as well as the node itself. These 
executors had completed some tasks before being blacklisted.
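
For reference, a minimal sketch of how blacklisting might be switched on when submitting such a job. The keys are blacklist settings documented for Spark 2.4; the values and the helper to_submit_args are illustrative assumptions, not the configuration actually used on this cluster.

```python
# Illustrative values only; the real settings of this application are not given.
blacklist_conf = {
    "spark.blacklist.enabled": "true",
    "spark.blacklist.timeout": "1h",
    "spark.blacklist.killBlacklistedExecutors": "false",
    "spark.blacklist.application.fetchFailure.enabled": "false",
}


def to_submit_args(conf):
    """Turn a dict of Spark settings into spark-submit --conf arguments.

    Hypothetical helper for this sketch; keys are emitted in sorted order.
    """
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_submit_args(blacklist_conf)
print(" ".join(submit_args))
```

The resulting list can be appended to a spark-submit command line. Which of these settings matter for the failure described here is an open question.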

Then in the next stage (groupByKey(...).map()), the application failed with a 
block fetch failure (IndexOutOfBoundsException) when a healthy executor tried 
to fetch a block from one of the blacklisted executors above.

It happened multiple times.
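
One way to frame the question, as a deliberately simplified model and not Spark's actual implementation: as far as we understand, blacklisting only stops new tasks from being scheduled on an executor, while shuffle output that the executor already wrote stays registered as a fetch location. All names below (ToyScheduler, exec-1, exec-2) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Toy stand-in for a scheduler plus map-output registry (not Spark code)."""

    blacklisted: set = field(default_factory=set)
    map_outputs: dict = field(default_factory=dict)  # map partition -> executor id

    def can_schedule_on(self, executor):
        # Blacklisted executors receive no new tasks.
        return executor not in self.blacklisted

    def finish_map_task(self, partition, executor):
        # Record where the shuffle output of a finished map task lives.
        self.map_outputs[partition] = executor

    def blacklist(self, executor):
        # Assumption of this model: only future scheduling is affected;
        # outputs that were already registered are left in place.
        self.blacklisted.add(executor)

    def fetch_locations(self):
        # Locations a reducer in the next stage would be told to fetch from.
        return dict(self.map_outputs)


sched = ToyScheduler()
sched.finish_map_task(0, "exec-1")  # completed before exec-1 was blacklisted
sched.finish_map_task(1, "exec-2")
sched.blacklist("exec-1")
locations = sched.fetch_locations()
print(locations)
```

If this model is right, a fetch from a blacklisted executor is expected in itself, and the open question is why that fetch fails with an IndexOutOfBoundsException.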

  was:
We enabled blacklisting on our Spark application, but recently we have seen a 
weird issue.

Our code looks like
  rdd.repartition(...).mapPartitions(...).groupByKey(...).map().collect()
In the mapPartitions stage, some executors hit the exception "Can't connect to 
host xxxxxx: Connection reset by peer" and the tasks on them failed, so all 
executors on that node were blacklisted, as well as the node itself. These 
executors had completed some tasks before being blacklisted.

Then in the next stage (groupByKey(...).map()), the application failed with a 
block fetch failure (IndexOutOfBoundsException) when a healthy executor tried 
to fetch a block from one of the blacklisted executors above.

It happened multiple times.


> Cluster tried to fetch blocks from blacklisted node of previous stage
> ---------------------------------------------------------------------
>
>                 Key: SPARK-31373
>                 URL: https://issues.apache.org/jira/browse/SPARK-31373
>             Project: Spark
>          Issue Type: Question
>          Components: Block Manager
>    Affects Versions: 2.4.2
>         Environment: EMR cluster with r5.4xlarge and r5.8xlarge instances
>            Reporter: Yuchen Feng
>            Priority: Major
>
> We enabled blacklisting on our Spark application, but recently we have seen a 
> weird issue.
> Our code looks like
>   {{rdd.repartition(...).mapPartitions(...).groupByKey(...).map().collect()}}
> In the mapPartitions stage, some executors hit the exception "Can't connect to 
> host xxxxxx: Connection reset by peer" and the tasks on them failed, so all 
> executors on that node were blacklisted, as well as the node itself. These 
> executors had completed some tasks before being blacklisted.
> Then in the next stage (groupByKey(...).map()), the application failed with a 
> block fetch failure (IndexOutOfBoundsException) when a healthy executor tried 
> to fetch a block from one of the blacklisted executors above.
> It happened multiple times.


