[
https://issues.apache.org/jira/browse/SPARK-31373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yuchen Feng updated SPARK-31373:
--------------------------------
Environment: EMR cluster with r5.4xlarge and r5.8xlarge instances
> Cluster tried to fetch blocks from blacklisted node of previous stage
> ---------------------------------------------------------------------
>
> Key: SPARK-31373
> URL: https://issues.apache.org/jira/browse/SPARK-31373
> Project: Spark
> Issue Type: Question
> Components: Block Manager
> Affects Versions: 2.4.2
> Environment: EMR cluster with r5.4xlarge and r5.8xlarge instances
> Reporter: Yuchen Feng
> Priority: Major
>
> We enabled blacklisting in our Spark application, but recently we have seen a
> strange issue.
> Our code looks like this:
> rdd.repartition(...).mapPartitions(...).groupByKey(...).map().collect()
> In the mapPartitions stage, some executors hit the exception "Can't connect to
> host xxxxxx: Connection reset by peer" and their tasks failed, so all
> executors on that node were blacklisted, as was the node itself. These
> executors had completed some tasks before being blacklisted.
> Then, in the next stage (groupByKey(...).map()), the application failed with a
> fetch failure: an IndexOutOfBoundsException was thrown when a healthy executor
> tried to fetch a block from one of the blacklisted executors.
> This has happened multiple times.
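
For anyone trying to reproduce this, the pipeline described in the report could be sketched roughly as follows. This is only an illustrative outline, not the reporter's actual code: the input data, partition counts, and the bodies of the mapPartitions and map functions are placeholders, and the object name BlacklistPipelineSketch is hypothetical. The only setting taken from the report is that blacklisting is enabled.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver that mirrors the shape of the reported job.
object BlacklistPipelineSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("blacklist-pipeline-sketch")
      // The report states that blacklisting was enabled.
      .set("spark.blacklist.enabled", "true")
    // The master URL is expected to be supplied by spark-submit.
    val sc = new SparkContext(conf)

    // Placeholder input; the report does not describe the real data.
    val rdd = sc.parallelize(1 to 1000)

    val result = rdd
      .repartition(8)                                    // assumed partition count
      // First stage: tasks here write the shuffle blocks for groupByKey.
      .mapPartitions(iter => iter.map(x => (x % 10, x)))
      // Second stage: reducers fetch shuffle blocks written by the first stage,
      // including blocks that live on executors blacklisted in the meantime.
      .groupByKey(4)                                     // assumed partition count
      .map { case (key, values) => (key, values.sum) }
      .collect()

    result.foreach(println)
    sc.stop()
  }
}
```

The point of the sketch is the stage boundary at groupByKey: shuffle output written by executors during the mapPartitions stage is still fetched by the next stage, which is where the fetch failure in the report occurs.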
--
This message was sent by Atlassian Jira
(v8.3.4#803005)