[ https://issues.apache.org/jira/browse/SPARK-24346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16782739#comment-16782739 ]
Mohamed Mehdi BEN AISSA commented on SPARK-24346:
-------------------------------------------------
Many thanks, [~kien_truong]!
Speculation can also work around the issue, but as you said, we still have to
find its root cause.
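
For anyone who wants to try the speculation workaround, here is a minimal sketch that builds the matching `spark-submit` arguments. The helper name `speculation_conf_args` is hypothetical and only for illustration. The keys `spark.speculation`, `spark.speculation.multiplier` and `spark.speculation.quantile` are standard Spark settings, and the values shown for the last two are the documented defaults, not values tuned for this bug.

```python
def speculation_conf_args(multiplier=1.5, quantile=0.75):
    """Return spark-submit arguments that turn on speculative execution.

    Hypothetical helper for illustration. multiplier and quantile are the
    documented Spark defaults, not values tuned for SPARK-24346.
    """
    settings = {
        "spark.speculation": "true",
        "spark.speculation.multiplier": str(multiplier),
        "spark.speculation.quantile": str(quantile),
    }
    args = []
    for key, value in settings.items():
        # Each setting becomes one "--conf key=value" pair.
        args += ["--conf", f"{key}={value}"]
    return args


# Paste the printed string into a spark-submit command line.
print(" ".join(speculation_conf_args()))
```

This only hides the symptom: a task stuck on a fetch gets a second copy launched elsewhere, while the underlying fetch failure remains.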
> Executors are unable to fetch remote cache blocks
> -------------------------------------------------
>
> Key: SPARK-24346
> URL: https://issues.apache.org/jira/browse/SPARK-24346
> Project: Spark
> Issue Type: Bug
> Components: Shuffle, Spark Core
> Affects Versions: 2.3.0
> Environment: OS: Centos 7.3
> Cluster: Hortonwork HDP 2.6.5 with Spark 2.3.0
> Reporter: Truong Duc Kien
> Priority: Major
>
> After we upgraded from Spark 2.2.1 to Spark 2.3.0, our Spark jobs took a
> massive performance hit because executors became unable to fetch remote cache
> blocks from each other. The scenario is:
> 1. An executor creates a connection and sends a ChunkFetchRequest message to
> another executor.
> 2. This request arrives at the target executor, which sends back a
> ChunkFetchSuccess response.
> 3. The ChunkFetchSuccess message never arrives.
> 4. The connection between these two executors is killed by the originating
> executor after 120s of idleness. At the same time, the other executor reports
> that it failed to send the ChunkFetchSuccess because the pipe is closed.
> This process repeats itself 3 times, delaying our jobs by 6 minutes. The
> originating executor then stops fetching and computes the block itself, and
> the job can continue.
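
The 6-minute figure in the quoted report follows directly from the numbers it gives: three fetch attempts, each ending in a 120 s idle timeout. A minimal sketch of that arithmetic is below. The function name and parameters are hypothetical. The 120 s value happens to match the default of `spark.network.timeout`, but the report does not say which setting controls it, so treat that mapping as an assumption.

```python
def fetch_delay_seconds(idle_timeout_s=120, attempts=3):
    """Worst-case delay before the executor gives up on a remote block.

    Both defaults come from the quoted report: the connection is killed
    after 120 s of idleness, and the cycle repeats 3 times.
    """
    return idle_timeout_s * attempts


delay = fetch_delay_seconds()
print(f"{delay} s = {delay / 60:.0f} min")  # 360 s = 6 min
```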