nebi mert aydin created SPARK-44772:
---------------------------------------

             Summary: Reading blocks from remote executors causes timeout issue
                 Key: SPARK-44772
                 URL: https://issues.apache.org/jira/browse/SPARK-44772
             Project: Spark
          Issue Type: Bug
          Components: EC2, PySpark
    Affects Versions: 3.1.2
            Reporter: nebi mert aydin


I'm using EMR 6.5 with Spark 3.1.2

I'm shuffling 1.5 TiB of data across 3000 executors, each with 4 cores and 23 GiB of memory, using
`df.repartition(6000)`
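
For reference, a minimal PySpark sketch of the setup described above (the app name and input/output paths are placeholders, not the real ones):

```python
from pyspark.sql import SparkSession

# Hypothetical reproduction of the setup described above:
# 3000 executors, 4 cores and 23 GiB each, then a wide shuffle.
spark = (
    SparkSession.builder
    .appName("shuffle-repro")                      # placeholder app name
    .config("spark.executor.instances", "3000")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "23g")
    .getOrCreate()
)

df = spark.read.parquet("s3://bucket/input")       # placeholder input path
df = df.repartition(6000)                          # the shuffle that hits the failures
df.write.parquet("s3://bucket/output")             # placeholder output path to force execution
```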
I see lots of failures like:

```
2023-08-11 01:01:09,847 ERROR org.apache.spark.network.server.ChunkFetchRequestHandler (shuffle-server-4-95): Error sending result ChunkFetchSuccess[streamChunkId=StreamChunkId[streamId=779084003612,chunkIndex=324],buffer=FileSegmentManagedBuffer[file=/mnt1/yarn/usercache/zeppelin/appcache/application_1691438567823_0012/blockmgr-b2f9bea5-068c-45c8-b324-1f132c87de54/24/shuffle_5_115515_0.data,offset=680394,length=255]] to /172.31.20.110:36654; closing connection
```

I tried turning off these NIC offload features on the instances:

```
# Disable TCP segmentation offload and scatter-gather on the primary interface
sudo ethtool -K eth0 tso off
sudo ethtool -K eth0 sg off
```

That didn't help. My guess is that the external shuffle service is unable to send shuffle blocks to other executors for some reason.
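
For context, these are the shuffle fetch retry/timeout settings I understand govern this code path; the values below are just the Spark defaults, written out to make the relevant knobs explicit, not changes I'm recommending:

```python
from pyspark.sql import SparkSession

# Shuffle fetch / network settings involved in remote block fetches.
# Values shown are the stock defaults.
spark = (
    SparkSession.builder
    .config("spark.network.timeout", "120s")        # default network/fetch timeout
    .config("spark.shuffle.io.maxRetries", "3")     # retries per failed block fetch
    .config("spark.shuffle.io.retryWait", "5s")     # wait between fetch retries
    .getOrCreate()
)
```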