nebi mert aydin created SPARK-44772:
---------------------------------------
Summary: Reading blocks from remote executors causes timeout issue
Key: SPARK-44772
URL: https://issues.apache.org/jira/browse/SPARK-44772
Project: Spark
Issue Type: Bug
Components: EC2, PySpark
Affects Versions: 3.1.2
Reporter: nebi mert aydin

I'm using EMR 6.5 with Spark 3.1.2. I'm shuffling 1.5 TiB of data with 3000 executors, each with 4 cores and 23 GiB of executor memory, using `df.repartition(6000)`. I see lots of failures like:

```
2023-08-11 01:01:09,847 ERROR org.apache.spark.network.server.ChunkFetchRequestHandler (shuffle-server-4-95): Error sending result ChunkFetchSuccess[streamChunkId=StreamChunkId[streamId=779084003612,chunkIndex=324],buffer=FileSegmentManagedBuffer[file=/mnt1/yarn/usercache/zeppelin/appcache/application_1691438567823_0012/blockmgr-b2f9bea5-068c-45c8-b324-1f132c87de54/24/shuffle_5_115515_0.data,offset=680394,length=255]] to /172.31.20.110:36654; closing connection
```

I tried disabling TCP segmentation offload and scatter-gather on the NIC:

```
sudo ethtool -K eth0 tso off
sudo ethtool -K eth0 sg off
```

That didn't work. I suspect the external shuffle service is unable to send data to the other executors for some reason.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
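For shuffle fetch failures like this, a commonly tried direction is raising Spark's network and shuffle-retry settings rather than tuning the NIC. The sketch below is illustrative only: the values are assumptions, not settings confirmed to resolve this issue, and `your_job.py` is a placeholder.

```shell
# Illustrative spark-submit flags for shuffle fetch timeouts.
# Values are assumptions, not verified fixes for SPARK-44772.

spark-submit \
  --conf spark.network.timeout=600s \
  --conf spark.shuffle.io.maxRetries=10 \
  --conf spark.shuffle.io.retryWait=30s \
  --conf spark.reducer.maxReqsInFlight=256 \
  your_job.py
```

`spark.network.timeout` (default 120s) is the idle timeout for network interactions; `spark.shuffle.io.maxRetries` (default 3) and `spark.shuffle.io.retryWait` (default 5s) control how long a fetch is retried before the stage fails; `spark.reducer.maxReqsInFlight` caps concurrent fetch requests per reducer, which can reduce pressure on busy shuffle servers.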