[
https://issues.apache.org/jira/browse/SPARK-36255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chandni Singh updated SPARK-36255:
----------------------------------
Summary: FileNotFoundException from the shuffle push can cause the executor
to terminate (was: FileNotFound exceptions from the shuffle push can cause the
executor to terminate)
> FileNotFoundException from the shuffle push can cause the executor to
> terminate
> -------------------------------------------------------------------------------
>
> Key: SPARK-36255
> URL: https://issues.apache.org/jira/browse/SPARK-36255
> Project: Spark
> Issue Type: Sub-task
> Components: Shuffle
> Affects Versions: 3.1.0
> Reporter: Chandni Singh
> Priority: Major
>
> When the executors clean up the shuffle files once a job in a Spark
> application completes, the push of the shuffle data by the executor can throw
> a FileNotFoundException. When this exception is thrown from the
> {{shuffle-block-push-thread}}, it causes the executor to fail. This is
> because the default uncaught exception handler for Spark daemon threads
> terminates the executor whenever a daemon thread throws an uncaught
> exception.
> {code:java}
> 21/06/17 16:03:57 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[block-push-thread-1,5,main]
> java.lang.Error: java.io.IOException: Error in opening FileSegmentManagedBuffer{file=********/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data, offset=10640, length=190}
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1155)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Error in opening FileSegmentManagedBuffer{file=*******/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data, offset=10640, length=190}
>     at org.apache.spark.network.buffer.FileSegmentManagedBuffer.nioByteBuffer(FileSegmentManagedBuffer.java:89)
>     at org.apache.spark.shuffle.ShuffleWriter.sliceReqBufferIntoBlockBuffers(ShuffleWriter.scala:294)
>     at org.apache.spark.shuffle.ShuffleWriter.org$apache$spark$shuffle$ShuffleWriter$$sendRequest(ShuffleWriter.scala:270)
>     at org.apache.spark.shuffle.ShuffleWriter.org$apache$spark$shuffle$ShuffleWriter$$pushUpToMax(ShuffleWriter.scala:191)
>     at org.apache.spark.shuffle.ShuffleWriter$$anon$2$$anon$4.run(ShuffleWriter.scala:244)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     ... 2 more
> Caused by: java.io.FileNotFoundException: ******/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data (No such file or directory)
>     at java.io.RandomAccessFile.open0(Native Method)
>     at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
>     at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
>     at org.apache.spark.network.buffer.FileSegmentManagedBuffer.nioByteBuffer(FileSegmentManagedBuffer.java:62)
> {code}
> We can address the issue by handling {{FileNotFoundException}} in the push
> threads and netty threads: when the exception is encountered, the push is
> stopped instead of letting the exception go uncaught.
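
The idea in the last paragraph could be sketched roughly as follows. This is a minimal, standalone illustration and not Spark's actual code: the class `PushTaskSketch`, its methods `readSegment`, `pushBlock`, `causedByMissingFile`, and the `stopped` flag are all hypothetical names. It assumes, as the stack trace above suggests, that the FileNotFoundException arrives wrapped in another IOException, so the handler walks the cause chain before deciding to stop the push.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; none of these names come from Spark.
class PushTaskSketch {
    // Once set, no further blocks are pushed by this instance.
    private final AtomicBoolean stopped = new AtomicBoolean(false);

    boolean isStopped() {
        return stopped.get();
    }

    // Reads a file segment and wraps any failure in an IOException,
    // mimicking the "Error in opening ..." wrapping seen in the trace.
    static byte[] readSegment(Path file, long offset, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            byte[] buf = new byte[length];
            raf.seek(offset);
            raf.readFully(buf);
            return buf;
        } catch (IOException e) {
            throw new IOException("Error in opening " + file, e);
        }
    }

    // Returns true if a FileNotFoundException appears anywhere in the cause chain.
    static boolean causedByMissingFile(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof FileNotFoundException) {
                return true;
            }
        }
        return false;
    }

    // Returns true if the block was read (and would be pushed),
    // false if the push is stopped because the shuffle file is gone.
    boolean pushBlock(Path file, long offset, int length) {
        if (stopped.get()) {
            return false;
        }
        try {
            byte[] data = readSegment(file, offset, length);
            // A real implementation would send `data` to the remote service here.
            return data.length == length;
        } catch (IOException e) {
            if (causedByMissingFile(e)) {
                // The file was cleaned up: stop pushing instead of letting
                // the exception escape to the uncaught exception handler.
                stopped.set(true);
                return false;
            }
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("shuffle_sketch", ".data");
        Files.write(file, new byte[] {1, 2, 3, 4});
        PushTaskSketch task = new PushTaskSketch();
        System.out.println("push while file exists: " + task.pushBlock(file, 0, 4));
        Files.delete(file); // simulate the shuffle file cleanup
        System.out.println("push after cleanup: " + task.pushBlock(file, 0, 4));
        System.out.println("stopped: " + task.isStopped());
    }
}
```

Other I/O failures are rethrown unchanged in this sketch, so only the missing-file case is treated as a signal to stop the push quietly.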