[
https://issues.apache.org/jira/browse/SPARK-36255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chandni Singh updated SPARK-36255:
----------------------------------
Description:
Once the shuffle is cleaned up by the {{ContextCleaner}}, the shuffle files are
deleted by the executors. If shuffle data is still being pushed at that point,
the push can throw {{FileNotFoundException}}. When this exception is thrown
from the {{shuffle-block-push-thread}}, it causes the executor to fail, because
the default uncaught exception handler for Spark daemon threads terminates the
executor whenever a daemon thread has an uncaught exception.
{code:java}
21/06/17 16:03:57 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[block-push-thread-1,5,main]
java.lang.Error: java.io.IOException: Error in opening FileSegmentManagedBuffer{file=********/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data, offset=10640, length=190}
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1155)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Error in opening FileSegmentManagedBuffer{file=*******/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data, offset=10640, length=190}
    at org.apache.spark.network.buffer.FileSegmentManagedBuffer.nioByteBuffer(FileSegmentManagedBuffer.java:89)
    at org.apache.spark.shuffle.ShuffleWriter.sliceReqBufferIntoBlockBuffers(ShuffleWriter.scala:294)
    at org.apache.spark.shuffle.ShuffleWriter.org$apache$spark$shuffle$ShuffleWriter$$sendRequest(ShuffleWriter.scala:270)
    at org.apache.spark.shuffle.ShuffleWriter.org$apache$spark$shuffle$ShuffleWriter$$pushUpToMax(ShuffleWriter.scala:191)
    at org.apache.spark.shuffle.ShuffleWriter$$anon$2$$anon$4.run(ShuffleWriter.scala:244)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    ... 2 more
Caused by: java.io.FileNotFoundException: ******/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data (No such file or directory)
    at java.io.RandomAccessFile.open0(Native Method)
    at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
    at org.apache.spark.network.buffer.FileSegmentManagedBuffer.nioByteBuffer(FileSegmentManagedBuffer.java:62)
{code}
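The mechanism behind this trace can be reproduced outside Spark. The following is a minimal sketch, not Spark code: the class name, thread name, and per-thread handler are assumptions made for illustration (Spark installs its handler differently). It shows that an exception escaping a task started with {{execute()}} on a pool thread reaches that thread's uncaught exception handler.

```java
import java.io.FileNotFoundException;
import java.io.UncheckedIOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class; not part of Spark.
final class UncaughtPushFailureDemo {

    /** Runs a failing task on a daemon pool thread and returns what the handler saw. */
    static Throwable runFailingTask() {
        AtomicReference<Throwable> seen = new AtomicReference<>();
        CountDownLatch handled = new CountDownLatch(1);

        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "demo-block-push-thread");
            t.setDaemon(true);
            // Stand-in for the process-wide handler: it only records the error.
            // A handler that exits the JVM here would take the whole process down.
            t.setUncaughtExceptionHandler((thread, error) -> {
                seen.set(error);
                handled.countDown();
            });
            return t;
        });

        // execute() (unlike submit()) lets the exception escape the worker thread.
        pool.execute(() -> {
            throw new UncheckedIOException(
                new FileNotFoundException("shuffle file (No such file or directory)"));
        });

        try {
            handled.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return seen.get();
    }
}
```

Calling {{UncaughtPushFailureDemo.runFailingTask()}} returns the exception that the handler received, with the {{FileNotFoundException}} as its cause.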
We can address the issue by handling {{FileNotFoundException}} in the push
threads and netty threads, and stopping the push when it is encountered.
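One possible shape for this handling is sketched below. It is an illustration under assumptions, not the actual patch: {{PushLoopSketch}}, {{BlockPush}}, and {{pushAll}} are hypothetical names. Because the trace shows the {{FileNotFoundException}} wrapped in an {{IOException}}, the sketch walks the cause chain before deciding to stop.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

// Hypothetical sketch; names do not correspond to Spark classes.
final class PushLoopSketch {

    /** Stand-in for reading one block of shuffle data and pushing it. */
    interface BlockPush {
        void push() throws IOException;
    }

    /** Returns true if t or any of its causes is a FileNotFoundException. */
    static boolean causedByMissingFile(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof FileNotFoundException) {
                return true;
            }
        }
        return false;
    }

    /** Pushes blocks in order and returns how many were pushed before stopping. */
    static int pushAll(List<BlockPush> blocks) {
        int pushed = 0;
        for (BlockPush block : blocks) {
            try {
                block.push();
                pushed++;
            } catch (IOException e) {
                if (causedByMissingFile(e)) {
                    // The shuffle files were cleaned up: stop pushing quietly
                    // instead of letting the exception escape the thread.
                    break;
                }
                // Other I/O failures are still surfaced to the caller.
                throw new UncheckedIOException(e);
            }
        }
        return pushed;
    }
}
```

With this shape, a missing shuffle file ends the push for the remaining blocks without reaching the uncaught exception handler, while unrelated I/O errors are still reported.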
was:
When the shuffle files are cleaned up by the executors once a job in a Spark
application completes, the push of the shuffle data by the executor can throw
FileNotFound exception. When this exception is thrown from the
{{shuffle-block-push-thread}}, it causes the executor to fail. This is because
of the default uncaught exception handler for Spark daemon threads which
terminates the executor when there are uncaught exceptions for the daemon
threads.
{code:java}
21/06/17 16:03:57 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[block-push-thread-1,5,main]
java.lang.Error: java.io.IOException: Error in opening FileSegmentManagedBuffer{file=********/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data, offset=10640, length=190}
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1155)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Error in opening FileSegmentManagedBuffer{file=*******/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data, offset=10640, length=190}
    at org.apache.spark.network.buffer.FileSegmentManagedBuffer.nioByteBuffer(FileSegmentManagedBuffer.java:89)
    at org.apache.spark.shuffle.ShuffleWriter.sliceReqBufferIntoBlockBuffers(ShuffleWriter.scala:294)
    at org.apache.spark.shuffle.ShuffleWriter.org$apache$spark$shuffle$ShuffleWriter$$sendRequest(ShuffleWriter.scala:270)
    at org.apache.spark.shuffle.ShuffleWriter.org$apache$spark$shuffle$ShuffleWriter$$pushUpToMax(ShuffleWriter.scala:191)
    at org.apache.spark.shuffle.ShuffleWriter$$anon$2$$anon$4.run(ShuffleWriter.scala:244)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    ... 2 more
Caused by: java.io.FileNotFoundException: ******/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data (No such file or directory)
    at java.io.RandomAccessFile.open0(Native Method)
    at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
    at org.apache.spark.network.buffer.FileSegmentManagedBuffer.nioByteBuffer(FileSegmentManagedBuffer.java:62)
{code}
We can address the issue by handling "FileNotFound" exceptions in the push
threads and netty threads by stopping the push when {{FileNotFound}} is
encountered.
> FileNotFoundException from the shuffle push can cause the executor to
> terminate
> -------------------------------------------------------------------------------
>
> Key: SPARK-36255
> URL: https://issues.apache.org/jira/browse/SPARK-36255
> Project: Spark
> Issue Type: Sub-task
> Components: Shuffle
> Affects Versions: 3.1.0
> Reporter: Chandni Singh
> Priority: Major