shanyu zhao created SPARK-29003:
-----------------------------------
Summary: Spark history server startup hang due to deadlock
Key: SPARK-29003
URL: https://issues.apache.org/jira/browse/SPARK-29003
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.4.4
Reporter: shanyu zhao
Occasionally, when starting Spark History Server, the service process hangs
before binding to its port, so Spark History Server is not usable. The only
remedy is to kill the process and start it again. A simple bash script that
repeatedly stops and starts Spark History Server reproduces the problem
approximately 10% of the time.
The hang is caused by a deadlock involving
java.nio.file.FileSystems.getDefault().
This is what I collected with jstack:
{code:java}
"log-replay-executor-0" #17 daemon prio=5 os_prio=0 tid=0x00007fca90028800 nid=0x6e8 in Object.wait() [0x00007fcaa9471000]
   java.lang.Thread.State: RUNNABLE
        at java.nio.file.FileSystems.getDefault(FileSystems.java:176)
        ...
        at java.lang.Runtime.loadLibrary0(Runtime.java:870)
        - locked <0x00000000aaac1d40> (a java.lang.Runtime)
        ...
        at org.apache.spark.deploy.history.FsHistoryProvider.mergeApplicationListing(FsHistoryProvider.scala:698)

"main" #1 prio=5 os_prio=0 tid=0x00007fcad8016800 nid=0x6d8 waiting for monitor entry [0x00007fcae146c000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at java.lang.Runtime.loadLibrary0(Runtime.java:862)
        - waiting to lock <0x00000000aaac1d40> (a java.lang.Runtime)
        ...
        at java.nio.file.FileSystems.getDefault(FileSystems.java:176)
        at java.io.File.toPath(File.java:2234)
        - locked <0x000000008699bb68> (a java.io.File)
        ...
        at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:365)
{code}
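Basically, the "main" thread and the "log-replay-executor-0" thread call
java.nio.file.FileSystems.getDefault() at the same time and deadlock:
"log-replay-executor-0" holds the java.lang.Runtime monitor
(<0x00000000aaac1d40>) while waiting inside FileSystems.getDefault(), and
"main" is inside FileSystems.getDefault() waiting to lock that same monitor.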
This is similar to the reported JDK bug:
[https://bugs.openjdk.java.net/browse/JDK-8037567]
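As a complement to jstack, the monitor side of such a hang can also be listed from inside the JVM. The following is a minimal sketch using the standard java.lang.management API. The class name BlockedThreadReport, its methods, and the "demo-waiter" thread are hypothetical and exist only for illustration. Note that a thread parked in class initialization, like "log-replay-executor-0" above, is reported as RUNNABLE rather than BLOCKED, so a report like this would likely show only the "main" side of the deadlock.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.ArrayList;
import java.util.List;

public class BlockedThreadReport {
    // Returns one line per thread that is BLOCKED on a monitor,
    // naming the lock it waits for and the thread that owns it.
    static List<String> blockedThreads() {
        List<String> report = new ArrayList<>();
        ThreadInfo[] infos =
                ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo info : infos) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                report.add(info.getThreadName() + " is waiting for "
                        + info.getLockName() + " held by " + info.getLockOwnerName());
            }
        }
        return report;
    }

    // Demo only: blocks a hypothetical "demo-waiter" thread on a monitor
    // held by the calling thread, then collects the report.
    static List<String> demo() {
        final Object lock = new Object();
        Thread waiter = new Thread(() -> { synchronized (lock) { } }, "demo-waiter");
        waiter.setDaemon(true);
        List<String> report;
        try {
            synchronized (lock) {
                waiter.start();
                // Wait until the waiter is really blocked on our monitor.
                while (waiter.getState() != Thread.State.BLOCKED) {
                    Thread.sleep(10);
                }
                report = blockedThreads();
            }
            waiter.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return report;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```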
The problem is that during Spark History Server startup, there are two things
happening simultaneously that call into java.nio.file.FileSystems.getDefault():
1) starting the Jetty server
2) starting the ApplicationHistoryProvider (which reads files from HDFS)
We should do these two things sequentially instead of in parallel.
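The sequential ordering proposed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual Spark code or fix: SequentialStartup, startWebServer() and replayLogs() are hypothetical stand-ins for the Jetty startup and the history provider's log replay, and each simply touches FileSystems.getDefault(). The point is that the background thread is started only after the first step has returned on the calling thread, so the two first calls into FileSystems.getDefault() can no longer overlap.

```java
import java.nio.file.FileSystems;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SequentialStartup {
    // Records the order in which the startup steps ran.
    private static final List<String> EVENTS =
            Collections.synchronizedList(new ArrayList<>());

    // Hypothetical stand-in for starting the Jetty server.
    static void startWebServer() {
        FileSystems.getDefault();
        EVENTS.add("web-server");
    }

    // Hypothetical stand-in for the history provider's log replay.
    static void replayLogs() {
        FileSystems.getDefault();
        EVENTS.add("log-replay");
    }

    // Runs step 1 to completion on the calling thread, and only then
    // starts the background thread for step 2.
    static List<String> start() {
        EVENTS.clear();
        startWebServer();
        Thread replay = new Thread(SequentialStartup::replayLogs, "log-replay-executor-0");
        replay.setDaemon(true);
        replay.start();
        try {
            // Joined here only so the demo can report a deterministic order.
            replay.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ArrayList<>(EVENTS);
    }

    public static void main(String[] args) {
        System.out.println(start()); // prints [web-server, log-replay]
    }
}
```

In a real server the background thread would not be joined. It only needs to be started after the web server step has finished.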