[ https://issues.apache.org/jira/browse/SPARK-29003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16924581#comment-16924581 ]
Jungtaek Lim edited comment on SPARK-29003 at 9/6/19 8:39 PM:
--------------------------------------------------------------
Thanks for providing the jstack output. This looks like a known JDK issue, but
since the JDK version that carries the fix is too recent for us to rely on, I
agree we may need to apply a workaround for this.
[https://bugs.openjdk.java.net/browse/JDK-8194653]
Discussion:
[http://mail.openjdk.java.net/pipermail/core-libs-dev/2018-January/050830.html]
was (Author: kabhwan):
Thanks for providing jstack. Looks like it's known JDK issue but given the
fixed version is too high I agree we may need to apply workaround on this.
[https://bugs.openjdk.java.net/browse/JDK-8194653]
> Spark history server startup hang due to deadlock
> -------------------------------------------------
>
> Key: SPARK-29003
> URL: https://issues.apache.org/jira/browse/SPARK-29003
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.4
> Reporter: shanyu zhao
> Priority: Major
> Attachments: sparkhistory-jstack.log
>
>
> Occasionally when starting Spark History Server, the service process hangs
> before binding to the port, so Spark History Server is not usable. One
> has to kill the process and start it again. With a simple bash script that
> stops and starts Spark History Server, you can reproduce this problem
> approximately 10% of the time.
> The problem is caused by a deadlock in java.nio.file.FileSystems.getDefault().
> This is what I collected with jstack:
> {code:java}
> "log-replay-executor-0" #17 daemon prio=5 os_prio=0 tid=0x00007fca90028800 nid=0x6e8 in Object.wait() [0x00007fcaa9471000]
>    java.lang.Thread.State: RUNNABLE
>         at java.nio.file.FileSystems.getDefault(FileSystems.java:176)
>         ...
>         at java.lang.Runtime.loadLibrary0(Runtime.java:870)
>         - locked <0x00000000aaac1d40> (a java.lang.Runtime)
>         ...
>         at org.apache.spark.deploy.history.FsHistoryProvider.mergeApplicationListing(FsHistoryProvider.scala:698)
>
> "main" #1 prio=5 os_prio=0 tid=0x00007fcad8016800 nid=0x6d8 waiting for monitor entry [0x00007fcae146c000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at java.lang.Runtime.loadLibrary0(Runtime.java:862)
>         - waiting to lock <0x00000000aaac1d40> (a java.lang.Runtime)
>         ...
>         at java.nio.file.FileSystems.getDefault(FileSystems.java:176)
>         at java.io.File.toPath(File.java:2234)
>         - locked <0x000000008699bb68> (a java.io.File)
>         ...
>         at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:365)
> {code}
> Basically, the "main" thread and the "log-replay-executor-0" thread called
> java.nio.file.FileSystems.getDefault() simultaneously and deadlocked.
> This is similar to the reported JDK bug:
> [https://bugs.openjdk.java.net/browse/JDK-8037567]
> The problem is that during Spark History Server startup, there are two things
> happening simultaneously that call into
> java.nio.file.FileSystems.getDefault():
> 1) start Jetty server
> 2) start ApplicationHistoryProvider (which reads files from HDFS)
> We should do these two things sequentially instead of in parallel.
>
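
The ordering suggested in the description could look roughly like the sketch
below. It is only an illustration, not the actual Spark change: the class
StartupOrderSketch and the methods startWebServer() and startHistoryProvider()
are hypothetical stand-ins for the Jetty startup and the history provider
startup. It assumes the hang only occurs when two threads trigger the first
call to FileSystems.getDefault() at the same time, so the main thread touches
it once before any worker thread is started.

```java
import java.nio.file.FileSystems;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class StartupOrderSketch {
    // Hypothetical stand-in for step 1 (web server startup).
    static String startWebServer() {
        return "web:" + FileSystems.getDefault().isOpen();
    }

    // Hypothetical stand-in for step 2 (history provider startup).
    static String startHistoryProvider() {
        return "provider:" + FileSystems.getDefault().isOpen();
    }

    static String[] startSequentially() throws Exception {
        // Initialize the default file system on the calling thread first,
        // before any other thread can race on its first-time initialization.
        FileSystems.getDefault();

        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // Step 1 finishes on the calling thread before step 2 is submitted.
            String web = startWebServer();
            Callable<String> providerTask = StartupOrderSketch::startHistoryProvider;
            Future<String> provider = pool.submit(providerTask);
            // Bounded wait so a hang surfaces as a TimeoutException.
            return new String[] { web, provider.get(30, TimeUnit.SECONDS) };
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        for (String line : startSequentially()) {
            System.out.println(line);
        }
    }
}
```

The bounded Future.get() is a design choice for the sketch: if the worker ever
blocks, the caller gets a TimeoutException instead of hanging silently before
the port is bound.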