shanyu opened a new pull request #25705: SPARK-29003: Spark history server startup hang due to deadlock
URL: https://github.com/apache/spark/pull/25705

### What changes were proposed in this pull request?

This fixes JIRA: https://issues.apache.org/jira/browse/SPARK-29003

The problem is that two things happen concurrently during Spark History Server startup, and both call into java.nio.file.FileSystems.getDefault():

1) starting the Jetty server
2) starting the ApplicationHistoryProvider (which reads files from HDFS)

We should do these two things sequentially instead of in parallel. We introduce a start() method in ApplicationHistoryProvider (and its subclass FsHistoryProvider) and move initialization out of the constructor and into start(). In HistoryServer, we explicitly call provider.start() after calling bind(), which starts the Jetty server.

### Why are the changes needed?

Starting the Spark History Server occasionally hangs the process because of a deadlock among threads.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

I stress-tested this PR with a bash script that stopped and started the Spark History Server more than 1000 times, and it worked fine. Previously, I could complete fewer than 10 stop/start cycles before hitting the deadlock.
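The start-ordering change described under "What changes were proposed" can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the patch itself: HistoryProvider, FsProvider, Server, and runOnce are hypothetical, simplified stand-ins for Spark's Scala classes ApplicationHistoryProvider, FsHistoryProvider, and HistoryServer, and the recorded event strings exist only to make the ordering observable. The constructor does no work, bind() runs first on the calling thread, and only afterwards does start() launch the background scan, so in this sketch the two never reach FileSystems.getDefault() at the same time.

```java
import java.nio.file.FileSystems;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class StartOrderSketch {

    /** Hypothetical stand-in for ApplicationHistoryProvider. */
    abstract static class HistoryProvider {
        /** Initialization that previously ran in the constructor now runs here. */
        abstract void start();

        abstract void stop();
    }

    /** Hypothetical stand-in for FsHistoryProvider. */
    static final class FsProvider extends HistoryProvider {
        private final List<String> events;
        private Thread scanner;

        FsProvider(List<String> events) {
            // Constructor only stores state: no threads, no file-system access.
            this.events = events;
        }

        @Override
        void start() {
            // The background scan (reading event logs from HDFS in the real
            // provider) begins only when start() is called.
            scanner = new Thread(() -> {
                FileSystems.getDefault(); // second caller, strictly after bind()
                events.add("provider started");
            });
            scanner.start();
        }

        @Override
        void stop() {
            if (scanner == null) {
                return;
            }
            try {
                scanner.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
            }
        }
    }

    /** Hypothetical stand-in for HistoryServer. */
    static final class Server {
        private final HistoryProvider provider;
        private final List<String> events;

        Server(HistoryProvider provider, List<String> events) {
            this.provider = provider;
            this.events = events;
        }

        void bind() {
            // Stands in for starting Jetty, which (per the PR) also reaches
            // FileSystems.getDefault().
            FileSystems.getDefault();
            events.add("server bound");
        }

        void initialize() {
            bind();           // 1) web server first, on the calling thread
            provider.start(); // 2) provider second, only after bind() returns
        }
    }

    /** Runs one startup and returns the recorded order of events. */
    static List<String> runOnce() {
        List<String> events = new CopyOnWriteArrayList<>();
        FsProvider provider = new FsProvider(events);
        new Server(provider, events).initialize();
        provider.stop(); // wait for the background scan to finish
        return events;
    }

    public static void main(String[] args) {
        runOnce().forEach(System.out::println);
    }
}
```

Because provider.start() is invoked only after bind() returns, and Thread.start() happens-before every action in the started thread, the event list always reads "server bound" followed by "provider started".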
