[ https://issues.apache.org/jira/browse/SPARK-6209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14351093#comment-14351093 ]
Apache Spark commented on SPARK-6209:
-------------------------------------
User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/4935
> ExecutorClassLoader can leak connections after failing to load classes from the REPL class server
> --------------------------------------------------------------------------------------------------
>
> Key: SPARK-6209
> URL: https://issues.apache.org/jira/browse/SPARK-6209
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.0.0, 1.0.3, 1.1.2, 1.2.1, 1.3.0, 1.4.0
> Reporter: Josh Rosen
> Assignee: Josh Rosen
> Priority: Critical
>
> ExecutorClassLoader does not ensure proper cleanup of network connections
> that it opens. If it fails to load a class, it may leak partially-consumed
> InputStreams that are connected to the REPL's HTTP class server, causing that
> server to exhaust its thread pool, which can cause the entire job to hang.
> Here is a simple reproduction:
> With
> {code}
> ./bin/spark-shell --master local-cluster[8,8,512]
> {code}
> run the following command:
> {code}
> sc.parallelize(1 to 1000, 1000).map { x =>
>   try {
>     Class.forName("some.class.that.does.not.Exist")
>   } catch {
>     case e: Exception => // do nothing
>   }
>   x
> }.count()
> {code}
> This job will run 253 tasks and then freeze completely, with no errors or
> failed tasks reported.
> It looks like the driver has 253 threads blocked in socketRead0() calls:
> {code}
> [joshrosen ~]$ jstack 16765 | grep socketRead0 | wc
> 253 759 14674
> {code}
> e.g.
> {code}
> "qtp1287429402-13" daemon prio=5 tid=0x00007f868a1c0000 nid=0x5b03 runnable
> [0x00000001159bd000]
> java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:152)
> at java.net.SocketInputStream.read(SocketInputStream.java:122)
> at org.eclipse.jetty.io.ByteArrayBuffer.readFrom(ByteArrayBuffer.java:391)
> at org.eclipse.jetty.io.bio.StreamEndPoint.fill(StreamEndPoint.java:141)
> at
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.fill(SocketConnector.java:227)
> at org.eclipse.jetty.http.HttpParser.fill(HttpParser.java:1044)
> at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:280)
> at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
> at
> org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
> at
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Running jstack on the executors shows blocking in loadClass / findClass: a
> single thread is RUNNABLE, waiting to hear back from the driver, while the
> other executor threads are BLOCKED on object monitor synchronization at
> Class.forName0().
> Remotely triggering a GC on a hanging executor allows the job to progress and
> complete more tasks before hanging again. If I repeatedly trigger GC on all
> of the executors, then the job runs to completion:
> {code}
> jps | grep CoarseGra | cut -d ' ' -f 1 | xargs -I {} -n 1 -P100 jcmd {} GC.run
> {code}
> The culprit is a {{catch}} block that ignores all exceptions and performs no
> cleanup:
> https://github.com/apache/spark/blob/v1.2.0/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala#L94
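> For illustration, the problematic pattern looks roughly like the following
> (a simplified sketch, not the verbatim Spark source; the stream-opening
> logic and readAndTransformClass are condensed here):
> {code}
> def findClassLocally(name: String): Option[Class[_]] = {
>   try {
>     val pathInDirectory = name.replace('.', '/') + ".class"
>     // Opens a connection to the REPL's HTTP class server:
>     val inputStream = new java.net.URL(classUri + "/" + pathInDirectory).openStream()
>     val bytes = readAndTransformClass(name, inputStream)
>     inputStream.close() // never reached if the read above throws
>     Some(defineClass(name, bytes, 0, bytes.length))
>   } catch {
>     // Swallows every exception and performs no cleanup, leaking the
>     // half-read stream (and the Jetty thread serving it) until GC
>     // happens to finalize the socket.
>     case e: Exception => None
>   }
> }
> {code}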
> This bug has been present since Spark 1.0.0, but I suspect that we haven't
> seen it before because it's pretty hard to reproduce. Triggering this error
> requires a job whose tasks trigger ClassNotFoundExceptions yet still run to
> completion. It also requires that executors leak enough open connections to
> exhaust the class server's Jetty thread pool limit, which in turn requires a
> large number of tasks (253+) and either a large number of executors or very
> little GC pressure on those executors (since GC causes the leaked
> connections to be closed).
> The fix here is pretty simple: add proper resource cleanup to this class.
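> A minimal sketch of that cleanup (one possible shape; the actual patch in
> the pull request above may differ) wraps the read in try/finally so the
> stream is closed on every path:
> {code}
> val inputStream = new java.net.URL(classUri + "/" + pathInDirectory).openStream()
> try {
>   val bytes = readAndTransformClass(name, inputStream)
>   Some(defineClass(name, bytes, 0, bytes.length))
> } catch {
>   case e: Exception => None
> } finally {
>   // Close on every path so the class server's Jetty thread is released
>   // immediately instead of waiting for finalization.
>   try inputStream.close() catch {
>     case e: java.io.IOException => // ignore failures while closing
>   }
> }
> {code}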