[ https://issues.apache.org/jira/browse/SPARK-12617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15086008#comment-15086008 ]
Apache Spark commented on SPARK-12617: -------------------------------------- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/10621 > socket descriptor leak killing streaming app > -------------------------------------------- > > Key: SPARK-12617 > URL: https://issues.apache.org/jira/browse/SPARK-12617 > Project: Spark > Issue Type: Bug > Components: PySpark, Streaming > Affects Versions: 1.5.2 > Environment: pyspark (python 2.6) > Reporter: Antony Mayi > Assignee: Shixiong Zhu > Priority: Critical > Fix For: 1.5.3, 1.6.1, 2.0.0 > > Attachments: bug.py > > > There is a socket descriptor leakage in a pyspark streaming app when > configured with batch interval more then 30 seconds. This is due to default > timeout in py4j JavaGateway which (half-)closes CallbackConnection after 30 > seconds of inactivity and creates new one next time. That connection doesn't > get closed on the python CallbackServer side and keep piling up until it > eventually blocks new connections. > h2. Steps to reproduce: > * Submit attached [^bug.py] to spark > * Watch {{/tmp/bug.log}} to see the increasing total number of py4j callback > connections of which 0 will ever be closed > {code} > [BUG] py4j callback server port: 51282 > [BUG] py4j CB 0/0 closed > ... > [BUG] py4j CB 0/123 closed > {code} > * You can confirm the reality by using lsof on the pyspark driver process: > {code} > $ sudo lsof -p 39770 | grep CLOSE_WAIT | grep :51282 > python2.6 39770 das 94u IPv4 138824906 0t0 TCP > localhost.localdomain:51282->localhost.localdomain:60419 (CLOSE_WAIT) > python2.6 39770 das 95u IPv4 138867747 0t0 TCP > localhost.localdomain:51282->localhost.localdomain:60745 (CLOSE_WAIT) > python2.6 39770 das 96u IPv4 138831829 0t0 TCP > localhost.localdomain:51282->localhost.localdomain:32849 (CLOSE_WAIT) > python2.6 39770 das 97u IPv4 138890524 0t0 TCP > localhost.localdomain:51282->localhost.localdomain:33184 (CLOSE_WAIT) > python2.6 39770 das 98u IPv4 138860190 0t0 TCP > localhost.localdomain:51282->localhost.localdomain:33512 (CLOSE_WAIT) > python2.6 39770 das 99u IPv4 138860439 0t0 TCP > localhost.localdomain:51282->localhost.localdomain:33854 (CLOSE_WAIT) > ... > {code} > * If you leave it running for long enough the CallbackServer will eventually > become unable to accept new connections from the gateway and the app will > crash: > {code} > 16/01/02 05:12:07 ERROR scheduler.JobScheduler: Error generating jobs for > time 1451711400000 ms > py4j.Py4JException: Error while obtaining a new communication channel > ... > Caused by: java.net.ConnectException: Connection timed out > at java.net.PlainSocketImpl.socketConnect(Native Method) > at > java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) > at > java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) > at > java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) > at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) > at java.net.Socket.connect(Socket.java:589) > at java.net.Socket.connect(Socket.java:538) > at java.net.Socket.<init>(Socket.java:434) > at java.net.Socket.<init>(Socket.java:244) > at py4j.CallbackConnection.start(CallbackConnection.java:104) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org