[
https://issues.apache.org/jira/browse/SPARK-12617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15086008#comment-15086008
]
Apache Spark commented on SPARK-12617:
--------------------------------------
User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/10621
> socket descriptor leak killing streaming app
> --------------------------------------------
>
> Key: SPARK-12617
> URL: https://issues.apache.org/jira/browse/SPARK-12617
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Streaming
> Affects Versions: 1.5.2
> Environment: pyspark (python 2.6)
> Reporter: Antony Mayi
> Assignee: Shixiong Zhu
> Priority: Critical
> Fix For: 1.5.3, 1.6.1, 2.0.0
>
> Attachments: bug.py
>
>
> There is a socket descriptor leakage in a pyspark streaming app when
> configured with batch interval more then 30 seconds. This is due to default
> timeout in py4j JavaGateway which (half-)closes CallbackConnection after 30
> seconds of inactivity and creates new one next time. That connection doesn't
> get closed on the python CallbackServer side and keep piling up until it
> eventually blocks new connections.
> h2. Steps to reproduce:
> * Submit attached [^bug.py] to spark
> * Watch {{/tmp/bug.log}} to see the increasing total number of py4j callback
> connections of which 0 will ever be closed
> {code}
> [BUG] py4j callback server port: 51282
> [BUG] py4j CB 0/0 closed
> ...
> [BUG] py4j CB 0/123 closed
> {code}
> * You can confirm the reality by using lsof on the pyspark driver process:
> {code}
> $ sudo lsof -p 39770 | grep CLOSE_WAIT | grep :51282
> python2.6 39770 das 94u IPv4 138824906 0t0 TCP
> localhost.localdomain:51282->localhost.localdomain:60419 (CLOSE_WAIT)
> python2.6 39770 das 95u IPv4 138867747 0t0 TCP
> localhost.localdomain:51282->localhost.localdomain:60745 (CLOSE_WAIT)
> python2.6 39770 das 96u IPv4 138831829 0t0 TCP
> localhost.localdomain:51282->localhost.localdomain:32849 (CLOSE_WAIT)
> python2.6 39770 das 97u IPv4 138890524 0t0 TCP
> localhost.localdomain:51282->localhost.localdomain:33184 (CLOSE_WAIT)
> python2.6 39770 das 98u IPv4 138860190 0t0 TCP
> localhost.localdomain:51282->localhost.localdomain:33512 (CLOSE_WAIT)
> python2.6 39770 das 99u IPv4 138860439 0t0 TCP
> localhost.localdomain:51282->localhost.localdomain:33854 (CLOSE_WAIT)
> ...
> {code}
> * If you leave it running for long enough the CallbackServer will eventually
> become unable to accept new connections from the gateway and the app will
> crash:
> {code}
> 16/01/02 05:12:07 ERROR scheduler.JobScheduler: Error generating jobs for
> time 1451711400000 ms
> py4j.Py4JException: Error while obtaining a new communication channel
> ...
> Caused by: java.net.ConnectException: Connection timed out
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
> at
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
> at
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at java.net.Socket.connect(Socket.java:538)
> at java.net.Socket.<init>(Socket.java:434)
> at java.net.Socket.<init>(Socket.java:244)
> at py4j.CallbackConnection.start(CallbackConnection.java:104)
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]