[
https://issues.apache.org/jira/browse/SPARK-54247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tian Gao updated SPARK-54247:
-----------------------------
Description:
We create a `sockfile` in `pyspark.util._load_from_socket` and hope that it is
closed automatically during gc (as mentioned in the comment in the code). Even
though the socket is indeed closed during gc, there are a few issues:
# A warning is generated. I believe pyspark redirects the warning somewhere,
but it is still emitted, and if you run tests with `python -m unittest` your
screen gets flooded with it (see the snippet after this list).
# gc is not always reliable. To quote the Zen of Python: explicit is better
than implicit. We should control the lifecycle of the socket ourselves;
keeping an unused socket open is not ideal.
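For reference, a minimal standalone repro of the kind of warning involved
(presumably the unclosed-socket ResourceWarning; the exact message depends on
the Python version, and whether it is shown depends on the warning filters,
which `python -m unittest` enables by default):
{code:python}
import gc
import socket
import warnings

# Make ResourceWarning visible even outside dev mode / unittest.
warnings.simplefilter("always", ResourceWarning)

s = socket.socket()
del s          # never closed explicitly; CPython's finalizer emits
               # "ResourceWarning: unclosed <socket.socket ...>"
gc.collect()   # not strictly needed under CPython, but mirrors the
               # "closed during gc" situation described above
{code}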
The reason we leave the socket open is laziness: `_load_from_socket` returns
an iterator rather than the whole data. However, in all use cases except one,
we drain that iterator immediately.
So I propose to make this more explicit. I think the best way to do it is to
turn the function into a context manager, so the socket is always closed after
it has been used. The calling code stays relatively clean too:
{code:python}
# From
return list(_load_from_socket(sock_info, self._jrdd_deserializer))

# To
with _load_from_socket(sock_info, self._jrdd_deserializer) as stream:
    return list(stream)
{code}
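For concreteness, a minimal sketch of what the context-manager version could
look like (the helper name `local_connect_and_auth`, its import path, and the
serializer API are taken from memory of the existing code and should be
treated as assumptions rather than the final implementation):
{code:python}
from contextlib import contextmanager

# Assumed import path for the existing connect-and-auth helper; adjust to
# wherever it actually lives in pyspark.
from pyspark.java_gateway import local_connect_and_auth


@contextmanager
def _load_from_socket(sock_info, serializer):
    # Open and authenticate the local socket, as the current implementation does.
    (sockfile, sock) = local_connect_and_auth(*sock_info)
    try:
        # Still lazy: callers drain the deserialization stream inside the
        # `with` block.
        yield serializer.load_stream(sockfile)
    finally:
        # Close explicitly instead of relying on gc.
        sockfile.close()
        sock.close()
{code}
The one call site that does not drain the iterator immediately would simply
need to keep the `with` block open for as long as it consumes the stream.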
This change is not backwards compatible for `pyspark.util._load_from_socket`,
but that is an internal private function anyway. If it is a concern, we can
add a new function instead.
> `pyspark.util._load_from_socket` is leaking a socket connection
> ----------------------------------------------------------------
>
> Key: SPARK-54247
> URL: https://issues.apache.org/jira/browse/SPARK-54247
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 4.2.0
> Reporter: Tian Gao
> Priority: Minor
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]