Tian Gao created SPARK-54247:
--------------------------------
Summary: `pyspark.util._load_from_socket` is leaking a socket
connection
Key: SPARK-54247
URL: https://issues.apache.org/jira/browse/SPARK-54247
Project: Spark
Issue Type: Improvement
Components: PySpark
Affects Versions: 4.2.0
Reporter: Tian Gao
We create a `sockfile` in `pyspark.util._load_from_socket` and rely on it being
closed automatically during gc (as the comment in the code mentions). Even
though the socket is indeed closed during gc, there are a few issues:
# A warning is generated. I believe pyspark redirects the warning somewhere,
but it is still emitted. If you run tests with `python -m unittest`, your
screen will be flooded with the warning (see the snippet after this list).
# gc is not always reliable. To quote the Zen of Python: explicit is better
than implicit. We should control the life cycle of the socket ourselves;
leaving an unused socket open is not ideal.
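A minimal standalone illustration of issue 1 (plain CPython, not PySpark code): an unclosed socket that gets garbage-collected triggers a `ResourceWarning`, and that is what floods the test output.
{code:python}
import gc
import socket
import warnings

def leak():
    # Open a socket and drop the reference without closing it, mirroring
    # what happens when the stream returned by _load_from_socket is
    # drained but the underlying socket is left for gc to clean up.
    s = socket.socket()
    del s

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    leak()
    gc.collect()
    # CPython reports the leaked socket as "unclosed <socket.socket ...>".
    print([str(w.message) for w in caught if w.category is ResourceWarning])
{code}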
The reason the socket is left open is laziness: `_load_from_socket` returns an
iterator rather than the whole data at once. However, in all use cases except
one, we drain that iterator immediately.
So I propose to make this explicit. I think the best way is to turn the
function into a context manager, so the socket is always closed after use. The
calling code stays relatively clean too:
{code:python}
# From
return list(_load_from_socket(sock_info, self._jrdd_deserializer))

# To
with _load_from_socket(sock_info, self._jrdd_deserializer) as stream:
    return list(stream)
{code}
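For reference, a rough sketch of what the context-manager variant could look like. `local_connect_and_auth` and `serializer.load_stream` are the existing PySpark helpers the current implementation already uses; the exact body below is an assumption, not the final patch:
{code:python}
from contextlib import contextmanager
# Existing PySpark helper; this import path is from Spark 3.x and may
# differ in 4.x, where it lives alongside _load_from_socket.
from pyspark.java_gateway import local_connect_and_auth

@contextmanager
def _load_from_socket(sock_info, serializer):
    # local_connect_and_auth opens the socket and returns (sockfile, sock),
    # just as the current implementation does.
    (sockfile, sock) = local_connect_and_auth(*sock_info)
    try:
        # Still lazy: callers consume the stream inside the with block.
        yield serializer.load_stream(sockfile)
    finally:
        # Close explicitly instead of relying on gc.
        sockfile.close()
        sock.close()
{code}
The one caller that needs laziness simply keeps the stream open for exactly the lifetime of the with block.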
This breaks backwards compatibility for `pyspark.util._load_from_socket`, but
it is an internal private function anyway. If that is a concern, we can add a
new function instead (see the sketch below).
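If the old signature must be preserved, a thin shim could keep it while the context-manager version carries the explicit lifecycle. The name `_load_from_socket_cm` is hypothetical:
{code:python}
def _load_from_socket(sock_info, serializer):
    # Old signature preserved, but the data is drained eagerly so the
    # socket can be closed before returning. This gives up laziness,
    # which is fine for all but one caller per the analysis above.
    with _load_from_socket_cm(sock_info, serializer) as stream:
        return list(stream)
{code}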