BryanCutler commented on a change in pull request #24834: [SPARK-27992][PYTHON]
Synchronize with Python connection thread to propagate errors
URL: https://github.com/apache/spark/pull/24834#discussion_r296829988
##########
File path: python/pyspark/rdd.py
##########
@@ -140,14 +140,29 @@ def _parse_memory(s):
 def _create_local_socket(sock_info):
-    (sockfile, sock) = local_connect_and_auth(*sock_info)
+    """
+    Create a local socket that can be used to load deserialized data from the JVM
+
+    :param sock_info: Tuple containing port number and authentication secret for a local socket.
+    :return: sockfile file descriptor of the local socket
+    """
+    port = sock_info[0]
+    auth_secret = sock_info[1]
+    sockfile, sock = local_connect_and_auth(port, auth_secret)
     # The RDD materialization time is unpredictable, if we set a timeout for socket reading
     # operation, it will very possibly fail. See SPARK-18281.
     sock.settimeout(None)
     return sockfile
 def _load_from_socket(sock_info, serializer):
Review comment:
Ugh, yeah, I'm not too happy with this. Java returns a 3-tuple of (port,
auth_secret, server), and most places, such as `_load_from_socket`, only use
the first two. That gets a little confusing, so I thought it might be better
to unpack all of the values returned by Java for `serveToStream` etc., but
that led to a lot of changes where the third value is ignored, like this:
```python
port, auth_secret, _ = ...
```
and I don't think that really made things clearer. I'll try to think of
something better and maybe do a follow-up.
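
For reference, a minimal sketch of the two options being weighed. The names
`fake_serve_to_stream`, `connect_by_index`, and `connect_by_unpacking` are
hypothetical stand-ins rather than PySpark APIs, and the port and secret
values are made up:

```python
def fake_serve_to_stream():
    # Hypothetical stand-in for the JVM call; returns (port, auth_secret, server).
    server = object()  # placeholder for the JVM-side server handle
    return (12345, "secret", server)


def connect_by_index(sock_info):
    # Approach taken in this PR: read the first two fields by index,
    # so any trailing values are simply ignored.
    port = sock_info[0]
    auth_secret = sock_info[1]
    return port, auth_secret


def connect_by_unpacking(sock_info):
    # Alternative: unpack all three values and discard the server handle.
    # This raises ValueError if sock_info is not exactly a 3-tuple.
    port, auth_secret, _ = sock_info
    return port, auth_secret


sock_info = fake_serve_to_stream()
print(connect_by_index(sock_info) == connect_by_unpacking(sock_info))
```

Indexing also accepts a plain (port, auth_secret) pair, while strict
unpacking requires every caller to pass the full 3-tuple.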