Github user bersprockets commented on a diff in the pull request:
https://github.com/apache/spark/pull/20424#discussion_r164855511
--- Diff:
core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
@@ -191,7 +191,20 @@ private[spark] class PythonWorkerFactory(pythonExec:
String, envVars: Map[String
daemon = pb.start()
val in = new DataInputStream(daemon.getInputStream)
- daemonPort = in.readInt()
+ try {
+ daemonPort = in.readInt()
+ } catch {
+ case exc: EOFException =>
+ throw new IOException(s"No port number in $daemonModule's
stdout")
+ }
+
+ // test that the returned port number is within a valid range.
+ // note: this does not cover the case where the port number
+ // is arbitrary data but is also coincidentally within range
+ if (daemonPort < 1 || daemonPort > 0xffff) {
--- End diff --
Oh, I see. Let me address the two parts of the comment separately:
First part: What's the point of throwing an exception for a bad port number
when the original handling did that already?
The original handling was:
<pre>
java.lang.IllegalArgumentException: port out of range:1315905645
at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143)
</pre>
This error occurred in a different function than the one that obtained the
port number. So you had to track down the source of the port number. This
actually added an extra step to the original debugging for the sitecustomize.py
issue.
The proposed handling is:
<pre>
java.io.IOException: Bad port number in pyspark.daemon's stdout: 0x4e6f206d
at
org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:205)
</pre>
Here we're not saying "somehow we got a bad port number", but that a
particular python module (the name of which is displayed, since the name of
that module is configurable) has returned bad data.
Also, the check is at the point at which the port number is obtained, not
waiting until sometime later in another function where the port number is used
(and where possibly something else has changed the port number, which is not
likely, but you would need to check that).
Perhaps the message could be better, e.g.,:
<pre>
Bad data in pyspark.daemon's output.
Expected valid port number, got 0x4e6f206d.
PYTHONPATH set to
'/Users/brobbins/github/spark_fork/python/lib/pyspark.zip:/Users/brobbins/github/spark_fork/python/lib/py4j-0.10.6-src.zip:/Users/brobbins/github/spark_fork/assembly/target/scala-2.11/jars/spark-core_2.11-2.4.0-SNAPSHOT.jar:/Users/brobbins/github/spark_fork/python/lib/py4j-0.10.6-src.zip:/Users/brobbins/github/spark_fork/python/:'
Command to run python module was 'python -m pyspark.daemon'
Check whether you have a sitecustomize.py module that may be printing
output to stdout.
</pre>
Second part: Why don't we fix this?
That's reasonable. A couple of points:
- The name of the python daemon module is now configurable so that the
module can be wrapped with customizations. It appears that this is only in the
main branch and not even released on Spark 2.3, so it might be safe to change
daemon.py (and potentially its existing wrappers) to return the port number in
a different way.
- I don't have a good feel for how often sitecustomize.py is used, so not
sure of the relative value of some mild hacking up of this code vs. just
letting the user know what happened.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]