Github user bersprockets commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20424#discussion_r164855511
  
    --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonWorkerFactory.scala ---
    @@ -191,7 +191,20 @@ private[spark] class PythonWorkerFactory(pythonExec: String, envVars: Map[String
             daemon = pb.start()
     
             val in = new DataInputStream(daemon.getInputStream)
    -        daemonPort = in.readInt()
    +        try {
    +          daemonPort = in.readInt()
    +        } catch {
    +          case exc: EOFException =>
    +            throw new IOException(s"No port number in $daemonModule's stdout")
    +        }
    +
    +        // test that the returned port number is within a valid range.
    +        // note: this does not cover the case where the port number
    +        // is arbitrary data but is also coincidentally within range
    +        if (daemonPort < 1 || daemonPort > 0xffff) {
    --- End diff --
    
    Oh, I see. Let me address the two parts of the comment separately:
    
    First part: What's the point of throwing an exception for a bad port number 
when the original handling did that already?
    
    The original handling was:
    
    <pre>
     java.lang.IllegalArgumentException: port out of range:1315905645
            at java.net.InetSocketAddress.checkPort(InetSocketAddress.java:143)
    </pre>
    
    This error occurred in a different function from the one that obtained the port number, so you first had to track down where the port number came from. That added an extra step to the original debugging of the sitecustomize.py issue.
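
    For what it's worth, here is a minimal sketch (not part of the PR; paste into the Scala REPL) showing where that bogus value comes from: readInt() interprets the first four bytes of the stray stdout text as a big-endian int, and 1315905645 above is exactly 0x4e6f206d, the same value shown in the proposed message below.

    <pre>
    import java.io.{ByteArrayInputStream, DataInputStream}

    // Simulate a daemon whose stdout starts with stray text rather than the
    // 4-byte port. Any text beginning with "No m" reproduces the value seen
    // in the stack traces; the rest of this string is just a guess.
    val stdout = "No module named pyspark.daemon\n".getBytes("UTF-8")
    val in = new DataInputStream(new ByteArrayInputStream(stdout))

    // readInt() consumes the first four bytes 'N', 'o', ' ', 'm'
    // (0x4e, 0x6f, 0x20, 0x6d) and returns them as a big-endian int.
    val daemonPort = in.readInt()
    println(daemonPort)           // 1315905645
    println(f"0x$daemonPort%08x") // 0x4e6f206d
    </pre>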
    
    The proposed handling is:
    
    <pre>
    java.io.IOException: Bad port number in pyspark.daemon's stdout: 0x4e6f206d
            at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:205)
    </pre>
    
    Here we're not saying "somehow we got a bad port number"; we're saying that a particular Python module (whose name is displayed, since that module name is configurable) returned bad data.
    
    Also, the check happens at the point where the port number is obtained, rather than later in another function where the port number is used (and where something else could, in principle, have changed it; unlikely, but otherwise something you would need to rule out).
    
    Perhaps the message could be better, e.g.,:
    
    <pre>
    Bad data in pyspark.daemon's output.
    Expected valid port number, got 0x4e6f206d.
    PYTHONPATH set to '/Users/brobbins/github/spark_fork/python/lib/pyspark.zip:/Users/brobbins/github/spark_fork/python/lib/py4j-0.10.6-src.zip:/Users/brobbins/github/spark_fork/assembly/target/scala-2.11/jars/spark-core_2.11-2.4.0-SNAPSHOT.jar:/Users/brobbins/github/spark_fork/python/lib/py4j-0.10.6-src.zip:/Users/brobbins/github/spark_fork/python/:'
    Command to run python module was 'python -m pyspark.daemon'
    Check whether you have a sitecustomize.py module that may be printing output to stdout.
    </pre>
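
    A rough sketch of how such a message could be assembled (the parameter names mirror PythonWorkerFactory's existing pythonExec, daemonModule, and envVars; the helper itself and its exact wording are hypothetical):

    <pre>
    // Hypothetical helper, not part of the PR: builds the richer error message
    // from values PythonWorkerFactory already has.
    def badPortMessage(
        port: Int,
        pythonExec: String,
        daemonModule: String,
        envVars: Map[String, String]): String = {
      val pythonPath = envVars.getOrElse("PYTHONPATH", sys.env.getOrElse("PYTHONPATH", ""))
      val portHex = f"0x$port%08x"
      s"""Bad data in $daemonModule's output.
         |Expected valid port number, got $portHex.
         |PYTHONPATH set to '$pythonPath'
         |Command to run python module was '$pythonExec -m $daemonModule'
         |Check whether you have a sitecustomize.py module that may be
         |printing output to stdout.""".stripMargin
    }
    </pre>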
    
    Second part: Why don't we fix this?
    
    That's reasonable. A couple of points:
    
    - The name of the Python daemon module is now configurable so that the module can be wrapped with customizations. It appears that this is only in the main branch and has not been released even in Spark 2.3, so it might be safe to change daemon.py (and potentially its existing wrappers) to return the port number in a different way (see the sketch after this list).
    - I don't have a good feel for how often sitecustomize.py is used, so I'm not sure of the relative value of some mild hacking up of this code versus just letting the user know what happened.
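
    Purely as an illustration of "a different way" (nothing like this exists in Spark today, and the tag name is made up): daemon.py could print a tagged line such as "PYSPARK_DAEMON_PORT=12345" instead of four raw bytes, and the Scala side could scan stdout for that tag, skipping any stray output from sitecustomize.py:

    <pre>
    import java.io.{BufferedReader, IOException}

    // Hypothetical reader-side counterpart: skip any stray lines until the
    // tagged port line appears, then parse the port from it.
    def readTaggedPort(reader: BufferedReader): Int = {
      val tag = "PYSPARK_DAEMON_PORT="
      Iterator.continually(reader.readLine())
        .takeWhile(_ != null)
        .collectFirst { case line if line.startsWith(tag) =>
          line.stripPrefix(tag).trim.toInt
        }
        .getOrElse(throw new IOException("No tagged port number in daemon stdout"))
    }
    </pre>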


