Github user ash211 commented on the pull request:

    https://github.com/apache/spark/pull/1107#issuecomment-47497102
  
    With these changes, I think I'm able to make this work for the majority 
case.
    
    There will still be some work needed to handle the situation where a Spark application runs on the same machine as a worker, though.  Currently the application claims the configured port, forcing the colocated executor onto port n+1 while the executors on all other servers listen on port n.  This happens for both `spark.fileserver.port` and `spark.blockManager.port`.
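
    As a rough sketch of that fallback behavior (a hypothetical illustration, not Spark's actual code), the colocated executor starts from the configured port and walks upward past ports the app has already claimed:

    ```shell
    # Hypothetical illustration of the n+1 port fallback described above.
    # next_free_port takes a starting port plus the list of taken ports and
    # prints the first port not in that list.
    next_free_port() {
      port=$1; shift
      while echo " $* " | grep -q " $port "; do
        port=$((port + 1))
      done
      echo "$port"
    }

    taken="40010 40040"             # ports the application already bound
    next_free_port 40010 $taken     # prints 40011: the executor's fileserver port
    ```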
    
    
    Fresh cluster, no applications:
    ```
    $ clear; jps | grep -v ' $' | grep -v 'Jps' | sort ; lsof -i 4tcp -P | grep java | awk '{print "pid", $2, $9}'
    85073 Master
    85187 Worker
    86636 sbt-launch-0.13.2.jar
    pid 85073 aash-mbp:7077
    pid 85073 *:8080
    pid 85073 aash-mbp:7077->aash-mbp:49582
    pid 85187 aash-mbp:49581
    pid 85187 *:8081
    pid 85187 aash-mbp:49582->aash-mbp:7077
    ```
    
    The connections can be summarized as:
     - Worker -> Master:7077
    
    After starting spark-shell like this:
    `./bin/spark-shell --master spark://aash-mbp.local:7077 --driver-java-options "-Dspark.fileserver.port=40010 -Dspark.broadcast.port=40020 -Dspark.replClassServer.port=40030 -Dspark.blockManager.port=40040 -Dspark.driver.port=40050"`
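
    For reference, the same options could presumably also live in `conf/spark-defaults.conf` rather than on the command line (an assumption about the usual properties-file form; I've only tested the `--driver-java-options` route above):

    ```
    spark.fileserver.port      40010
    spark.broadcast.port       40020
    spark.replClassServer.port 40030
    spark.blockManager.port    40040
    spark.driver.port          40050
    ```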
    
    ```
    $ clear; jps | grep -v ' $' | grep -v 'Jps' | sort ; lsof -i 4tcp -P | grep java | awk '{print "pid", $2, $9}'
    85073 Master
    85187 Worker
    86247 SparkSubmit
    86325 CoarseGrainedExecutorBackend
    pid 85073 aash-mbp:7077
    pid 85073 *:8080
    pid 85073 aash-mbp:7077->aash-mbp:49582
    pid 85073 aash-mbp:7077->aash-mbp:49932
    pid 85187 aash-mbp:49581
    pid 85187 *:8081
    pid 85187 aash-mbp:49582->aash-mbp:7077
    pid 85187 aash-mbp:49581->aash-mbp:49935
    pid 86247 *:40030
    pid 86247 aash-mbp:40050
    pid 86247 *:40040
    pid 86247 *:40020
    pid 86247 *:40010
    pid 86247 *:4040
    pid 86247 aash-mbp:49932->aash-mbp:7077
    pid 86247 aash-mbp:40050->aash-mbp:49934
    pid 86247 aash-mbp:40050->aash-mbp:49937
    pid 86325 aash-mbp:49933
    pid 86325 aash-mbp:49934->aash-mbp:40050
    pid 86325 aash-mbp:49935->aash-mbp:49581
    pid 86325 aash-mbp:49936
    pid 86325 aash-mbp:49937->aash-mbp:40050
    pid 86325 *:40041
    pid 86325 *:40011
    ```
    
    These are summarized as:
     - Worker:eph -> Master:7077
     - SparkSubmit:eph -> Master:7077
     - Executor:eph -> SparkSubmit:40050
     - Executor:eph -> SparkSubmit:40050 (again?)
     - Worker:eph -> Executor:eph
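
    The arrow summary above can be derived mechanically from the lsof output.  A rough sketch (the pid-to-role and port-to-service tables are hard-coded from this particular run, and only the client side of each connection is fed in):

    ```shell
    # Turn `pid <pid> <src>-><dst>` lines from the lsof pipeline above into
    # role-level arrows like "Worker:eph -> Master:7077".
    summary=$(printf '%s\n' \
        "pid 85187 aash-mbp:49582->aash-mbp:7077" \
        "pid 86247 aash-mbp:49932->aash-mbp:7077" \
        "pid 86325 aash-mbp:49934->aash-mbp:40050" \
        "pid 86325 aash-mbp:49937->aash-mbp:40050" |
      awk '
        BEGIN {
          role[85073] = "Master";      role[85187] = "Worker"
          role[86247] = "SparkSubmit"; role[86325] = "Executor"
          svc[7077] = "Master"; svc[40050] = "SparkSubmit"
        }
        $1 == "pid" && $3 ~ /->/ {
          split($3, ep, "->")
          n = split(ep[2], d, ":"); port = d[n]
          dst = (port in svc) ? (svc[port] ":" port) : ("eph:" port)
          print role[$2] ":eph -> " dst
        }
      ')
    echo "$summary"
    ```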
    
    
    The outstanding things I'd still like to do for this are:
    - figure out how to resolve the conflict between an app and an executor on the same server.  I think the right way to do this might be separate config settings, `spark.fileserver.app.port` and `spark.fileserver.executor.port`, and likewise for `spark.blockManager.port`.  This would let ops teams configure the cluster's and executors' network activity separately from the dev teams running jobs on the cluster.
    - better logging for port failover, ideally specific to the service name 
(turn "failed to start service on port X" into "failed to start block manager 
on port X")
    
    And lower priority:
    - figure out why two connections are made from executor -> app on the app's 
`spark.driver.port` rather than just one
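
    To make the proposed split concrete (these are hypothetical setting names from the first bullet above, not ones Spark currently recognizes):

    ```
    # app-side services keep the ops-configured base ports...
    spark.fileserver.app.port        40010
    spark.blockManager.app.port      40040
    # ...while executor-side services bind a separate, known range
    spark.fileserver.executor.port   40011
    spark.blockManager.executor.port 40041
    ```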
    
    Because all the listening ports (`*:40010`, etc.) are now explicitly specified, I think this is moving closer to being ready.

