Github user ash211 commented on the pull request:
https://github.com/apache/spark/pull/1107#issuecomment-47497102
With these changes, I think I'm able to make this work for the majority of
cases.
There is still work to be done to handle the situation where a Spark
application runs on the same machine as a worker, though. Currently the
application takes the port intended for an executor, which causes the executor
on that server to land on port n+1 while all other executors are on port n.
This occurs for `spark.fileserver.port` and `spark.blockManager.port`.
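To illustrate the n+1 behavior, here's a minimal sketch of the bind-and-retry fallback (my own illustration, not Spark's actual retry code): when the desired port is already taken, the service walks forward until it finds a free one.

```shell
# next_free_port PORT TAKEN... : return the first port >= PORT that is
# not in the TAKEN list. This mimics the fallback where the app binding
# 40040 first pushes the executor on the same host to 40041.
next_free_port() {
  port=$1; shift
  while printf '%s\n' "$@" | grep -qx "$port"; do
    port=$((port + 1))
  done
  echo "$port"
}

next_free_port 40040 40040   # prints 40041: the executor gets bumped
```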
Fresh cluster, no applications:
```
$ clear; jps | grep -v ' $' | grep -v 'Jps' | sort ; lsof -i 4tcp -P | grep java | awk '{print "pid", $2, $9}'
85073 Master
85187 Worker
86636 sbt-launch-0.13.2.jar
pid 85073 aash-mbp:7077
pid 85073 *:8080
pid 85073 aash-mbp:7077->aash-mbp:49582
pid 85187 aash-mbp:49581
pid 85187 *:8081
pid 85187 aash-mbp:49582->aash-mbp:7077
```
The connections can be summarized as:
- Worker -> Master:7077
After starting spark-shell like this:
```
./bin/spark-shell --master spark://aash-mbp.local:7077 --driver-java-options "-Dspark.fileserver.port=40010 -Dspark.broadcast.port=40020 -Dspark.replClassServer.port=40030 -Dspark.blockManager.port=40040 -Dspark.driver.port=40050"
```
```
$ clear; jps | grep -v ' $' | grep -v 'Jps' | sort ; lsof -i 4tcp -P | grep java | awk '{print "pid", $2, $9}'
85073 Master
85187 Worker
86247 SparkSubmit
86325 CoarseGrainedExecutorBackend
pid 85073 aash-mbp:7077
pid 85073 *:8080
pid 85073 aash-mbp:7077->aash-mbp:49582
pid 85073 aash-mbp:7077->aash-mbp:49932
pid 85187 aash-mbp:49581
pid 85187 *:8081
pid 85187 aash-mbp:49582->aash-mbp:7077
pid 85187 aash-mbp:49581->aash-mbp:49935
pid 86247 *:40030
pid 86247 aash-mbp:40050
pid 86247 *:40040
pid 86247 *:40020
pid 86247 *:40010
pid 86247 *:4040
pid 86247 aash-mbp:49932->aash-mbp:7077
pid 86247 aash-mbp:40050->aash-mbp:49934
pid 86247 aash-mbp:40050->aash-mbp:49937
pid 86325 aash-mbp:49933
pid 86325 aash-mbp:49934->aash-mbp:40050
pid 86325 aash-mbp:49935->aash-mbp:49581
pid 86325 aash-mbp:49936
pid 86325 aash-mbp:49937->aash-mbp:40050
pid 86325 *:40041
pid 86325 *:40011
```
These are summarized as:
- Worker:eph -> Master:7077
- SparkSubmit:eph -> Master:7077
- Executor:eph -> SparkSubmit:40050
- Executor:eph -> SparkSubmit:40050 (again?)
- Worker:eph -> Executor:eph
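Producing that summary by hand gets tedious; a small awk filter (my own helper, not part of Spark) can pull the established connections out of the listing. The sample input below is copied from the listing above; against a live system, pipe the earlier `lsof | awk` output through the same filter.

```shell
# Print only the established connections (lines containing "->") from
# the pid/port listing, one "pid N: local -> remote" line each.
printf '%s\n' \
  'pid 85187 aash-mbp:49582->aash-mbp:7077' \
  'pid 86325 aash-mbp:49934->aash-mbp:40050' \
  'pid 86325 *:40041' |
awk '/->/ { split($3, a, "->"); print "pid " $2 ": " a[1] " -> " a[2] }'
```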
The outstanding items I'd still like to address are:
- resolve the port conflict between an application and an executor on the same
server. I think the right way to do this might be separate config settings,
`spark.fileserver.app.port` and `spark.fileserver.executor.port`, and likewise
for `spark.blockManager.port`. This would allow ops teams to configure the
network activity on the cluster and executor ports separately from dev teams
running jobs on the cluster.
- better logging for port failover, ideally specific to the service name
(turn "failed to start service on port X" into "failed to start block manager
on port X")
And lower priority:
- figure out why two connections are made from executor -> app on the app's
`spark.driver.port` rather than just one
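A sketch of how the proposed split could resolve, where `spark.fileserver.app.port` and `spark.fileserver.executor.port` are the hypothetical names from the bullet above (the real lookup would live inside SparkConf, not in shell):

```shell
# resolve_port ROLE_SPECIFIC SHARED : prefer the role-specific port
# (e.g. the hypothetical spark.fileserver.app.port) and fall back to
# the shared spark.fileserver.port when no override is set.
resolve_port() {
  if [ -n "$1" ]; then echo "$1"; else echo "$2"; fi
}

resolve_port ""    40010   # prints 40010: no app override, shared port
resolve_port 40110 40010   # prints 40110: app override wins
```

With distinct ports per role, the app and executor on a shared host would never race for the same bind.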
Since all the listening ports (`*:40010` etc.) are now explicitly specified,
though, I think this is moving closer to being ready.
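As a sanity check on those fixed ports, something like the following confirms each configured port shows up as a listener. It's run here against a captured listing; against a live system, replace the `$listing` variable with the `lsof | awk` pipeline used above.

```shell
# Verify each configured port appears as a "*:PORT" listener in the
# captured lsof summary.
listing='pid 86247 *:40010
pid 86247 *:40020
pid 86247 *:40030'

for p in 40010 40020 40030; do
  if echo "$listing" | grep -q ":$p\$"; then
    echo "port $p: listening"
  else
    echo "port $p: MISSING"
  fi
done
```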