GitHub user jiangxb1987 opened a pull request:
https://github.com/apache/spark/pull/18290
[SPARK-20989][Core] Fail to start multiple workers on one host if external
shuffle service is enabled in standalone mode
## What changes were proposed in this pull request?
In standalone mode, if we enable the external shuffle service by setting
`spark.shuffle.service.enabled` to true and then try to start multiple workers
on one host (by setting `SPARK_WORKER_INSTANCES=3` in spark-env.sh and then
running `sbin/start-slaves.sh`), only one worker launches successfully on each
host and the rest fail.
The reason is that the port of the external shuffle service is configured by
`spark.shuffle.service.port`, so currently we can start no more than one
external shuffle service on each host. In this case, each worker tries to start
an external shuffle service, and only one of them succeeds. We should give an
explicit reason for the failure instead of failing silently.
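At a high level, the patch fails fast in `Worker`'s entry point when it detects
this conflict. A minimal sketch of the guard, assuming a `SparkConf` named
`conf` is already in scope (the exact placement and wording in `Worker.scala`
may differ):
```
// Illustrative fail-fast guard; the real check lives in
// org.apache.spark.deploy.worker.Worker's main method.
val externalShuffleServiceEnabled =
  conf.getBoolean("spark.shuffle.service.enabled", false)
val workerInstances = sys.env.getOrElse("SPARK_WORKER_INSTANCES", "1").toInt
// Every worker would try to bind the single spark.shuffle.service.port,
// so at most one external shuffle service can run per host.
require(!externalShuffleServiceEnabled || workerInstances <= 1,
  "Start multiple worker on one host failed because we may launch no more than one " +
  "external shuffle service on each host, please set spark.shuffle.service.enabled to " +
  "false or set SPARK_WORKER_INSTANCES to 1 to resolve the conflict.")
```
With such a guard, the second and third workers die with the
`IllegalArgumentException` shown in the logs below instead of failing silently.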
## How was this patch tested?
Tested manually via the following steps (the exact config entries are sketched
after this list):
1. Set `SPARK_WORKER_INSTANCES=3` in `conf/spark-env.sh`;
2. Set `spark.shuffle.service.enabled` to `true` in
`conf/spark-defaults.conf`;
3. Run `sbin/start-all.sh`.
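For reference, the configuration described above (illustrative;
`spark-env.sh` is sourced as a shell script, while `spark-defaults.conf` takes
`key value` pairs):
```
# conf/spark-env.sh -- request three workers on this host
SPARK_WORKER_INSTANCES=3

# conf/spark-defaults.conf -- enable the external shuffle service
spark.shuffle.service.enabled  true
```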
Before the change, you see no error on the command line:
```
starting org.apache.spark.deploy.master.Master, logging to
/Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.master.Master-1-xxx.local.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to
/Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-1-xxx.local.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to
/Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-2-xxx.local.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to
/Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-3-xxx.local.out
```
Yet the web UI shows that only one worker is running.
After the change, you get explicit error messages on the command line:
```
starting org.apache.spark.deploy.master.Master, logging to
/Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.master.Master-1-xxx.local.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to
/Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-1-xxx.local.out
localhost: failed to launch: nice -n 0
/Users/xxx/workspace/spark/bin/spark-class
org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://xxx.local:7077
localhost: 17/06/13 23:24:53 INFO SecurityManager: Changing view acls to:
xxx
localhost: 17/06/13 23:24:53 INFO SecurityManager: Changing modify acls
to: xxx
localhost: 17/06/13 23:24:53 INFO SecurityManager: Changing view acls
groups to:
localhost: 17/06/13 23:24:53 INFO SecurityManager: Changing modify acls
groups to:
localhost: 17/06/13 23:24:53 INFO SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(xxx); groups with view permissions: Set(); users with modify permissions:
Set(xxx); groups with modify permissions: Set()
localhost: 17/06/13 23:24:54 INFO Utils: Successfully started service
'sparkWorker' on port 63354.
localhost: Exception in thread "main" java.lang.IllegalArgumentException:
requirement failed: Start multiple worker on one host failed because we may
launch no more than one external shuffle service on each host, please set
spark.shuffle.service.enabled to false or set SPARK_WORKER_INSTANCES to 1 to
resolve the conflict.
localhost: at scala.Predef$.require(Predef.scala:224)
localhost: at
org.apache.spark.deploy.worker.Worker$.main(Worker.scala:752)
localhost: at
org.apache.spark.deploy.worker.Worker.main(Worker.scala)
localhost: full log in
/Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-1-xxx.local.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to
/Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-2-xxx.local.out
localhost: failed to launch: nice -n 0
/Users/xxx/workspace/spark/bin/spark-class
org.apache.spark.deploy.worker.Worker --webui-port 8082 spark://xxx.local:7077
localhost: 17/06/13 23:24:56 INFO SecurityManager: Changing view acls to:
xxx
localhost: 17/06/13 23:24:56 INFO SecurityManager: Changing modify acls
to: xxx
localhost: 17/06/13 23:24:56 INFO SecurityManager: Changing view acls
groups to:
localhost: 17/06/13 23:24:56 INFO SecurityManager: Changing modify acls
groups to:
localhost: 17/06/13 23:24:56 INFO SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(xxx); groups with view permissions: Set(); users with modify permissions:
Set(xxx); groups with modify permissions: Set()
localhost: 17/06/13 23:24:56 INFO Utils: Successfully started service
'sparkWorker' on port 63359.
localhost: Exception in thread "main" java.lang.IllegalArgumentException:
requirement failed: Start multiple worker on one host failed because we may
launch no more than one external shuffle service on each host, please set
spark.shuffle.service.enabled to false or set SPARK_WORKER_INSTANCES to 1 to
resolve the conflict.
localhost: at scala.Predef$.require(Predef.scala:224)
localhost: at
org.apache.spark.deploy.worker.Worker$.main(Worker.scala:752)
localhost: at
org.apache.spark.deploy.worker.Worker.main(Worker.scala)
localhost: full log in
/Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-2-xxx.local.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to
/Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-3-xxx.local.out
localhost: failed to launch: nice -n 0
/Users/xxx/workspace/spark/bin/spark-class
org.apache.spark.deploy.worker.Worker --webui-port 8083 spark://xxx.local:7077
localhost: 17/06/13 23:24:59 INFO SecurityManager: Changing view acls to:
xxx
localhost: 17/06/13 23:24:59 INFO SecurityManager: Changing modify acls
to: xxx
localhost: 17/06/13 23:24:59 INFO SecurityManager: Changing view acls
groups to:
localhost: 17/06/13 23:24:59 INFO SecurityManager: Changing modify acls
groups to:
localhost: 17/06/13 23:24:59 INFO SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view permissions:
Set(xxx); groups with view permissions: Set(); users with modify permissions:
Set(xxx); groups with modify permissions: Set()
localhost: 17/06/13 23:24:59 INFO Utils: Successfully started service
'sparkWorker' on port 63360.
localhost: Exception in thread "main" java.lang.IllegalArgumentException:
requirement failed: Start multiple worker on one host failed because we may
launch no more than one external shuffle service on each host, please set
spark.shuffle.service.enabled to false or set SPARK_WORKER_INSTANCES to 1 to
resolve the conflict.
localhost: at scala.Predef$.require(Predef.scala:224)
localhost: at
org.apache.spark.deploy.worker.Worker$.main(Worker.scala:752)
localhost: at
org.apache.spark.deploy.worker.Worker.main(Worker.scala)
localhost: full log in
/Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-3-xxx.local.out
```
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jiangxb1987/spark start-slave
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18290.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #18290
----
commit e22cd94a7645d8437b66c13e834df9f168fbc694
Author: Xingbo Jiang <[email protected]>
Date: 2017-06-13T14:48:10Z
update error message
----