GitHub user jiangxb1987 opened a pull request:

    https://github.com/apache/spark/pull/18290

    [SPARK-20989][Core] Fail to start multiple workers on one host if external shuffle service is enabled in standalone mode

    ## What changes were proposed in this pull request?
    
    In standalone mode, if we enable the external shuffle service by setting `spark.shuffle.service.enabled` to true, and then try to start multiple workers on one host (by setting `SPARK_WORKER_INSTANCES=3` in spark-env.sh and running `sbin/start-slaves.sh`), only one worker launches successfully on each host and the rest fail to launch.
    The reason is that the port of the external shuffle service is configured by `spark.shuffle.service.port`, so currently we can start no more than one external shuffle service on each host. In our case, each worker tries to start an external shuffle service, and only one of them succeeds in doing so.
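    To see why only one can succeed: with a fixed port, the first bind wins and every later bind on the same host fails. A minimal JVM-level illustration (a hypothetical demo, not Spark code; 7337 is the default value of `spark.shuffle.service.port`):
    ```scala
    import java.net.ServerSocket

    // Illustration only: two services trying to listen on the same fixed port
    // on one host -- the second bind fails, which is why at most one external
    // shuffle service (and hence one worker) can come up.
    object PortConflictDemo extends App {
      val port = 7337                      // default spark.shuffle.service.port
      val first = new ServerSocket(port)   // first worker's shuffle service binds fine
      val second = new ServerSocket(port)  // throws java.net.BindException: Address already in use
    }
    ```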
    
    We should give an explicit reason for the failure instead of failing silently.
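    Concretely, the change amounts to a `require` check at worker startup, along the lines of the sketch below (a sketch only; variable names are illustrative and the exact placement in `Worker.scala` may differ -- in the real patch the worker's `SparkConf` is already in scope):
    ```scala
    import org.apache.spark.SparkConf

    // Sketch of the startup guard: refuse to start when external shuffle
    // service is enabled and more than one worker instance is requested.
    val conf = new SparkConf()
    val externalShuffleServiceEnabled = conf.getBoolean("spark.shuffle.service.enabled", false)
    val workerInstances = sys.env.getOrElse("SPARK_WORKER_INSTANCES", "1").toInt
    require(!externalShuffleServiceEnabled || workerInstances <= 1,
      "Start multiple worker on one host failed because we may launch no more than one " +
        "external shuffle service on each host, please set spark.shuffle.service.enabled to " +
        "false or set SPARK_WORKER_INSTANCES to 1 to resolve the conflict.")
    ```
    With such a guard in place, the extra workers fail fast with the `requirement failed` message shown in the logs below.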
    
    ## How was this patch tested?
    Manually tested with the following steps:
    1. Set `SPARK_WORKER_INSTANCES=3` in `conf/spark-env.sh`;
    2. Set `spark.shuffle.service.enabled` to `true` in `conf/spark-defaults.conf`;
    3. Run `sbin/start-all.sh`.
    
    Before the change, you will see no errors on the command line:
    ```
    starting org.apache.spark.deploy.master.Master, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.master.Master-1-xxx.local.out
    localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-1-xxx.local.out
    localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-2-xxx.local.out
    localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-3-xxx.local.out
    ```
    And you can see in the web UI that only one worker is running.
    
    After the change, you get explicit error messages on the command line:
    ```
    starting org.apache.spark.deploy.master.Master, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.master.Master-1-xxx.local.out
    localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-1-xxx.local.out
    localhost: failed to launch: nice -n 0 /Users/xxx/workspace/spark/bin/spark-class org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://xxx.local:7077
    localhost:   17/06/13 23:24:53 INFO SecurityManager: Changing view acls to: xxx
    localhost:   17/06/13 23:24:53 INFO SecurityManager: Changing modify acls to: xxx
    localhost:   17/06/13 23:24:53 INFO SecurityManager: Changing view acls groups to: 
    localhost:   17/06/13 23:24:53 INFO SecurityManager: Changing modify acls groups to: 
    localhost:   17/06/13 23:24:53 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(xxx); groups with view permissions: Set(); users  with modify permissions: Set(xxx); groups with modify permissions: Set()
    localhost:   17/06/13 23:24:54 INFO Utils: Successfully started service 'sparkWorker' on port 63354.
    localhost:   Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Start multiple worker on one host failed because we may launch no more than one external shuffle service on each host, please set spark.shuffle.service.enabled to false or set SPARK_WORKER_INSTANCES to 1 to resolve the conflict.
    localhost:          at scala.Predef$.require(Predef.scala:224)
    localhost:          at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:752)
    localhost:          at org.apache.spark.deploy.worker.Worker.main(Worker.scala)
    localhost: full log in /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-1-xxx.local.out
    localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-2-xxx.local.out
    localhost: failed to launch: nice -n 0 /Users/xxx/workspace/spark/bin/spark-class org.apache.spark.deploy.worker.Worker --webui-port 8082 spark://xxx.local:7077
    localhost:   17/06/13 23:24:56 INFO SecurityManager: Changing view acls to: xxx
    localhost:   17/06/13 23:24:56 INFO SecurityManager: Changing modify acls to: xxx
    localhost:   17/06/13 23:24:56 INFO SecurityManager: Changing view acls groups to: 
    localhost:   17/06/13 23:24:56 INFO SecurityManager: Changing modify acls groups to: 
    localhost:   17/06/13 23:24:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(xxx); groups with view permissions: Set(); users  with modify permissions: Set(xxx); groups with modify permissions: Set()
    localhost:   17/06/13 23:24:56 INFO Utils: Successfully started service 'sparkWorker' on port 63359.
    localhost:   Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Start multiple worker on one host failed because we may launch no more than one external shuffle service on each host, please set spark.shuffle.service.enabled to false or set SPARK_WORKER_INSTANCES to 1 to resolve the conflict.
    localhost:          at scala.Predef$.require(Predef.scala:224)
    localhost:          at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:752)
    localhost:          at org.apache.spark.deploy.worker.Worker.main(Worker.scala)
    localhost: full log in /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-2-xxx.local.out
    localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-3-xxx.local.out
    localhost: failed to launch: nice -n 0 /Users/xxx/workspace/spark/bin/spark-class org.apache.spark.deploy.worker.Worker --webui-port 8083 spark://xxx.local:7077
    localhost:   17/06/13 23:24:59 INFO SecurityManager: Changing view acls to: xxx
    localhost:   17/06/13 23:24:59 INFO SecurityManager: Changing modify acls to: xxx
    localhost:   17/06/13 23:24:59 INFO SecurityManager: Changing view acls groups to: 
    localhost:   17/06/13 23:24:59 INFO SecurityManager: Changing modify acls groups to: 
    localhost:   17/06/13 23:24:59 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(xxx); groups with view permissions: Set(); users  with modify permissions: Set(xxx); groups with modify permissions: Set()
    localhost:   17/06/13 23:24:59 INFO Utils: Successfully started service 'sparkWorker' on port 63360.
    localhost:   Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Start multiple worker on one host failed because we may launch no more than one external shuffle service on each host, please set spark.shuffle.service.enabled to false or set SPARK_WORKER_INSTANCES to 1 to resolve the conflict.
    localhost:          at scala.Predef$.require(Predef.scala:224)
    localhost:          at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:752)
    localhost:          at org.apache.spark.deploy.worker.Worker.main(Worker.scala)
    localhost: full log in /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-3-xxx.local.out
    ```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jiangxb1987/spark start-slave

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18290.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18290
    
----
commit e22cd94a7645d8437b66c13e834df9f168fbc694
Author: Xingbo Jiang <[email protected]>
Date:   2017-06-13T14:48:10Z

    update error message

----

