liuzx32 opened a new issue #10984: Some wrong with training on yarn!?
URL: https://github.com/apache/incubator-mxnet/issues/10984

The following command, using --cluster=local, runs successfully:

    ../../tools/launch.py -n 2 -s 2 --cluster=local python train_mnist.py --network lenet --kv-store dist_sync

So I changed it to --cluster=yarn to run MXNet on YARN:

    ../../tools/launch.py -n 2 -s 2 --cluster=yarn python train_mnist.py --network lenet --kv-store dist_sync

## But the job failed

    Exit code: 1
    18/05/16 12:55:16 INFO dmlc.ApplicationMaster: onContainerStarted Invoked
    18/05/16 12:55:16 INFO dmlc.ApplicationMaster: onContainerStarted Invoked
    18/05/16 12:55:17 INFO dmlc.ApplicationMaster: [DMLC] Task 0 exited with status 1 Diagnostics:Exception from container-launch.
    Container id: container_1498108406715_361607_01_000002
    Exit code: 1
    Stack trace: ExitCodeException exitCode=1:
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)
        at org.apache.hadoop.util.Shell.run(Shell.java:479)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
    Container exited with a non-zero exit code 1
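
The YARN stack trace above only reports that the container exited with code 1; the actual error from the worker process would normally be in that container's own stderr. As a rough sketch, assuming log aggregation is enabled on the cluster and the `yarn` CLI is on the PATH, the application ID can be derived from the container ID in the diagnostics and used to build the log-fetching command. The helper names `application_id` and `yarn_logs_command` are hypothetical and not part of MXNet or dmlc-core.

```python
import re

# Container ID copied from the diagnostics in the report above.
CONTAINER_ID = "container_1498108406715_361607_01_000002"


def application_id(container_id):
    """Derive the YARN application ID from a container ID.

    Assumes the layout container_<clusterTimestamp>_<appSeq>_<attempt>_<containerSeq>,
    optionally with an epoch field such as "e17" right after "container".
    """
    match = re.match(r"container_(?:e\d+_)?(\d+)_(\d+)_\d+_\d+$", container_id)
    if match is None:
        raise ValueError("unrecognized container ID: %s" % container_id)
    return "application_%s_%s" % match.groups()


def yarn_logs_command(container_id):
    """Build the argument list for fetching the aggregated application logs."""
    return ["yarn", "logs", "-applicationId", application_id(container_id)]


if __name__ == "__main__":
    # Print the command rather than running it, so it can be reviewed first.
    print(" ".join(yarn_logs_command(CONTAINER_ID)))
```

Running the printed command on a node with access to the cluster should show the stdout and stderr of each container, which is where a Python traceback or a missing-library error from `train_mnist.py` would appear if that is the cause.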