[
https://issues.apache.org/jira/browse/SUBMARINE-457?focusedWorklogId=422011&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-422011
]
ASF GitHub Bot logged work on SUBMARINE-457:
--------------------------------------------
Author: ASF GitHub Bot
Created on: 14/Apr/20 12:52
Start Date: 14/Apr/20 12:52
Worklog Time Spent: 10m
Work Description: lowc1012 commented on pull request #262: SUBMARINE-457.
Run TF MNIST example using Docker Container failed in mini-submarine
URL: https://github.com/apache/submarine/pull/262
### What is this PR for?
The error "javax.security.auth.login.LoginException:
java.lang.NullPointerException: invalid null input: name" occurs because the
UID of the user on the NodeManager host does not match the UID of the user
inside the container.
So I think we only need to improve the documentation.
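One way to check for the mismatch described above is to ask, from inside the container, whether the UID the process runs as has an entry in the user database. The sketch below is an illustration, not part of this PR: `uid_to_name` and `check_current_uid` are hypothetical helper names, and it assumes `getent` is available in the image. As far as I understand, the JVM's Unix login module receives no user name when this lookup fails, which would match the "invalid null input: name" message.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Submarine or Hadoop).

# Print the user name mapped to a numeric UID.
# Returns non-zero if the UID has no passwd entry.
uid_to_name() {
    # Assumption: getent is installed; a numeric key is looked up as a UID.
    entry=$(getent passwd "$1") || return 1
    # The user name is the first colon-separated field of the entry.
    printf '%s\n' "${entry%%:*}"
}

# Report whether the UID of the current process resolves to a user name.
check_current_uid() {
    uid=$(id -u)
    if name=$(uid_to_name "$uid"); then
        echo "UID $uid resolves to user '$name'"
    else
        echo "UID $uid has no passwd entry in this environment" >&2
        return 1
    fi
}
```

Running `check_current_uid` in a container started with the UID of the NodeManager host's `yarn` user should show whether the image knows that UID. If it does not, creating a user with the same UID in the image is one possible remedy.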
### What type of PR is it?
[Bug Fix]
### Todos
### What is the Jira issue?
[SUBMARINE-457](https://issues.apache.org/jira/projects/SUBMARINE/issues/SUBMARINE-457)
### How should this be tested?
[passed CI](https://travis-ci.org/github/lowc1012/submarine/builds/674784433)
### Screenshots (if appropriate)
### Questions:
* Do the license files need an update? No
* Are there breaking changes for older versions? No
* Does this need documentation? No
Issue Time Tracking
-------------------
Worklog Id: (was: 422011)
Remaining Estimate: 0h
Time Spent: 10m
> Run TF MNIST example using Docker Container failed in mini-submarine
> ---------------------------------------------------------------------
>
> Key: SUBMARINE-457
> URL: https://issues.apache.org/jira/browse/SUBMARINE-457
> Project: Apache Submarine
> Issue Type: Bug
> Components: Mini Submarine
> Affects Versions: 0.4.0
> Reporter: Ryan Lo
> Assignee: Ryan Lo
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> I tried to run mnist_distributed.py in a Docker container, and the launch failed.
> The following is my command; the Docker image tf-1.13.1-cpu-base:0.0.1
> was built in advance in mini-submarine.
> {code:java}
> java -cp $(hadoop classpath --glob):/opt/submarine-current/submarine-all-0.3.0-SNAPSHOT-hadoop-3.2.jar \
> org.apache.submarine.client.cli.Cli job run --name tf-job-001 \
> --framework tensorflow \
> --docker_image tf-1.13.1-cpu-base:0.0.1 \
> --input_path "" \
> --num_ps 1 \
> --ps_launch_cmd "export CLASSPATH=\$(/hadoop-current/bin/hadoop classpath --glob) && python mnist_distributed.py --steps 2 --data_dir /tmp/data --working_dir /tmp/mode" \
> --ps_resources memory=1G,vcores=1 \
> --num_workers 2 \
> --worker_resources memory=1G,vcores=1 \
> --worker_launch_cmd "export CLASSPATH=\$(/hadoop-current/bin/hadoop classpath --glob) && python mnist_distributed.py --steps 2 --data_dir /tmp/data --working_dir /tmp/mode" \
> --env JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
> --env DOCKER_HADOOP_HDFS_HOME=/hadoop-current \
> --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
> --env HADOOP_HOME=/hadoop-current \
> --env HADOOP_YARN_HOME=/hadoop-current \
> --env HADOOP_COMMON_HOME=hadoop-current \
> --env HADOOP_HDFS_HOME=/hadoop-current \
> --env HADOOP_CONF_DIR=/hadoop-current/etc/hadoop \
> --conf tony.containers.resources=/opt/submarine-current/submarine-all-0.3.0-SNAPSHOT-hadoop-3.2.jar,/home/yarn/submarine/mnist_distributed.py
> {code}
> The following is partial NodeManager log.
> {code:java}
> 2020-03-25 13:48:32,728 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
> Container container_1585136148243_0006_01_000001 transitioned from SCHEDULED
> to RUNNING
> 2020-03-25 13:48:32,728 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> Starting resource-monitoring for container_1585136148243_0006_01_000001
> 2020-03-25 13:48:32,740 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime:
> setting hostname in container to: ctr-1585136148243-0006-01-000001
> 2020-03-25 13:48:34,605 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime:
> Docker inspect output for container_1585136148243_0006_01_000001:
> ,ctr-1585136148243-0006-01-000001
> 2020-03-25 13:48:34,605 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> container_1585136148243_0006_01_000001's ip = , and hostname =
> ctr-1585136148243-0006-01-000001
> 2020-03-25 13:48:34,613 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
> Skipping monitoring container container_1585136148243_0006_01_000001 since
> CPU usage is not yet available.
> 2020-03-25 13:48:36,234 WARN
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor:
> Shell execution returned exit code: 255. Privileged Execution Operation
> Stderr:
> Docker container exit code was not zero: 255
> Unable to read from docker logs(ferror, feof): 0 1
> Stdout: main : command provided 4
> main : run as user is yarn
> main : requested yarn user is yarn
> Creating script paths...
> Creating local dirs...
> Getting exit code file...
> Changing effective user to root...
> Launching docker container...
> Inspecting docker container...
> Writing to cgroup task files...
> Writing pid file...
> Writing to tmp file
> /var/lib/hadoop-yarn/cache/yarn/nm-local-dir/nmPrivate/application_1585136148243_0006/container_1585136148243_0006_01_000001/container_1585136148243_0006_01_000001.pid.tmp
> container_1585136148243_0006_01_000001
> Waiting for docker container to finish...
> Removing docker container post-exit...
> {code}
> The following is AM stdout.log.
> {code:java}
> ========================================================================
> LogType:amstdout.log
> LogLastModifiedTime:Wed Mar 25 13:02:27 +0000 2020
> LogLength:6468
> LogContents:
> [WARN ] 2020-03-25 13:02:25,503
> method:org.apache.hadoop.util.NativeCodeLoader.<clinit>(NativeCodeLoader.java:60)
> Unable to load native-hadoop library for your platform... using builtin-java
> classes where applicable
> [ERROR] 2020-03-25 13:02:25,613
> method:com.linkedin.tony.ApplicationMaster.init(ApplicationMaster.java:217)
> Failed to create FileSystem object
> org.apache.hadoop.security.KerberosAuthException: failure to login:
> javax.security.auth.login.LoginException: java.lang.NullPointerException:
> invalid null input: name
> at com.sun.security.auth.UnixPrincipal.<init>(UnixPrincipal.java:71)
> at
> com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:133)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
> at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
> at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
> at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
> at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
> at
> org.apache.hadoop.security.UserGroupInformation$HadoopLoginContext.login(UserGroupInformation.java:1926)
> at
> org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1837)
> at
> org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:710)
> at
> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:660)
> at
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
> at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3487)
> at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3477)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3319)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:227)
> at com.linkedin.tony.ApplicationMaster.init(ApplicationMaster.java:215)
> at com.linkedin.tony.ApplicationMaster.run(ApplicationMaster.java:305)
> at com.linkedin.tony.ApplicationMaster.main(ApplicationMaster.java:293)
> at
> org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1847)
> at
> org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:710)
> at
> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:660)
> at
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
> at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3487)
> at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3477)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3319)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:227)
> at com.linkedin.tony.ApplicationMaster.init(ApplicationMaster.java:215)
> at com.linkedin.tony.ApplicationMaster.run(ApplicationMaster.java:305)
> at com.linkedin.tony.ApplicationMaster.main(ApplicationMaster.java:293)
> Caused by: javax.security.auth.login.LoginException:
> java.lang.NullPointerException: invalid null input: name
> at com.sun.security.auth.UnixPrincipal.<init>(UnixPrincipal.java:71)
> at
> com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:133)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
> at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
> at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
> at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
> at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
> at
> org.apache.hadoop.security.UserGroupInformation$HadoopLoginContext.login(UserGroupInformation.java:1926)
> at
> org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1837)
> at
> org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:710)
> at
> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:660)
> at
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
> at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3487)
> at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3477)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3319)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:227)
> at com.linkedin.tony.ApplicationMaster.init(ApplicationMaster.java:215)
> at com.linkedin.tony.ApplicationMaster.run(ApplicationMaster.java:305)
> at com.linkedin.tony.ApplicationMaster.main(ApplicationMaster.java:293)
> at javax.security.auth.login.LoginContext.invoke(LoginContext.java:856)
> at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
> at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
> at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
> at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
> at
> org.apache.hadoop.security.UserGroupInformation$HadoopLoginContext.login(UserGroupInformation.java:1926)
> at
> org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1837)
> ... 11 more
> [INFO ] 2020-03-25 13:02:25,618
> method:com.linkedin.tony.ApplicationMaster.main(ApplicationMaster.java:298)
> Application Master failed. Exiting
> End of LogType:amstdout.log
> *****************************************************************************
> {code}