Björn, I don't know if you're still experimenting with Myriad, but I believe I've got a fix for your issue. I'm going to try to get it into our next release, so any feedback would be great. I've verified it on a couple of small systems.
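For anyone else hitting the `error=24` case from this thread: in my experience that exit code means the container-executor binary rejected its configuration. As a sketch (these are standard YARN property names; the "yarn" group value is the one assumed in this thread, not a Myriad-specific default), the node manager's yarn-site.xml would look roughly like:

```xml
<!-- Sketch only: standard YARN properties for enabling the
     LinuxContainerExecutor. The "yarn" group is an assumption taken
     from this thread; use whatever group owns your NM processes. -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.group</name>
  <value>yarn</value>
</property>
```

The same group usually also has to appear in the container-executor.cfg file that sits next to the container-executor binary (`yarn.nodemanager.linux-container-executor.group=yarn`); as far as I can tell, the "Can't get configured value" message with exitCode=24 points at that file rather than at yarn-site.xml. And if you disable cgroups, these entries should be removed again so the NM falls back to the DefaultContainerExecutor.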
https://github.com/apache/incubator-myriad/pull/69

On Wed, Mar 23, 2016 at 8:17 AM, Darin Johnson <dbjohnson1...@gmail.com> wrote:
> Hey Björn, sorry for the delay. Looking at the difference between the
> exceptions and my own experience, I believe you left some cgroup configs
> in the yarn-site.xml of the node manager.
>
> On Mar 18, 2016 2:58 AM, "Björn Hagemeier" <b.hageme...@fz-juelich.de> wrote:
>> Hi Darin,
>>
>> thanks a lot for this. But what about the other case below, when cgroups
>> is disabled?
>>
>> Björn
>>
>> On 18.03.2016 at 00:25, Darin Johnson wrote:
>>> Hey Björn,
>>>
>>> I think I figured out the issue. Some of the values for cgroups are
>>> still hardcoded in Myriad. I'll add a JIRA ticket; hopefully we can get
>>> an update in for 0.2.0. I'll also respond to this thread after a pull
>>> request is submitted, in case you'd like to test it.
>>>
>>> Darin
>>>
>>> Hi all,
>>>
>>> I have trouble starting the NM on the slave nodes. Apparently, it does
>>> not find its configuration, or something is wrong with the
>>> configuration.
>>>
>>> With cgroups enabled, the NM does not start; the logs contain the
>>> following, indicating that something is wrong in the configuration.
>>> However, yarn.nodemanager.linux-container-executor.group is set (to
>>> "yarn"). The value used to be
>>> "${yarn.nodemanager.linux-container-executor.group}", as indicated by
>>> the installation documentation, but I'm uncertain whether this
>>> recursion is the correct approach.
>>>
>>> ==================================================
>>> 16/03/14 09:32:45 FATAL nodemanager.NodeManager: Error starting NodeManager
>>> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to initialize container executor
>>>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:213)
>>>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>>>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:474)
>>>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:521)
>>> Caused by: java.io.IOException: Linux container executor not configured properly (error=24)
>>>         at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:193)
>>>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:211)
>>>         ... 3 more
>>> Caused by: ExitCodeException exitCode=24: Can't get configured value for yarn.nodemanager.linux-container-executor.group.
>>>
>>>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:543)
>>>         at org.apache.hadoop.util.Shell.run(Shell.java:460)
>>>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:720)
>>>         at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:187)
>>>         ... 4 more
>>> ==================================================
>>>
>>> I have given it another try with cgroups disabled (in
>>> myriad-config-default.yml). I seem to get a little further, but am
>>> still stuck at running YARN jobs:
>>>
>>> ==================================================
>>> 16/03/14 10:56:34 INFO container.Container: Container container_1457949199710_0001_01_000001 transitioned from LOCALIZED to RUNNING
>>> 16/03/14 10:56:34 INFO nodemanager.DefaultContainerExecutor: launchContainer: [bash, /var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/bjoernh/appcache/application_1457949199710_0001/container_1457949199710_0001_01_000001/default_container_executor.sh]
>>> 16/03/14 10:56:34 WARN nodemanager.DefaultContainerExecutor: Exit code from container container_1457949199710_0001_01_000001 is : 1
>>> 16/03/14 10:56:34 WARN nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1457949199710_0001_01_000001 and exit code: 1
>>> ExitCodeException exitCode=1:
>>>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:543)
>>>         at org.apache.hadoop.util.Shell.run(Shell.java:460)
>>>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:720)
>>>         at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:210)
>>>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>>>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>>>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>         at java.lang.Thread.run(Thread.java:745)
>>> 16/03/14 10:56:34 INFO nodemanager.ContainerExecutor: Exception from container-launch.
>>> 16/03/14 10:56:34 INFO nodemanager.ContainerExecutor: Container id: container_1457949199710_0001_01_000001
>>> 16/03/14 10:56:34 INFO nodemanager.ContainerExecutor: Exit code: 1
>>> ==================================================
>>>
>>> Unfortunately, the directory
>>> /var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/bjoernh/appcache/
>>> is empty; the log indicates that it is deleted after the failed
>>> attempt.
>>>
>>> Again, any hint would be useful, also regarding the activation of
>>> cgroups.
>>>
>>> Best regards,
>>> Björn
>>>
>>> --
>>> Dipl.-Inform. Björn Hagemeier
>>> Federated Systems and Data
>>> Juelich Supercomputing Centre
>>> Institute for Advanced Simulation
>>>
>>> Phone: +49 2461 61 1584
>>> Fax  : +49 2461 61 6656
>>> Email: b.hageme...@fz-juelich.de
>>> Skype: bhagemeier
>>> WWW  : http://www.fz-juelich.de/jsc
>>>
>>> JSC is the coordinator of the
>>> John von Neumann Institute for Computing
>>> and member of the
>>> Gauss Centre for Supercomputing
>>>
>>> -------------------------------------------------------------------------------------
>>> Forschungszentrum Juelich GmbH
>>> 52425 Juelich
>>> Sitz der Gesellschaft: Juelich
>>> Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
>>> Vorsitzender des Aufsichtsrats: MinDir Dr. Karl Eugen Huthmacher
>>> Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
>>> Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
>>> Prof. Dr. Sebastian M. Schmidt
>>> -------------------------------------------------------------------------------------
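PS, on the second case quoted above, where the container exits with code 1 and the launch directory is already gone before it can be inspected: YARN has a standard knob to delay that cleanup for debugging. A sketch for the node manager's yarn-site.xml (the 600-second value is just an example, not a recommendation):

```xml
<!-- Sketch: keep finished containers' local dirs (including
     default_container_executor.sh and its logs) around for 10 minutes
     after the application finishes, so a failed launch can be inspected.
     The value is an example; remember to unset it afterwards, since
     retained dirs consume disk. -->
<property>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>600</value>
</property>
```

With that in place, the contents of the appcache directory from the log should survive long enough to read the launch script and its stderr.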