[ https://issues.apache.org/jira/browse/AURORA-1763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15468200#comment-15468200 ]
Joshua Cohen commented on AURORA-1763:
--------------------------------------

Yes, for the reasons Jie mentions, setting rootfs is not an option for Thermos. Another option would be to configure each Mesos agent host with a {{/usr/local/nvidia}} directory and then configure the {{--global_container_mounts}} flag on the Scheduler to point to that path. Thermos will then mount that directory into each task.

> GPU drivers are missing when using a Docker image
> -------------------------------------------------
>
> Key: AURORA-1763
> URL: https://issues.apache.org/jira/browse/AURORA-1763
> Project: Aurora
> Issue Type: Bug
> Components: Executor
> Affects Versions: 0.16.0
> Reporter: Justin Pinkul
>
> When launching a GPU job that uses a Docker image and the unified
> containerizer, the Nvidia drivers are not correctly mounted. As an experiment,
> I launched a task with both mesos-execute and Aurora, using the same Docker
> image, and ran nvidia-smi. During the experiment I noticed that the
> /usr/local/nvidia folder was not being mounted properly. To confirm this was
> the issue, I tar'ed the drivers up (/run/mesos/isolators/gpu/nvidia_352.39)
> and manually added them to the Docker image. When this was done, the task was
> able to launch correctly.
> Here is the resulting mountinfo for the mesos-execute task. Notice how
> /usr/local/nvidia is mounted from the /mesos directory.
> {noformat}
> 140 102 8:17 /mesos_work/provisioner/containers/11c497a2-a300-4c9e-a474-79aad1f28f11/backends/copy/rootfses/8ee046a6-bacb-42ff-b039-2cabda5d0e62 / rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 141 140 8:17 /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/67025326-9dfd-4cbb-a008-454a40bce2f5-0009/executors/gpu-test/runs/11c497a2-a300-4c9e-a474-79aad1f28f11 /mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 142 140 0:15 /mesos/isolators/gpu/nvidia_352.39 /usr/local/nvidia rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
> 143 140 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
> 144 143 0:3 /sys /proc/sys ro,relatime - proc proc rw
> 145 140 0:14 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
> 146 140 0:38 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755
> 147 146 0:39 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts rw,mode=600,ptmxmode=666
> 148 146 0:40 / /dev/shm rw,nosuid,nodev - tmpfs tmpfs rw
> {noformat}
> Here is the mountinfo when using Aurora. Notice how /usr/local/nvidia is
> missing.
> {noformat}
> 72 71 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 rw,errors=remount-ro,data=ordered
> 73 72 0:5 / /dev rw,relatime master:2 - devtmpfs udev rw,size=10240k,nr_inodes=16521649,mode=755
> 74 73 0:11 / /dev/pts rw,nosuid,noexec,relatime master:3 - devpts devpts rw,gid=5,mode=620,ptmxmode=000
> 75 73 0:17 / /dev/shm rw,nosuid,nodev master:4 - tmpfs tmpfs rw
> 76 73 0:13 / /dev/mqueue rw,relatime master:21 - mqueue mqueue rw
> 77 73 0:30 / /dev/hugepages rw,relatime master:23 - hugetlbfs hugetlbfs rw
> 78 72 0:15 / /run rw,nosuid,relatime master:5 - tmpfs tmpfs rw,size=26438160k,mode=755
> 79 78 0:18 / /run/lock rw,nosuid,nodev,noexec,relatime master:6 - tmpfs tmpfs rw,size=5120k
> 80 78 0:32 / /run/rpc_pipefs rw,relatime master:25 - rpc_pipefs rpc_pipefs rw
> 82 72 0:14 / /sys rw,nosuid,nodev,noexec,relatime master:7 - sysfs sysfs rw
> 83 82 0:16 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime master:8 - securityfs securityfs rw
> 84 82 0:19 / /sys/fs/cgroup ro,nosuid,nodev,noexec master:9 - tmpfs tmpfs ro,mode=755
> 85 84 0:20 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:10 - cgroup cgroup rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
> 86 84 0:22 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime master:13 - cgroup cgroup rw,cpuset
> 87 84 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime master:14 - cgroup cgroup rw,cpu,cpuacct
> 88 84 0:24 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime master:15 - cgroup cgroup rw,devices
> 89 84 0:25 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime master:16 - cgroup cgroup rw,freezer
> 90 84 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime master:17 - cgroup cgroup rw,net_cls,net_prio
> 91 84 0:27 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:18 - cgroup cgroup rw,blkio
> 92 84 0:28 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime master:19 - cgroup cgroup rw,perf_event
> 93 82 0:21 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime master:11 - pstore pstore rw
> 94 82 0:6 / /sys/kernel/debug rw,relatime master:22 - debugfs debugfs rw
> 95 72 0:3 / /proc rw,nosuid,nodev,noexec,relatime master:12 - proc proc rw
> 96 95 0:29 / /proc/sys/fs/binfmt_misc rw,relatime master:20 - autofs systemd-1 rw,fd=22,pgrp=1,timeout=300,minproto=5,maxproto=5,direct
> 97 96 0:34 / /proc/sys/fs/binfmt_misc rw,relatime master:27 - binfmt_misc binfmt_misc rw
> 98 72 8:17 / /mnt/01 rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 99 98 8:17 /mesos_work/provisioner/containers/3790dd16-d1e2-4974-ba21-095a029b8c7d/backends/copy/rootfses/7ce26962-10a7-40ec-843b-c76e7e29c88d /mnt/01/mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/13e02526-f2b7-4677-bb23-0faeeac65be9-0000/executors/thermos-root-devel-gpu_test-0-beeb742b-28c1-46f3-b49f-23443b6efcc2/runs/3790dd16-d1e2-4974-ba21-095a029b8c7d/taskfs rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 100 99 8:17 /mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/13e02526-f2b7-4677-bb23-0faeeac65be9-0000/executors/thermos-root-devel-gpu_test-0-beeb742b-28c1-46f3-b49f-23443b6efcc2/runs/3790dd16-d1e2-4974-ba21-095a029b8c7d/sandbox /mnt/01/mesos_work/slaves/67025326-9dfd-4cbb-a008-454a40bce2f5-S1/frameworks/13e02526-f2b7-4677-bb23-0faeeac65be9-0000/executors/thermos-root-devel-gpu_test-0-beeb742b-28c1-46f3-b49f-23443b6efcc2/runs/3790dd16-d1e2-4974-ba21-095a029b8c7d/taskfs/mnt/mesos/sandbox rw,relatime master:24 - ext4 /dev/sdb1 rw,errors=remount-ro,data=ordered
> 67 78 0:33 / /run/user/1001 rw,nosuid,nodev,relatime master:26 - tmpfs tmpfs rw,size=13219080k,mode=700,uid=1001,gid=1001
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
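
The mountinfo comparison in the issue description was done by eye. It could also be scripted. What follows is only a minimal sketch, not part of Aurora, Thermos, or Mesos: the function names {{mount_points}} and {{has_mount}} are hypothetical, and it assumes the standard Linux mountinfo layout, in which the fifth whitespace-separated field of each entry is the mount point.

```python
def mount_points(mountinfo_text):
    """Return the set of mount points found in mountinfo-formatted text.

    Assumes the Linux mountinfo layout: mount ID, parent ID, major:minor,
    root, mount point, ... so the mount point is the fifth field.
    """
    points = set()
    for line in mountinfo_text.splitlines():
        fields = line.split()
        # Skip blank or truncated lines that cannot hold a mount point.
        if len(fields) >= 5:
            points.add(fields[4])
    return points


def has_mount(mountinfo_text, path="/usr/local/nvidia"):
    """Report whether `path` appears as a mount point in the given text."""
    return path in mount_points(mountinfo_text)
```

Inside a task, the text would typically come from reading /proc/self/mountinfo. Given the two listings above, {{has_mount}} would be expected to return True for the mesos-execute output and False for the Aurora output.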