Hey guys,

I am currently setting up a bare-metal, single-node Kubernetes cluster plus 
JupyterHub, to have better control over resources for our users. I use Helm to 
set up JupyterHub with a custom singleuser-notebook image for deep learning.

The idea is to set up the hub to have better control over NVIDIA GPUs on 
the server.

I am struggling with a few things that I can't figure out how to do, or 
whether they are even possible:

1. I mount the user's home directory on the host into the notebook user's 
home (in our case /home/dbvis/) via the Helm chart values.yaml:

    extraVolumes:
        - name: home
          hostPath:
            path: /home/{username}
    extraVolumeMounts:
        - name: home
          mountPath: /home/dbvis/data

It is indeed mounted like this, but with root:root ownership, and I can't 
add/remove/change anything inside the container at /home/dbvis/data. What I 
tried:

- I tried to change the ownership by running 'chown -R dbvis:dbvis 
/home/dbvis/' as the root user at the end of the Dockerfile.
- I tried the following postStart hook in values.yaml:

    lifecycleHooks:
      postStart:
        exec:
          command: ["chown", "-R", "dbvis:dbvis", "/home/dbvis/data"]

Neither worked. As the storage class I have set up Rook with rook-ceph-block 
storage.
Any ideas?
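
To make the symptom concrete, this is roughly how it shows up from a notebook 
cell inside the singleuser container (just a sketch; the comments show what I 
would expect given the root:root ownership):

    # Run from a notebook cell in the singleuser container (a sketch, not a fix):
    # the mount shows up owned by root, and the notebook user cannot write to it.
    import os

    st = os.stat('/home/dbvis/data')
    print(st.st_uid, st.st_gid)                    # 0 0 -> root:root
    print(os.access('/home/dbvis/data', os.W_OK))  # False -> no write permission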


2. We have several NVIDIA GPUs and I would like to control them and set 
limits for the Jupyter singleuser notebooks. I set up the NVIDIA device plugin 
( https://github.com/NVIDIA/k8s-device-plugin ).
When I run 'kubectl describe node' I see the GPU listed as a resource:

Allocatable:
 cpu:                16
 ephemeral-storage:  189274027310
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             98770548Ki
 nvidia.com/gpu:     1
 pods:               110
...
...
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource        Requests     Limits
  --------        --------     ------
  cpu             2250m (14%)  4100m (25%)
  memory          2238Mi (2%)  11146362880 (11%)
  nvidia.com/gpu  0            0
Events:           <none>

Inside the Jupyter singleuser notebooks I can see the GPU when executing 
'nvidia-smi'.
But if I run, e.g., the following TensorFlow code to list the visible devices:

from tensorflow.python.client import device_lib

device_lib.list_local_devices()

I just get the CPU device:

[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 232115754901553261]


Any idea what I am doing wrong? 
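
In case it helps to narrow things down, this is the extra check I can run to 
tell apart "TensorFlow cannot see the device" from "TensorFlow was built 
without CUDA support" (a sketch using the TF 1.x API; whether my image ships 
the GPU-enabled tensorflow build is exactly what it would verify):

    # Sketch (TF 1.x API): check whether the installed TensorFlow build has
    # CUDA support at all, and whether it can actually use a GPU right now.
    import tensorflow as tf

    print(tf.VERSION)                    # installed TensorFlow version
    print(tf.test.is_built_with_cuda())  # False -> CPU-only 'tensorflow' package
    print(tf.test.is_gpu_available())    # True only if a GPU is usable from this process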

Further, I would like to limit the number of GPUs (this is just a test 
environment with one GPU; we have more). I tried the following, which doesn't 
seem to have any effect (a quick check for this is sketched below the list):

- Adding the following config to values.yaml, in every combination I could 
think of:

  extraConfig: |
     c.Spawner.notebook_dir = '/home/dbvis'
     c.Spawner.extra_resource_limits = {'nvidia.com/gpu': '0'}
     c.Spawner.extra_resource_guarantees = {'nvidia.com/gpu': '0'}
     c.Spawner.args = ['--device=/dev/nvidiactl', '--device=/dev/nvidia-uvm',
                       '--device=/dev/nvidia-uvm-tools', '/dev/nvidia0']

- Adding the GPU to the resources in the singleuser configuration in 
values.yaml:

singleuser:
  image:
    name: benne4444/dbvis-singleuser
    tag: test3
  nvidia.com/gpu:
    limit: 1
    guarantee: 1

Is what I am trying even possible right now?
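
Either way, here is the quick check I can run from inside a spawned notebook 
to see whether any of these settings actually reach the pod (a sketch; I 
believe the devices are exposed via the nvidia container runtime, which reads 
NVIDIA_VISIBLE_DEVICES):

    # Sketch: inspect which GPUs the container runtime exposed to this pod.
    # NVIDIA_VISIBLE_DEVICES is, as far as I know, set by the NVIDIA device
    # plugin / nvidia container runtime; 'all' vs. a device index vs. unset
    # is what I want to compare between the attempts above.
    import os

    print(os.environ.get('NVIDIA_VISIBLE_DEVICES'))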

Further information:

The server runs:

- Ubuntu 18.04.1 LTS
- nvidia-docker
- JupyterHub Helm chart version 0.8-ea0cf9a

I attached the complete values.yaml.

If you need additional information, please let me know. Any help is greatly 
appreciated.

Thank you,
Benedikt



