Hi Jon,

We were also running into issues with the K80s.
For our GPU nodes, we've gone with a 4.2 or 4.4 kernel; PCI passthrough works much better in those releases. (I ran into odd issues with 4.4 and NFS, downgraded to 4.2 after a few hours of banging my head, and the problems went away. Not a scientific solution. :)

After that, make sure vfio is loaded:

$ lsmod | grep vfio

Then start with the "deviceQuery" CUDA sample. We've found deviceQuery to be a great check of whether the instance has full, correct access to the card: if deviceQuery prints a report within 1-2 seconds, all is well; if there is a lag, something is off.

In our case with the K80s, that final "something" was qemu. We came across this [1] wiki page (search for K80) and started digging into qemu. tl;dr: upgrading to the qemu packages found in the Ubuntu Mitaka cloud archive solved our issues.

Hope that helps,
Joe

1: https://pve.proxmox.com/wiki/Pci_passthrough

On Wed, Jul 6, 2016 at 7:27 AM, Jonathan D. Proulx <[email protected]> wrote:
> Hi All,
>
> Trying to pass through some Nvidia K80 GPUs to some instances, and I've
> gotten to the point where Nova seems to be doing the right thing: GPU
> instances are scheduled on the one GPU hypervisor I have, and inside the
> VM I see:
>
> root@gpu-x1:~# lspci | grep -i k80
> 00:06.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
>
> And I can install the nvidia-361 driver and get:
>
> # ls /dev/nvidia*
> /dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia-uvm-tools
>
> But once I load up cuda-7.5 and build the examples, none of them run,
> claiming there's no CUDA device:
>
> # ./matrixMul
> [Matrix Multiply Using CUDA] - Starting...
> cudaGetDevice returned error no CUDA-capable device is detected (code 38),
> line(396)
> cudaGetDeviceProperties returned error no CUDA-capable device is detected
> (code 38), line(409)
> MatrixA(160,160), MatrixB(320,160)
> cudaMalloc d_A returned error no CUDA-capable device is detected (code
> 38), line(164)
>
> I'm not really familiar with CUDA, but I did get some example code
> running on the physical system for burn-in over the weekend (since
> reinstalled, so there's no nvidia driver on the hypervisor now).
>
> Following various online examples for setting up passthrough, I set
> the kernel boot line on the hypervisor to:
>
> # cat /proc/cmdline
> BOOT_IMAGE=/boot/vmlinuz-3.13.0-87-generic
> root=UUID=d9bc9159-fedf-475b-b379-f65490c71860 ro console=tty0
> console=ttyS1,115200 intel_iommu=on iommu=pt rd.modules-load=vfio-pci
> nosplash nomodeset intel_iommu=on iommu=pt rd.modules-load=vfio-pci
> nomdmonddf nomdmonisw
>
> I'm puzzled that I apparently have the device but it is apparently
> nonfunctional. Where do I even look from here?
>
> -Jon
>
>
> _______________________________________________
> OpenStack-operators mailing list
> [email protected]
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
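P.S. One more host-side check worth doing before digging into qemu: with working passthrough, `lspci -nnk` on the hypervisor should show each K80 function with "Kernel driver in use: vfio-pci", not nouveau or nvidia. A minimal sketch of extracting that field from a saved capture (the PCI address and sample output below are illustrative, not from Jon's host):

```shell
# Sample `lspci -nnk` output for one K80 function on the hypervisor
# (illustrative; run `lspci -nnk | grep -A3 -i k80` on a real host).
lspci_out='04:00.0 3D controller [0302]: NVIDIA Corporation GK210GL [Tesla K80] [10de:102d] (rev a1)
	Kernel driver in use: vfio-pci
	Kernel modules: nvidiafb, nouveau'

# Pull out the bound driver; anything other than vfio-pci here means
# the host still owns the card and the guest will see a dead device.
driver=$(printf '%s\n' "$lspci_out" | awk -F': ' '/Kernel driver in use/ {print $2}')
echo "$driver"
```

If that prints nouveau or nvidia, blacklist the host driver (or bind the device to vfio-pci explicitly) before booting the instance.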
