Thanks for the confirmation Joe!

On 20 July 2016 at 12:19, Joe Topjian <[email protected]> wrote:
> Hi Blair,
>
> We only updated qemu. We're running the version of libvirt from the Kilo
> cloudarchive.
>
> We've been in production with our K80s for around two weeks now and have had
> several users report success.
>
> Thanks,
> Joe
>
> On Tue, Jul 19, 2016 at 5:06 PM, Blair Bethwaite <[email protected]>
> wrote:
>>
>> Hilariously (or not!) we finally hit the same issue last week once
>> folks actually started trying to do something (other than build and
>> load drivers) with the K80s we're passing through. This
>>
>> https://devtalk.nvidia.com/default/topic/850833/pci-passthrough-kvm-for-cuda-usage/
>> is the best discussion of the issue I've found so far, haven't tracked
>> down an actual bug yet though. I wonder whether it has something to do
>> with the memory size of the device, as we've been happy for a long
>> time with other NVIDIA GPUs (GRID K1, K2, M2070, ...).
>>
>> Jon, when you grabbed Mitaka Qemu, did you also update libvirt? We're
>> just working through this and have tried upgrading both but are
>> hitting some issues with Nova and Neutron on the compute nodes,
>> thinking it may be libvirt related but debug isn't helping much yet.
>>
>> Cheers,
>>
>> On 8 July 2016 at 00:54, Jonathan Proulx <[email protected]> wrote:
>> > On Thu, Jul 07, 2016 at 11:13:29AM +1000, Blair Bethwaite wrote:
>> > :Jon,
>> > :
>> > :Awesome, thanks for sharing. We've just run into an issue with SRIOV
>> > :VF passthrough that sounds like it might be the same problem (device
>> > :disappearing after a reboot), but haven't yet investigated deeply -
>> > :this will help with somewhere to start!
>> >
>> > :By the way, the nouveau mention was because we had missed it on some
>> > :K80 hypervisors recently and seen passthrough apparently work, but
>> > :then the NVIDIA drivers would not build in the guest as they claimed
>> > :they could not find a supported device (despite the GPU being visible
>> > :on the PCI bus).
>> >
>> > Definitely sage advice!
>> >
>> > :I have also heard passing mention of requiring qemu
>> > :2.3+ but don't have any specific details of the related issue.
>> >
>> > I didn't do a bisection but with qemu 2.2 (from ubuntu cloudarchive
>> > kilo) I was sad and with 2.5 (from ubuntu cloudarchive mitaka but
>> > installed on a kilo hypervisor) I am working.
>> >
>> > Thanks,
>> > -Jon
>> >
>> >
>> > :Cheers,
>> > :
>> > :On 7 July 2016 at 08:13, Jonathan Proulx <[email protected]> wrote:
>> > :> On Wed, Jul 06, 2016 at 12:32:26PM -0400, Jonathan D. Proulx wrote:
>> > :> :
>> > :> :I do have an odd remaining issue where I can run cuda jobs in the vm
>> > :> :but snapshots fail and after pause (for snapshotting) the pci device
>> > :> :can't be reattached (which is where i think it deletes the snapshot
>> > :> :it took). Got same issue with 3.16 and 4.4 kernels.
>> > :> :
>> > :> :Not very well categorized yet, but I'm hoping it's because the VM I
>> > :> :was hacking on had its libvirt.xml written out with the older qemu
>> > :> :maybe? It had been through a couple reboots of the physical system
>> > :> :though.
>> > :> :
>> > :> :Currently building a fresh instance and bashing more keys...
>> > :>
>> > :> After an ugly bout of bashing I've solved my failing snapshot issue
>> > :> which I'll post here in hopes of saving someone else
>> > :>
>> > :> Short version:
>> > :>
>> > :> add "/dev/vfio/vfio rw," to /etc/apparmor.d/abstractions/libvirt-qemu
>> > :> add "ulimit -l unlimited" to /etc/init/libvirt-bin.conf
>> > :>
>> > :> Longer version:
>> > :>
>> > :> What was happening.
>> > :>
>> > :> * send snapshot request
>> > :> * instance pauses while snapshot is pending
>> > :> * instance attempts to resume
>> > :> * fails to reattach pci device
>> > :> * nova-compute.log
>> > :>     Exception during message handling: internal error: unable to execute QEMU command 'device_add': Device initialization failed
>> > :>
>> > :> * qemu/<id>.log
>> > :>     vfio: failed to open /dev/vfio/vfio: Permission denied
>> > :>     vfio: failed to setup container for group 48
>> > :>     vfio: failed to get group 48
>> > :> * snapshot disappears
>> > :> * instance resumes but without passed through device (hard reboot
>> > :> reattaches)
>> > :>
>> > :> seeing permission denied I thought would be an easy fix but:
>> > :>
>> > :> # ls -l /dev/vfio/vfio
>> > :> crw-rw-rw- 1 root root 10, 196 Jul 6 14:05 /dev/vfio/vfio
>> > :>
>> > :> so I'm guessing I'm in apparmor hell, I try adding "/dev/vfio/vfio
>> > :> rw," to /etc/apparmor.d/abstractions/libvirt-qemu rebooting the
>> > :> hypervisor and trying again which gets me a different libvirt error
>> > :> set:
>> > :>
>> > :>     VFIO_MAP_DMA: -12
>> > :>     vfio_dma_map(0x5633a5fa69b0, 0x0, 0xa0000, 0x7f4e7be00000) = -12 (Cannot allocate memory)
>> > :>
>> > :> kern.log (and thus dmesg) showing:
>> > :>     vfio_pin_pages: RLIMIT_MEMLOCK (65536) exceeded
>> > :>
>> > :> Getting rid of this one required inserting 'ulimit -l unlimited' into
>> > :> /etc/init/libvirt-bin.conf in the 'script' section:
>> > :>
>> > :> <previous bits excluded>
>> > :> script
>> > :>     [ -r /etc/default/libvirt-bin ] && . /etc/default/libvirt-bin
>> > :>     ulimit -l unlimited
>> > :>     exec /usr/sbin/libvirtd $libvirtd_opts
>> > :> end script
>> > :>
>> > :>
>> > :> -Jon
>> > :>
>> > :> _______________________________________________
>> > :> OpenStack-operators mailing list
>> > :> [email protected]
>> > :> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>> > :
>> > :
>> > :
>> > :--
>> > :Cheers,
>> > :~Blairo
>> >
>> > --
>>
>>
>>
>> --
>> Cheers,
>> ~Blairo
>>
>> _______________________________________________
>> OpenStack-operators mailing list
>> [email protected]
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
>
--
Cheers,
~Blairo
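For anyone triaging the same symptoms, the two log signatures Jon quotes map onto the two fixes in his "short version". The following is only a rough sketch of that mapping in Python. The function name `diagnose` and the regular expressions are illustrative assumptions, not part of libvirt, qemu or nova, and the file paths are simply those given in the thread for his Ubuntu hypervisors.

```python
import re

# Hypothetical helper (not part of any OpenStack/libvirt tooling): matches the
# two error lines quoted in the thread and returns the fix Jon reported.
PERM_RE = re.compile(r"vfio: failed to open (\S+): Permission denied")
MEMLOCK_RE = re.compile(r"vfio_pin_pages: RLIMIT_MEMLOCK \((\d+)\) exceeded")


def diagnose(line):
    """Return a suggested fix for a known vfio log line, or None."""
    m = PERM_RE.search(line)
    if m:
        # Device node permissions looked fine in the thread, so AppArmor
        # was the thing denying access.
        return f'add "{m.group(1)} rw," to /etc/apparmor.d/abstractions/libvirt-qemu'
    m = MEMLOCK_RE.search(line)
    if m:
        # The number in parentheses is the limit the kernel reported.
        return (
            f"RLIMIT_MEMLOCK value {m.group(1)} exceeded; add "
            "'ulimit -l unlimited' to /etc/init/libvirt-bin.conf"
        )
    return None


if __name__ == "__main__":
    for sample in (
        "vfio: failed to open /dev/vfio/vfio: Permission denied",
        "vfio_pin_pages: RLIMIT_MEMLOCK (65536) exceeded",
    ):
        print(diagnose(sample))
```

Feeding it lines from qemu/<id>.log and kern.log should be enough to tell which of the two problems a given hypervisor is hitting. Anything it does not recognise returns None.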
