On Tue, Jul 11, 2017 at 11:00:47PM +1000, David Gibson wrote: > On Mon, Jul 10, 2017 at 05:37:31PM -0300, Daniel Henrique Barboza wrote: > > > > > > On 07/10/2017 03:39 AM, David Gibson wrote: > > > On Fri, Jul 07, 2017 at 06:20:37PM -0300, Daniel Henrique Barboza wrote: > > > > "spapr: Remove 'awaiting_allocation' DRC flag" removed the flag that > > > > was originally was being used to prevent a race condition between > > > > hot unplug and hotplug. The DRC code base got simplified and more > > > > robust over time, eliminating the conditions that led to this race. > > > > Thus the awaiting_allocation existence wasn't justifiable anymore. > > > > > > > > A side effect of the flag removal was seen when testing the Libvirt > > > > hotplug-migration-unplug scenario, where a device is hotplugged in both > > > > source and target using device_add prior to the migration, then the > > > > device is removed after migration in the target. Before that cleanup, > > > > the > > > > hot unplug at the target fails in both QEMU and guest kernel because > > > > the DRC state at the target is inconsistent. After removing that flag, > > > > the hot unplug works at QEMU but the guest kernel hungs on the middle > > > > of the unplug process. > > > > > > > > It turns out that the awaiting_allocation logic was preventing the hot > > > > unplug from happening at the target because the DRC state, at this > > > > specific > > > > hot unplug scenario, was matching the race condition the flag was > > > > originally designed to avoid. Removing the flag allowed the device > > > > to be removed from QEMU, leading to this new behavior. > > > > > > > > The root cause of those problems is, in fact, the inconsistent state of > > > > the > > > > target DRCs after migration is completed. Doing device_add in the > > > > INMIGRATE status leaves the DRC in a state that isn't recognized as a > > > > valid hotplugged device in the guest OS. > > > > > > > > This patch fixes the problem by using the recently modified 'drc_reset' > > > > function, that now forces the DRC to a known state by checking its > > > > device > > > > status, to reset all DRCs in the pre_load hook of the migration. > > > > Resetting > > > > the DRCs in pre_load allows the DRCs to be in a predictable state when > > > > we load the migration at the target, allowing for hot unplugs to work > > > > as expected. > > > > > > > > Signed-off-by: Daniel Henrique Barboza <[email protected]> > > > Ok, so the fact this works is pretty promising. However, I'm still > > > trying to fully understand what's going on here. I have a suspicion > > > that this is only necessary because something isn't quite right with > > > the reset / inmigrate sequencing in the generic code, which we should > > > fix instead of hacking around. > > > > Agreed. > > > > > > > > IIUC, in the problem case, on the source the hotplug has fully > > > completed, so the DRC will be in CONFIGURED state. Since the device > > > is CONFIGURED and attached, no DRC info is sent in the migration > > > stream. On the destination what seems to be happening is: > > > > > > 1. qemu is started with "-incoming defer", and cpu *not* present > > > > > > DRC is uninitialized > > > > > > 2. qemu_system_reset() is called in vl.c > > > > > > DRC is in UNALLOCATED / detached state > > > > > > 3. libvirt device_adds the cpu > > > > > > DRC is in UNALLOCATED / attached state > > > > > > 4. libvirt initiates incoming migration > > > > > > DRC remains in UNALLOCATED / attached state > > > > > > 5. Guest resumes on the destination > > > > > > DRC still in UNALLOCATED / attached state > > > > > > Which mismatches what we had on the source so => bug. > > > > > > BUT, AFAIK the libvirt coldplug case below *is* working. Which > > > tracing through the code I'd expect: > > > > > > 1. qemu is started with -S and cpu not present > > > > > > DRC is uninitialized > > > > > > 2. qemu_system_reset() is called in vl.c > > > > > > DRC is in UNALLOCATED / detached state > > > > > > 3. libvirt device_adds in prelaunch phase > > > > > > DRC is in UNALLOCATED / attached state > > > > > > 4. Guest is started > > > > > > DRC is in UNALLOCATED / attached state > > > > > > Which is also incorrect: the device was present when the guest > > > started, so it should be in CONFIGURED state. IIUC this case is > > > working, so I think it is must actually be in CONFIGURED state. > > > > Just did a test here and the device isn't present when the guest starts in > > the second > > example you mentioned, Tested with current qemu master. QEMU shows the > > extra > > CPU as 'halted' always, even after the guest starts and OS boots up: > > > > danielhb@louis:~/qemu/build/ppc64-softmmu$ sudo ./qemu-system-ppc64 -name > > migrate_qemu -boot strict=on --enable-kvm -device > > nec-usb-xhci,id=usb,bus=pci.0,addr=0xf -device > > spapr-vscsi,id=scsi0,reg=0x2000 -smp 1,maxcpus=4,sockets=4,cores=1,threads=1 > > --machine pseries,accel=kvm,usb=off,dump-guest-core=off -m > > 4G,slots=32,maxmem=32G -drive > > file=/home/danielhb/vm_imgs/ubuntu1704.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none > > -device > > virtio-blk-pci,scsi=off,bus=pci.0,addr=0x2,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 > > -nographic -S > > QEMU 2.9.50 monitor - type 'help' for more information > > > > <<<<< at this point qemu_system_reset is called, as expected >>>>> > > > > (qemu) device_add host-spapr-cpu-core,id=core1,core-id=1 > > (qemu) info cpus > > * CPU #0: nip=0x0000000000000100 thread_id=16523 > > CPU #1: nip=0x0000000000000000 (halted) thread_id=16598 > > (qemu) cont > > > > --- guest boots up ---- > > > > (qemu) info cpus > > * CPU #0: nip=0xc0000000000a3e0c thread_id=16523 > > CPU #1: nip=0x0000000000000000 (halted) thread_id=16598 > > > > danielhb@ubuntu1704:~$ lscpu > > Architecture: ppc64le > > Byte Order: Little Endian > > CPU(s): 1 > > On-line CPU(s) list: 0 > > Thread(s) per core: 1 > > Core(s) per socket: 1 > > Socket(s): 1 > > NUMA node(s): 1 > > Model: 2.1 (pvr 004b 0201) > > Model name: POWER8E (raw), altivec supported > > Hypervisor vendor: horizontal > > Virtualization type: full > > L1d cache: 64K > > L1i cache: 32K > > NUMA node0 CPU(s): 0 > > danielhb@ubuntu1704:~$ (qemu) > > (qemu) device_del core1 > > (qemu) info cpus > > * CPU #0: nip=0xc0000000000a3e0c thread_id=16523 > > CPU #1: nip=0x0000000000000000 (halted) thread_id=16598 > > > > danielhb@ubuntu1704:~$ lscpu > > Architecture: ppc64le > > Byte Order: Little Endian > > CPU(s): 1 > > On-line CPU(s) list: 0 > > Thread(s) per core: 1 > > Core(s) per socket: 1 > > Socket(s): 1 > > NUMA node(s): 1 > > Model: 2.1 (pvr 004b 0201) > > Model name: POWER8E (raw), altivec supported > > Hypervisor vendor: horizontal > > Virtualization type: full > > L1d cache: 64K > > L1i cache: 32K > > NUMA node0 CPU(s): 0 > > danielhb@ubuntu1704:~$ dmesg | tail -n 5 > > [ 6.307988] audit: type=1400 audit(1499705034.060:10): apparmor="STATUS" > > operation="profile_load" profile="unconfined" name="/usr/bin/lxc-start" > > pid=2212 comm="apparmor_parser" > > [ 6.318556] audit: type=1400 audit(1499705034.068:11): apparmor="STATUS" > > operation="profile_load" profile="unconfined" > > name="/usr/lib/snapd/snap-confine" pid=2213 comm="apparmor_parser" > > [ 7.087170] cgroup: new mount options do not match the existing > > superblock, will be ignored > > [ 88.093598] pseries-hotplug-cpu: Failed to acquire DRC, rc: -22, drc > > index: 10000008 > > [ 88.093606] pseries-hotplug-cpu: Cannot find CPU (drc index 10000008) to > > remove > > danielhb@ubuntu1704:~$ > > > > > > Debugging it a little I see that device_adding a CPU while the VM isn't > > started yet is being considered > > "hotplugged" by spapr_core_plug (dev->hotplugged is True). Also, there is a > > note in 'spapr_cpu_reset' > > saying: > > > > /* All CPUs start halted. CPU0 is unhalted from the machine level > > * reset code and the rest are explicitly started up by the guest > > * using an RTAS call */ > > cs->halted = 1; > > > > And yeah, the guest isn't calling 'start-cpu' and the CPU remains halted. > > When comparing to > > a scenario where I start the VM with 2 cpus in the command line, the first > > one is started by the > > machine reset and the other one by the RTAS call 'start-cpu', as expected > > I'll investigate why this > > is happening - starting with 2 coldplugged CPUs versus one coldplugged CPU > > and a second one > > attached with device_add with while on -S should yield the same outcome. > > > > > > All this said, I am not sure if this behavior has the same root cause as the > > migration problem > > this patch solves with the reset on pre_load though. Hopefully I'll know > > more in these next days. > > Ah! So it's broken for the prelaunch case as well, though in a > slightly different way. Actually for me the breakage is less obvious > - if I plug the cpu at prelaunch, I *do* get 2 cpus appearing in the > running system. But tracing through, that's because the hotplug > message was queued and gets processed during boot. That gets to the > right place in the end, but it's kind of silly going through the > hotplug logic. > > I thought there was a system reset after the prelaunch phase, but I > was mistaken. > > I can see two ways to address this: > 1) add in a DRC reset before starting up the machine, for both the > prelaunch and inmigrate cases. Your draft patch does the second, > but I don't see an obvious place to put a hook for the first > > 2) Change the plug (and unplug) paths to skip the notification and > gradual state change, and just immediately jump to the completed > state when called in the prelaunch or inmigrate. (Easiest way > would be just to call the drc reset function instead of queueing > an event). > > (2) is basically the approach Laurent proposed in a patch a little > while ago, defining an spapr_hotplugged() function that always > returned false in prelaunch or inmigrate states. > > At the time I was dubious about that approach, because I thought we > had a natural reset point after that. After more careful > investigation, I think that's not the case however, so I'm inclined to > go with approach (2), polish up Laurent's patch and apply that.
Uh.. wait, realised this approach is wrong for the non-migration
case. For the hotplug-during-prelaunch, it's not sufficient to just
reset the DRCs. For the device to be truly coldplugged - with the DRC
going straight to CONFIGURED state, it must also appear in the base
device tree, and that requires a full system reset. Well... or CAS,
which complicates matters again.
Ok, now I'm torn between options (1) and (2) again - we basically have
a patch for each approach (yours for 1, and Laurent's for 2).
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
signature.asc
Description: PGP signature
