All,
I have successfully performed more than 1000 back-to-back RDMA
migrations, looped automatically, while running a heavy-weight
memory-stress benchmark here at IBM.
Migration success is verified by capturing the actual serial console
output of the virtual machine while the benchmark is running,
redirecting each migration's output to a file, and checking that it
matches the expected output of a successful migration. For half of the
1000 migrations I used a 14GB virtual machine (the largest VM I can
create); for the remaining 500 migrations I used a 2GB virtual machine
(to make sure I was testing both 32-bit and 64-bit address boundaries).
The benchmark is configured for 75% stores and 25% loads and uses 80%
of the allocatable free memory of the VM (i.e. no swapping allowed).
I have defined a successful migration, per the output file, as follows:
1. The memory benchmark is still running and active (CPU near 100% and
memory usage is high).
2. There are no kernel panics in the console output (regex keywords
"panic", "BUG", "oom", etc.).
3. The VM is still responding to network activity (pings).
4. The console is still responsive: a process inside the VM prints
periodic messages to the console throughout the life of the VM using
the 'write' command in an infinite loop.
With this method running in a loop, I believe I've ironed out all the
regression-testing bugs that I can find. You all may find the following
bugs interesting. The original version of this patch was written in 2010
(before my time @ IBM).
Bug #1: In the original 2010 patch, every write operation used the same
"identifier" (a "work request ID" in InfiniBand terminology). This is
not typical (but allowed by the hardware); instead, each operation
should have its own unique identifier so that the write can be tracked
properly as it completes.
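For illustration, here is a minimal libibverbs sketch of the idea (my
own illustrative code, not the patch itself; the counter and helper
names are made up). Each RDMA write gets a fresh wr_id, which is the
value the hardware hands back in the completion entry:

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

static uint64_t next_wr_id = 1;   /* hypothetical per-connection counter */

/* Post a single RDMA write with a unique work request ID. */
static int post_rdma_write(struct ibv_qp *qp, struct ibv_sge *sge,
                           uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = next_wr_id++;      /* unique ID per operation */
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.sg_list             = sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    /* The same wr_id comes back in the completion (struct ibv_wc.wr_id),
     * which is what allows each write to be tracked as it completes. */
    return ibv_post_send(qp, &wr, &bad_wr);
}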
Bug #2: Also in the original 2010 patch, write operations were grouped
into separate "signaled" and "unsignaled" work requests, which is also
not typical (but allowed by the hardware). "Signaling" is InfiniBand
terminology for enabling or disabling notification to the sender that an
RDMA operation has completed. (Note: the receiver is never notified,
which is what a DMA is supposed to be.) In normal operation, per the
InfiniBand specifications, "unsignaled" operations (which tell the
hardware *not* to notify the sender of completion) are *supposed* to be
paired with a signaled operation using the *same* work request
identifier. Instead, the original patch was using *different* work
requests for signaled and unsignaled writes, which means that most of
the writes were transmitted without ever being tracked for completion at
all. (Per the InfiniBand specifications, signaled and unsignaled writes
must be grouped together because the hardware ensures that the
completion notification is not given until *all* of the writes of the
same request have actually completed.)
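As a rough sketch of that pairing (again illustrative, not the actual
patch; it assumes the QP was created with sq_sig_all = 0 so unsignaled
sends really are unsignaled): a chunk of writes is posted as one chained
batch in which only the final work request is signaled, so the single
completion for that final write confirms the whole batch.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

#define MAX_BATCH 16   /* illustrative batch limit */

/* Post 'n' RDMA writes as one chain; only the last one is signaled. */
static int post_write_batch(struct ibv_qp *qp, struct ibv_cq *cq,
                            struct ibv_sge *sges, uint64_t *remote_addrs,
                            int n, uint32_t rkey, uint64_t batch_wr_id)
{
    struct ibv_send_wr wrs[MAX_BATCH], *bad_wr = NULL;
    struct ibv_wc wc;
    int i, ne;

    if (n < 1 || n > MAX_BATCH) {
        return -1;
    }

    memset(wrs, 0, sizeof(wrs));
    for (i = 0; i < n; i++) {
        wrs[i].wr_id               = batch_wr_id;   /* shared identifier */
        wrs[i].opcode              = IBV_WR_RDMA_WRITE;
        wrs[i].sg_list             = &sges[i];
        wrs[i].num_sge             = 1;
        wrs[i].wr.rdma.remote_addr = remote_addrs[i];
        wrs[i].wr.rdma.rkey        = rkey;
        wrs[i].next                = (i + 1 < n) ? &wrs[i + 1] : NULL;
        /* Only the final write in the chain generates a completion. */
        wrs[i].send_flags          = (i + 1 < n) ? 0 : IBV_SEND_SIGNALED;
    }

    if (ibv_post_send(qp, &wrs[0], &bad_wr)) {
        return -1;
    }

    /* Busy-poll until the signaled completion for this batch arrives; on a
     * reliable connection it is not delivered until the earlier writes in
     * the chain have completed as well. */
    do {
        ne = ibv_poll_cq(cq, 1, &wc);
    } while (ne == 0);

    return (ne < 0 || wc.status != IBV_WC_SUCCESS ||
            wc.wr_id != batch_wr_id) ? -1 : 0;
}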
Bug #3: Finally, in the original 2010 patch, ordering was not being
handled. Per the InfiniBand specifications, writes can complete entirely
out of order. Not only that, but PCI Express itself can reorder the
writes as well. It was only after the first two bugs were fixed that I
could actually manifest this bug *in code*: what was happening was that
a very large group of requests would "burst" from the QEMU migration
thread, and not all of those requests would finish. A short time later,
the next iteration would start while the virtual machine's writable
working set was still "hovering" somewhere in the same vicinity of the
address space as the previous burst of writes that had not yet
completed. The new writes in that case were much smaller (not part of a
larger "chunk" per our algorithms), so they would complete faster than
the larger, older writes covering the same address range. Because they
complete out of order, the newer writes would then get clobbered by the
older writes, resulting in an inconsistent virtual machine. To solve
this, during each new write we now do a "search" to see whether the
address of the requested write matches or overlaps the address range of
any of the previous "outstanding" writes still in transit, and I found
several hits. This was easily solved by blocking until the conflicting
write has completed before issuing the new write to the hardware.
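A rough sketch of that search (hypothetical data structures and names,
not the code as merged): before posting a new write, scan the
outstanding writes for an overlapping guest-address range, and if there
is a conflict, drain completions until the conflicting write retires.

#include <infiniband/verbs.h>
#include <stdbool.h>
#include <stdint.h>

struct outstanding_write {     /* hypothetical tracking record */
    uint64_t wr_id;
    uint64_t addr;
    uint64_t len;
    bool     in_flight;
};

static bool ranges_overlap(uint64_t a, uint64_t alen,
                           uint64_t b, uint64_t blen)
{
    return a < b + blen && b < a + alen;
}

/* Block (by draining completions) until every in-flight write overlapping
 * [addr, addr + len) has completed; only then is the new write posted. */
static int wait_for_conflicting_writes(struct ibv_cq *cq,
                                       struct outstanding_write *ws, int nws,
                                       uint64_t addr, uint64_t len)
{
    struct ibv_wc wc;
    int i, j, ne;

    for (i = 0; i < nws; i++) {
        if (!ws[i].in_flight ||
            !ranges_overlap(ws[i].addr, ws[i].len, addr, len)) {
            continue;
        }
        /* Conflict: consume completions until this older write retires. */
        while (ws[i].in_flight) {
            ne = ibv_poll_cq(cq, 1, &wc);
            if (ne < 0 || (ne > 0 && wc.status != IBV_WC_SUCCESS)) {
                return -1;
            }
            for (j = 0; ne > 0 && j < nws; j++) {
                if (ws[j].wr_id == wc.wr_id) {
                    ws[j].in_flight = false;
                }
            }
        }
    }
    return 0;
}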
- Michael
On 05/09/2013 06:45 PM, Michael R. Hines wrote:
Some more followup questions below to help me debug before I start
digging in.......
On 05/09/2013 06:20 PM, Chegu Vinod wrote:
Setting aside the mlock() freezes for the moment, let's first fix your
crashing problem on the destination side. Let's make that a priority
before we fix the mlock problem.
When the migration "completes", can you provide me with more detailed
information
about the state of QEMU on the destination?
Is it responding?
What's on the VNC console?
Is QEMU responding?
Is the network responding?
Was the VM idle? Or running an application?
Can you attach GDB to QEMU after the migration?
/usr/local/bin/qemu-system-x86_64 \
-enable-kvm \
-cpu host \
-name vm1 \
-m 131072 -smp 10,sockets=1,cores=10,threads=1 \
-mem-path /dev/hugepages \
Can you disable hugepages and re-test?
I'll get back to the other mlock() issues later after we at least
first make sure the migration itself is working.....