All,
I have successfully performed more than 1000 back-to-back RDMA
migrations, looped automatically, while running a heavy-weight
memory-stress benchmark here at IBM.
Migration success is verified by capturing the actual serial console
output of the virtual machine while the benchmark is running,
redirecting each migration's output to a file, and checking that it
matches the expected output of a successful migration. For half of the
1000 migrations I used a 14GB virtual machine (the largest VM I can
create); for the remaining 500 migrations I used a 2GB virtual machine
(to make sure I was testing both 32-bit and 64-bit address boundaries).
The benchmark is configured for 75% stores and 25% loads and uses 80%
of the allocatable free memory of the VM (i.e. no swapping allowed).
I have defined a successful migration, per the output file, as follows:
1. The memory benchmark is still running and active (CPU near 100% and
memory usage is high).
2. There are no kernel panics in the console output (regex keywords
"panic", "BUG", "oom", etc.).
3. The VM is still responding to network activity (pings).
4. The console is still responsive: a process inside the VM prints
periodic messages to the console throughout the life of the VM using
the 'write' command in an infinite loop.
With this method running in a loop, I believe I've ironed out all the
regression-testing bugs that I can find. You all may find the following
bugs interesting. The original version of this patch was written in 2010
(before my time @ IBM).
Bug #1: In the original 2010 patch, every write operation used the same
"identifier" (a "work request ID" in InfiniBand terminology). This is
not typical (but allowed by the hardware); instead, each operation
should have its own unique identifier so that the write can be tracked
properly as it completes.
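For illustration, here is a minimal libibverbs sketch of the idea (my
own illustrative code, not the patch itself; the counter and helper
names are made up). Each RDMA write gets a fresh wr_id, which is the
value the hardware hands back in the completion entry:

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

static uint64_t next_wr_id = 1;   /* hypothetical per-connection counter */

/* Post a single RDMA write with a unique work request ID. */
static int post_rdma_write(struct ibv_qp *qp, struct ibv_sge *sge,
                           uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = next_wr_id++;      /* unique ID per operation */
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.sg_list             = sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    /* The same wr_id comes back in the completion (struct ibv_wc.wr_id),
     * which is what allows each write to be tracked as it completes. */
    return ibv_post_send(qp, &wr, &bad_wr);
}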
Bug #2: Also in the original 2010 patch, write operations were grouped
into separate "signaled" and "unsignaled" work requests, which is also
not typical (but allowed by the hardware). "Signaling" is InfiniBand
terminology for enabling or disabling notification to the sender that an
RDMA operation has completed. (Note: the receiver is never notified,
which is what a DMA is supposed to be.) In normal operation, per the
InfiniBand specifications, "unsignaled" operations (which tell the
hardware *not* to notify the sender of completion) are *supposed* to be
paired with a signaled operation using the *same* work request
identifier. Instead, the original patch was using *different* work
requests for signaled and unsignaled writes, which means that most of
the writes were transmitted without ever being tracked for completion at
all. (Per the InfiniBand specifications, signaled and unsignaled writes
must be grouped together because the hardware ensures that the
completion notification is not given until *all* of the writes of the
same request have actually completed.)
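As a rough sketch of that pairing (again illustrative, not the actual
patch; it assumes the QP was created with sq_sig_all = 0 so unsignaled
sends really are unsignaled): a chunk of writes is posted as one chained
batch in which only the final work request is signaled, so the single
completion for that final write confirms the whole batch.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

#define MAX_BATCH 16   /* illustrative batch limit */

/* Post 'n' RDMA writes as one chain; only the last one is signaled. */
static int post_write_batch(struct ibv_qp *qp, struct ibv_cq *cq,
                            struct ibv_sge *sges, uint64_t *remote_addrs,
                            int n, uint32_t rkey, uint64_t batch_wr_id)
{
    struct ibv_send_wr wrs[MAX_BATCH], *bad_wr = NULL;
    struct ibv_wc wc;
    int i, ne;

    if (n < 1 || n > MAX_BATCH) {
        return -1;
    }

    memset(wrs, 0, sizeof(wrs));
    for (i = 0; i < n; i++) {
        wrs[i].wr_id               = batch_wr_id;   /* shared identifier */
        wrs[i].opcode              = IBV_WR_RDMA_WRITE;
        wrs[i].sg_list             = &sges[i];
        wrs[i].num_sge             = 1;
        wrs[i].wr.rdma.remote_addr = remote_addrs[i];
        wrs[i].wr.rdma.rkey        = rkey;
        wrs[i].next                = (i + 1 < n) ? &wrs[i + 1] : NULL;
        /* Only the final write in the chain generates a completion. */
        wrs[i].send_flags          = (i + 1 < n) ? 0 : IBV_SEND_SIGNALED;
    }

    if (ibv_post_send(qp, &wrs[0], &bad_wr)) {
        return -1;
    }

    /* Busy-poll until the signaled completion for this batch arrives; on a
     * reliable connection it is not delivered until the earlier writes in
     * the chain have completed as well. */
    do {
        ne = ibv_poll_cq(cq, 1, &wc);
    } while (ne == 0);

    return (ne < 0 || wc.status != IBV_WC_SUCCESS ||
            wc.wr_id != batch_wr_id) ? -1 : 0;
}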
Bug #3: Finally, in the original 2010 patch, ordering was not being
handled. Per the InfiniBand specifications, writes can complete entirely
out of order. Not only that, but PCI Express itself can reorder the
writes as well. It was only after the first two bugs were fixed that I
could actually manifest this bug *in code*: what was happening was that
a very large group of requests would "burst" from the QEMU migration
thread, and not all of those requests would finish. A short time later,
the next iteration would start while the virtual machine's writable
working set was still "hovering" somewhere in the same vicinity of the
address space as the previous burst of writes that had not yet
completed. The new writes in that case were much smaller (not part of a
larger "chunk" per our algorithms), so they would complete faster than
the larger, older writes covering the same address range. Because they
complete out of order, the newer writes would then get clobbered by the
older writes, resulting in an inconsistent virtual machine. To solve
this, during each new write we now do a "search" to see whether the
address of the requested write matches or overlaps the address range of
any of the previous "outstanding" writes still in transit, and I found
several hits. This was easily solved by blocking until the conflicting
write has completed before issuing the new write to the hardware.
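A rough sketch of that search (hypothetical data structures and names,
not the code as merged): before posting a new write, scan the
outstanding writes for an overlapping guest-address range, and if there
is a conflict, drain completions until the conflicting write retires.

#include <infiniband/verbs.h>
#include <stdbool.h>
#include <stdint.h>

struct outstanding_write {     /* hypothetical tracking record */
    uint64_t wr_id;
    uint64_t addr;
    uint64_t len;
    bool     in_flight;
};

static bool ranges_overlap(uint64_t a, uint64_t alen,
                           uint64_t b, uint64_t blen)
{
    return a < b + blen && b < a + alen;
}

/* Block (by draining completions) until every in-flight write overlapping
 * [addr, addr + len) has completed; only then is the new write posted. */
static int wait_for_conflicting_writes(struct ibv_cq *cq,
                                       struct outstanding_write *ws, int nws,
                                       uint64_t addr, uint64_t len)
{
    struct ibv_wc wc;
    int i, j, ne;

    for (i = 0; i < nws; i++) {
        if (!ws[i].in_flight ||
            !ranges_overlap(ws[i].addr, ws[i].len, addr, len)) {
            continue;
        }
        /* Conflict: consume completions until this older write retires. */
        while (ws[i].in_flight) {
            ne = ibv_poll_cq(cq, 1, &wc);
            if (ne < 0 || (ne > 0 && wc.status != IBV_WC_SUCCESS)) {
                return -1;
            }
            for (j = 0; ne > 0 && j < nws; j++) {
                if (ws[j].wr_id == wc.wr_id) {
                    ws[j].in_flight = false;
                }
            }
        }
    }
    return 0;
}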
- Michael
On 05/09/2013 06:45 PM, Michael R. Hines wrote:
Some more followup questions below to help me debug before I start
digging in.......
On 05/09/2013 06:20 PM, Chegu Vinod wrote:
Setting aside the mlock() freezes for the moment, let's first fix your
crashing problem on the destination side. Let's make that a priority
before we fix the mlock problem.
When the migration "completes", can you provide me with more detailed
information
about the state of QEMU on the destination?
Is it responding?
What's on the VNC console?
Is QEMU responding?
Is the network responding?
Was the VM idle? Or running an application?
Can you attach GDB to QEMU after the migration?
/usr/local/bin/qemu-system-x86_64 \
-enable-kvm \
-cpu host \
-name vm1 \
-m 131072 -smp 10,sockets=1,cores=10,threads=1 \
-mem-path /dev/hugepages \
Can you disable hugepages and re-test?
I'll get back to the other mlock() issues later after we at least
first make sure the migration itself is working.....