On 09/17/2014 07:25 PM, Paolo Bonzini wrote:
> On 17/09/2014 11:06, Stefan Hajnoczi wrote:
>> I think the fundamental problem here is that the mirror block job
>> on the source host does not synchronize with live migration.
>>
>> Remember the mirror block job iterates on the dirty bitmap
>> whenever it feels like.
>>
>> There is no guarantee that the mirror block job has quiesced before
>> migration handover takes place, right?
>
> Libvirt does that.  Migration is started only once storage mirroring
> is out of the bulk phase, and the handover looks like:
>
> 1) migration completes
>
> 2) because the source VM is stopped, the disk has quiesced on the source
>
> 3) libvirt sends block-job-complete
>
> 4) libvirt receives BLOCK_JOB_COMPLETED.  The disk has now quiesced on
> the destination as well.
>
> 5) the VM is started on the destination
>
> 6) the NBD server is stopped on the destination and the source VM is quit.
>
> It is actually a feature that storage migration is completed
> asynchronously with respect to RAM migration.  The problem is that
> qcow2_invalidate_cache happens between (3) and (5), and it doesn't
> like the concurrent I/O received by the NBD server.
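To check that I read the ordering correctly, here is a toy model of the six steps quoted above. It is only a sketch and not libvirt or QEMU code: every name in it is made up, and it assumes that mirror writes can still reach the destination NBD server until BLOCK_JOB_COMPLETED is received in step (4).

```python
# Toy model of the quoted handover order. Not libvirt or QEMU code;
# every name below is made up for illustration.

HANDOVER_STEPS = {
    1: "migration completes",
    2: "source VM stopped, disk quiesced on the source",
    3: "libvirt sends block-job-complete",
    4: "libvirt receives BLOCK_JOB_COMPLETED, disk quiesced on the destination",
    5: "VM is started on the destination",
    6: "NBD server stopped on the destination, source VM quit",
}

DEST_QUIESCED_AT = 4       # assumption: no mirror writes arrive once step 4 is done
NBD_SERVER_STOPPED_AT = 6  # the destination NBD server goes away in step 6

# qcow2_invalidate_cache runs "between (3) and (5)",
# i.e. after step 3 is done or after step 4 is done.
INVALIDATE_POINTS = (3, 4)


def nbd_server_running(after_step):
    """True if the destination NBD server is still up once `after_step` is done."""
    return after_step < NBD_SERVER_STOPPED_AT


def mirror_writes_possible(after_step):
    """True if the mirror job may still push writes once `after_step` is done."""
    return after_step < DEST_QUIESCED_AT


def racy_points():
    """Points where cache invalidation could overlap with incoming NBD writes."""
    return [
        s for s in INVALIDATE_POINTS
        if nbd_server_running(s) and mirror_writes_possible(s)
    ]


if __name__ == "__main__":
    for s in racy_points():
        print(f"possible overlap after step {s}: {HANDOVER_STEPS[s]}")
```

Under that assumption the only risky point is the gap after (3) and before (4), while the NBD server is still accepting writes from the mirror job.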
How can it happen at all? I thought there were 2 channels/sockets, one
for live migration and one for NBD, and that they run concurrently, no?

By the way, is there any better idea for a hack to try? Testers are
pushing me: they want to upgrade the broken setup and I am blocking them :)

Thanks!

--
Alexey