> On Tue, Oct 15, 2013 at 03:26:19PM +0800, Jules Wang wrote:
> > v2 -> v3:
> > * add documentation of new option in qapi-schema.
> >
> > * long option name: ft -> fault-tolerant
> >
> > v1 -> v2:
> > * cmdline: migrate curling:tcp:<address>:<port>
> >   -> migrate -f tcp:<address>:<port>
> >
> > * sender: use QEMU_VM_FILE_MAGIC_FT as the header of the migration
> >   to indicate this is a ft migration.
> >
> > * receiver: look for the signature:
> >   QEMU_VM_EOF_MAGIC + QEMU_VM_FILE_MAGIC_FT(64bit total)
> >   which indicates the end of one migration.
> > --
> > Jules Wang (4):
> >   Curling: add doc
> >   Curling: cmdline interface.
> >   Curling: the sender
> >   Curling: the receiver
>
First of all, thanks for your superb and spot-on comments.

> It would be helpful to clarify the status of Curling in the cover letter
> email so reviewers know what to expect.

OK, but I'm not quite clear about how to clarify the status. Would you
please give me an example?

>
> This series does not address I/O or failover. I guess you are aware of
> the missing topics that I mentioned, here are my thoughts on them:
>
> I/O needs to be held back until the destination host has acknowledged
> receiving the last full migration state. The outside world cannot
> witness state changes in the guest until the migration state has been
> successfully transferred to the destination host. Otherwise the guest
> may appear to act incorrectly when resuming execution from the last
> snapshot.
>
> The time period used by the FT sender thread determines how much latency
> is added to I/O requests.

Yes, that latency is there, and it is inevitable.
I guess you mean the following situation: if a message 'hello' is sent to
a chat room server just a few seconds before the failover happens, the
message may be delivered to the other participants twice, or be lost.
Am I right?

>
> Failover functionality is missing from these patches. We cannot simply
> start executing on the destination host when the migration connection
> ends. If the guest disk image is located on shared storage then
> split-brain occurs when a network error terminates the migration
> connection -
> will both hosts begin accessing the shared disk?

YES

>

I have a simple way to handle that. In short: a third point, the gateway.

Both the sender and the receiver check their connectivity to the gateway
every X seconds. Let A and B denote whether the sender and the receiver,
respectively, are connected to the gateway.

When the connection between the sender and the receiver is down,
A && B is false.

If A is false, the VM instance at the sender will be stopped.
If B is false, the VM instance at the receiver will not be started.

a. A false, B false: 0 VMs run
b. A false, B true:  1 VM runs
c. A true,  B false: 1 VM runs
d. A true,  B true:  1 VM runs (normal case)

It becomes complicated when we consider the transitions between these
four states. I suggest adding this feature to libvirt instead of QEMU.

> What is your plan to address these issues?
>
> Stefan
>
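
For illustration only, the gateway rule and the four cases a-d above can be
written as a minimal Python sketch. Nothing here is taken from the patches:
the function names are hypothetical, and the sketch assumes, as the table
does, that case d (A and B both true) is the normal case in which the
sender-receiver link is still up, so the receiver stays idle.

```python
def sender_keeps_running(a: bool) -> bool:
    """Sender rule: stop the VM as soon as the gateway is unreachable (A false)."""
    return a


def receiver_starts(b: bool, link_up: bool) -> bool:
    """Receiver rule: start the VM only if the link to the sender is down
    and the gateway is still reachable (B true)."""
    return b and not link_up


def running_vms(a: bool, b: bool) -> int:
    """Count the VM instances running for one row of the a-d table.

    Assumption (from the email's premise): the link is down only in
    cases where A && B is false, so A && B is treated as "link up".
    """
    link_up = a and b
    return int(sender_keeps_running(a)) + int(receiver_starts(b, link_up))


if __name__ == "__main__":
    cases = (("a", False, False), ("b", False, True),
             ("c", True, False), ("d", True, True))
    for label, a, b in cases:
        print(f"{label}. A={a} B={b}: {running_vms(a, b)} VM(s) running")
```

The sketch only evaluates each state in isolation. It does not model the
transitions between the four states, which is the part described above as
complicated.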