On Fri, 19 Jun 2009, Derek J. Balling wrote:

> On Jun 19, 2009, at 4:43 PM, [email protected] wrote:
>> my big problem with live migration (especially as a disaster recovery
>> 'solution') is that if the running machine dies it's too late to do a live
>> migration. If the application is important enough to need failover and
>> disaster recovery I need it to be able to survive a system just
>> disappearing, and so I need it to be able to recover on the new machine
>> without having the old machine available to migrate from, and if I have
>> that anyway, why not use that instead of live migration?
>
> That's also a feature in the new version of ESX, what they call "constant
> availability", where the state of the VM is maintained on two different ESX
> hosts simultaneously. If the "live" one fails, the "standby" unit takes over.
> If the standby unit fails, a different unit in the resource pool takes over
> as "standby" and assumes the responsibility of being "available".
I claim that it's not possible for two servers to have the same state, or at least not with acceptable performance. The problem is that the bandwidth available to transfer state between machines is a trickle compared to the bandwidth available inside a machine to change that state. Even if you limit your definition of 'state' to the contents of the logical disks, you can't have real-time replication of disk contents between machines at anything close to the speed at which you can change the local disk. And when applications believe the vmware kool-aid that this replication will save them, and so don't bother to write and fsync their data to disk but instead keep it in memory, the local bandwidth available is another several orders of magnitude faster than the interconnect bandwidth.

Now, for most applications, the amount of data that _needs_ to be replicated for the backup box to be able to pick up processing is a tiny fraction of this, and can easily be handled by network (including WAN) bandwidth. But only the application can know which tiny fraction of all the changes it makes to memory or disk actually needs to be replicated; the OS/VM level can't know this and can only try to replicate everything.

These bandwidth limits mean that even 'live migration' doesn't mean zero outage; at some point you need to pause the VM on one machine to copy the last of the changes to the new machine and start it up. vmware takes advantage of the fact that most memory/disk is not normally changed, so it copies everything, then goes back and copies everything that has changed in the meantime, repeating until it decides it's not making sufficient progress, at which point it must pause the app to finish the move. This normally makes the outage small enough that most people don't notice it, but it's not a case of 'not a moment of outage' as another poster in this thread claimed.
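The copy-everything-then-copy-the-dirty-parts loop described above can be sketched as a toy model. All the numbers here (guest size, link speed, dirty rate, the stopping thresholds) are made-up assumptions for illustration, not anything from this thread or from vmware's actual implementation:

```python
def precopy_migration(total_gb, link_gb_per_s, dirty_gb_per_s,
                      max_rounds=30, good_enough_gb=0.1):
    """Simulate iterative pre-copy rounds.

    Each round copies everything dirtied so far; while that copy is in
    flight, the guest keeps dirtying memory. When the remainder stops
    shrinking (or is small enough), the VM is paused and the last chunk
    is copied -- that final copy is the unavoidable outage window.
    Returns (rounds completed, final pause in seconds).
    """
    to_copy = total_gb
    for rounds in range(1, max_rounds + 1):
        copy_time = to_copy / link_gb_per_s
        # The dirty set can't exceed the guest's total memory.
        newly_dirty = min(dirty_gb_per_s * copy_time, total_gb)
        if newly_dirty >= to_copy or newly_dirty <= good_enough_gb:
            return rounds, newly_dirty / link_gb_per_s
        to_copy = newly_dirty
    return max_rounds, to_copy / link_gb_per_s

# A 16 GB guest over a 10 Gbit/s link (~1.25 GB/s), dirtying 0.2 GB/s:
rounds, pause = precopy_migration(total_gb=16, link_gb_per_s=1.25,
                                  dirty_gb_per_s=0.2)
print(rounds, round(pause, 3))

# Same guest, but dirtying memory faster than the link can drain it:
rounds, pause = precopy_migration(total_gb=16, link_gb_per_s=1.25,
                                  dirty_gb_per_s=2.0)
print(rounds, round(pause, 3))
```

When the workload dirties memory slower than the link can carry it, the remainder shrinks geometrically and the final pause is a few tens of milliseconds, small enough that most people don't notice. Push the dirty rate above the link bandwidth and the loop never converges, so the "pause" degenerates into a full stop-and-copy of the whole guest.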
>> I can see live migration as being handy for maintenance and planned
>> changes, but it's not _that_ hard to plan to do the failover at off-peak
>> times when a few seconds of outage aren't a problem.
>
> It's all about "what can your environment handle". For some environments, a
> couple seconds of outage is fine. For others, that's completely not
> acceptable. You have to plan your budget dollars in indirect proportion to
> the amount of downtime you're willing to accept. The less downtime, the more
> it costs. :-)

And at some point simple bandwidth and latency (speed of light) limits mean that you can't eliminate all downtime.

David Lang

> Cheers,
> D
>
> _______________________________________________
> Discuss mailing list
> [email protected]
> http://lopsa.org/cgi-bin/mailman/listinfo/discuss
> This list provided by the League of Professional System Administrators
> http://lopsa.org/
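To make the speed-of-light point concrete, here's a back-of-envelope calculation. The distance (roughly New York to London) and the signal speed in fiber (about two-thirds of c) are rough figures of my own, just to show the scale:

```python
# Latency floor for synchronous cross-site replication, assuming a
# ~5,600 km path and light in fiber at roughly 200,000 km/s (~0.66c).
distance_km = 5600
fiber_speed_km_s = 200_000

one_way_ms = distance_km / fiber_speed_km_s * 1000
round_trip_ms = 2 * one_way_ms

print(round(one_way_ms, 1), round(round_trip_ms, 1))
```

Every write that must be acknowledged by the remote site before it counts as durable pays at least that round trip, on the order of 50 ms across an ocean, and no budget makes physics go faster. That's the floor under any "zero downtime, zero data loss" promise.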
