On Fri, 19 Jun 2009, Derek J. Balling wrote:

> On Jun 19, 2009, at 4:43 PM, [email protected] wrote:
>> my big problem with live migration (especially as a disaster recovery 
>> 'solution') is that if the running machine dies it's too late to do a live 
>> migration. If the application is important enough to need failover and 
>> disaster recovery I need it to be able to survive a system just 
>> disappearing, and so I need it to be able to recover on the new machine 
>> without having the old machine available to migrate from, and if I have 
>> that anyway, why not use that instead of live migration?
>
> That's also a feature in the new version of ESX, what they call "constant 
> availability", where the state of the VM is maintained on two different ESX 
> hosts simultaneously. If the "live" one fails, the "standby" unit takes over. 
> If the standby unit fails, a different unit in the resource pool takes over 
> as "standby" and assumes the responsibility of being "available".

I claim that it's not possible for two servers to have the same state, 
or at least not with acceptable performance.

the problem is that the bandwidth available to transfer state between 
machines is a trickle compared to the bandwidth available inside a machine 
to change the state.

even if you limit your definition of 'state' to the contents of the 
logical disks, you can't have real-time replication of disk contents 
between machines at anything close to the same speed that you can make 
changes to the local disk.

and when applications believe the vmware kool-aid that this replication 
will save them, and so don't bother to write and fsync their data to 
disk but instead keep it in memory, the local bandwidth available is 
another several orders of magnitude faster than the interconnect 
bandwidth.
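to put rough numbers on that gap, here's a quick python sketch. every 
figure in it is an illustrative assumption, not a measurement of any 
real system:

```python
# rough numbers for the bandwidth gap described above -- all of these
# figures are assumed round numbers, not measurements of any system
mem_bw_gb_s = 100.0    # assumed local memory write bandwidth, GB/s
disk_bw_gb_s = 7.0     # assumed local NVMe write bandwidth, GB/s
net_bw_gb_s = 1.25     # a 10 Gb/s interconnect is 1.25 GB/s

print(f"memory churn vs interconnect: {mem_bw_gb_s / net_bw_gb_s:.0f}x")
print(f"disk churn vs interconnect:   {disk_bw_gb_s / net_bw_gb_s:.1f}x")
```

even with generous network numbers, a host can dirty state one to two 
orders of magnitude faster than it can ship that state to a peer.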

now, for most applications, the amount of data that _needs_ to be 
replicated for the backup box to be able to pick up processing is a tiny 
fraction of this, and can easily be handled by network (including WAN) 
bandwidth. but only the application can know which tiny fraction of all 
the changes it makes to memory or disk actually needs to be replicated; 
the OS/VM level can't know this and can only try to replicate everything.
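a sketch of that point (the function, names, and sizes here are all 
hypothetical, just to show the shape of application-level replication):

```python
import json

def handle_order(order, replicate):
    # a request may churn through megabytes of caches, buffers,
    # and indexes... (the bytearray stands in for that scratch work)
    scratch = bytearray(4 * 1024 * 1024)   # ~4 MB of state touched
    scratch[0] = 1
    # ...but the durable fact the standby actually needs is just
    # the small commit record the application chooses to send:
    record = json.dumps({"order_id": order["id"], "total": order["total"]})
    replicate(record.encode())             # tens of bytes over the WAN
    return len(scratch), len(record)

sent = []
churn, shipped = handle_order({"id": 42, "total": 9.99}, sent.append)
print(churn, shipped)   # megabytes churned vs. bytes replicated
```

a VM-level replicator sees only the 4 MB of dirtied pages; the 
application knows that a few dozen bytes are the part that matters.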

these bandwidth limits mean that even 'live migration' doesn't mean zero 
outage; at some point you need to pause the VM on one machine to copy the 
last of the changes to the new machine and start it up. vmware takes 
advantage of the fact that most memory/disk is not normally changed, so it 
copies everything, then goes back and copies everything that has changed 
in the meantime, repeating until it decides that it's not making 
sufficient progress, at which point it must pause the app to move it. This 
normally makes the outage small enough that most people don't notice it, 
but it's not a case of 'not a moment of outage' as another poster in this 
thread claimed.
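the shape of that iterative pre-copy loop, as a toy simulation. this is 
not vmware's actual algorithm, and the page counts and rates are made-up 
assumptions; it just shows why a final pause is unavoidable:

```python
def precopy(total_pages, dirty_rate, copy_rate,
            stop_threshold=1000, max_rounds=30):
    """Simulate iterative pre-copy. Returns (rounds, pages left to
    copy during the final pause). Rates are pages per second."""
    remaining = total_pages
    for rounds in range(1, max_rounds + 1):
        seconds = remaining / copy_rate          # time to copy this round
        dirtied = int(seconds * dirty_rate)      # pages dirtied meanwhile
        # stop when the leftover set is small, or when we stop gaining
        if dirtied <= stop_threshold or dirtied >= remaining:
            return rounds, dirtied
        remaining = dirtied
    return max_rounds, remaining

# a mostly-idle VM: 1M pages, dirties 10k pages/s, copies 100k pages/s
rounds, leftover = precopy(1_000_000, 10_000, 100_000)
print(rounds, leftover, leftover / 100_000)  # pause is leftover/copy_rate
```

with a quiet VM the loop converges in a few rounds and the pause is 
milliseconds; crank the dirty rate toward the copy rate and the loop 
never converges, so the forced pause grows.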

>> I can see live migration as being handy for maintenance and planned 
>> changes, but it's not _that_ hard to plan to do the failover at off-peak 
>> times when a few seconds of outage aren't a problem.
>
> It's all about "what can your environment handle". For some environments, a 
> couple seconds of outage is fine. For others, that's completely not 
> acceptable. You have to plan your budget dollars in inverse proportion to 
> the amount of downtime you're willing to accept. The less downtime, the more 
> it costs. :-)

and at some point simple bandwidth and latency (speed of light) limits 
mean that you can't eliminate all downtime.
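the latency floor is easy to compute. a sketch, assuming the usual 
~2/3 c propagation speed in fiber and an example distance:

```python
# speed-of-light floor on a synchronous replication round trip;
# the distance is an example, and 0.67 is a typical fiber factor
C_KM_PER_MS = 300.0     # light in vacuum, km per millisecond
FIBER_FACTOR = 0.67     # signal in fiber travels at roughly 2/3 c

def min_rtt_ms(km):
    """Best-case round-trip time over fiber, ignoring all equipment."""
    return 2 * km / (C_KM_PER_MS * FIBER_FACTOR)

print(f"NY-LA (~4000 km): {min_rtt_ms(4000):.1f} ms round trip")
```

~40 ms per synchronous acknowledgment across a continent, before any 
switch, router, or disk gets involved; no budget makes that zero.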

David Lang

> Cheers,
> D
>
>
_______________________________________________
Discuss mailing list
[email protected]
http://lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
 http://lopsa.org/
