On Mon, Apr 22, 2019 at 07:34:40PM -0700, Elliott Mitchell wrote:
> 
> This is several VMs on a hypervisor.  Most filesystems aren't shared, I'm
> mostly using MMP for protection against my doing something stupid due to
> typing a command with the wrong device.  The VMs have separate /, and
> /var.  Their storage isn't being presented to them as a single disk, but
> instead being divided into block devices for filesystems in the
> privileged area.
> 
> The privileged area understands ext4 and I'm mostly worried about
> mistakenly trying to mount a filesystem in the privileged area while a VM
> is active on it.  There is no fallover, mostly protection against
> fumble-fingers.  %-)

MMP was designed for the use case where there were two servers that
connected to the same disk, using something like SCSI or Fibre
Channel.  The idea was one server would be the primary, and the other
would be the backup, and both could simultaneously access the disk ---
but of course, it wouldn't be safe for the two computers to try to run
e2fsck or mount the file system at the same time.

This is why clearing MMP is so slow; the primary computer (the one
holding the MMP "lock") writes to the MMP block once every 5 seconds,
updating a sequence counter.  The secondary computer, if it is trying
to acquire the lock, has to read the MMP block, sleep for 11 seconds
(in case the primary computer's update gets delayed due to the disk
being super busy, etc.), read the MMP block again, and if the MMP
block hasn't changed, it knows it's safe to grab the MMP lock.
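The check-sleep-check dance can be sketched roughly like this (a simplified model, not the actual kernel code; read_seq and the simulated devices below are stand-ins for reading the sequence counter out of the on-disk MMP block):

```python
import time

MMP_CHECK_INTERVAL = 5                 # primary rewrites the MMP block every 5s
MMP_WAIT = 2 * MMP_CHECK_INTERVAL + 1  # 11s: two intervals plus slack

def try_acquire_mmp(read_seq, sleep=time.sleep):
    """Return True if the MMP lock appears free on this device.

    read_seq() models reading the sequence counter from the MMP block;
    a live primary bumps that counter every MMP_CHECK_INTERVAL seconds.
    """
    before = read_seq()
    sleep(MMP_WAIT)         # give a live-but-delayed primary time to update
    after = read_seq()
    return after == before  # unchanged counter => nobody is heartbeating

# Simulated devices: one with a live primary, one whose holder is gone.
live = iter([7, 8])   # counter moved: someone holds the lock
dead = iter([7, 7])   # counter frozen: safe to grab
assert try_acquire_mmp(lambda: next(live), sleep=lambda s: None) is False
assert try_acquire_mmp(lambda: next(dead), sleep=lambda s: None) is True
```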

It's actually a bit more complicated than that, since we also need to
deal with the situation where the primary computer is "dead" for say,
15 seconds, and then comes back to life.  This might happen if the
system is thrashing due to high memory pressure, and then the OOM
killer kills off some processes, and the system is restored.  So in
fact, the holder of the MMP lock has to query the MMP block before it
updates it, and if it finds the MMP lock has been stolen out from
under it, it's because it failed to update the MMP block, and so it
has to relinquish the lock, forcibly unmount the file system, and
assume the backup system has taken over.
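One heartbeat tick for the lock holder might look roughly like this (again a toy model, with read_seq/write_seq standing in for I/O against the MMP block):

```python
def heartbeat(read_seq, write_seq, my_seq):
    """One heartbeat tick for the current MMP lock holder.

    Before bumping the counter, re-read it: if it no longer matches what
    we last wrote, a peer stole the lock while we were "dead", and we
    must stop touching the filesystem instead of overwriting their update.
    Returns the new sequence number, or None if the lock was stolen.
    """
    if read_seq() != my_seq:
        return None        # stolen: force-unmount, assume the backup took over
    write_seq(my_seq + 1)  # still ours: bump the counter for this interval
    return my_seq + 1

# Simulated MMP block.
block = {"seq": 42}
read = lambda: block["seq"]
write = lambda v: block.__setitem__("seq", v)
assert heartbeat(read, write, 42) == 43   # normal tick
block["seq"] = 99                         # peer overwrote while we stalled
assert heartbeat(read, write, 43) is None # must relinquish
```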

The problem, though, is that even if you don't *expect* to do a
failover, two systems can still access the device simultaneously, so
you need most of this logic anyway.  Worse, if your use case includes
the hypervisor suspending a VM for an hour or two before resuming it,
MMP won't even protect you against the fumble-finger case.

The real answer is that this sort of lockout should be implemented at the
hypervisor layer, since it should know, from its control plane,
whether or not two VM's are trying to use a particular virtual disk.
For example, Google Compute Engine's Persistent Disk product enforces
this restriction by default.  If you are using a regional persistent
disk, you can set up a standby VM in a different GCE zone, and in case
of a zonal failure, you can use a force-attach command[1] to steal the PD
from the original VM and connect it to the backup VM.  Then if the GCE
zone comes back to life (or the network partition gets healed, etc.),
the GCE control plane will prevent the original VM from accessing the
persistent disk which was stolen away from it.  (Disclosure: I work
for Google.)

[1] 
https://cloud.google.com/compute/docs/disks/regional-persistent-disk#force_attach
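The control-plane lockout described above amounts to something like the
following sketch (entirely hypothetical class and method names, not any
real hypervisor's API; the point is just that attach is exclusive by
default and only force-attach can steal a disk):

```python
class DiskControlPlane:
    """Toy model of hypervisor-level exclusive disk attachment."""

    def __init__(self):
        self.attached = {}  # disk -> VM currently holding it

    def attach(self, disk, vm, force=False):
        holder = self.attached.get(disk)
        if holder is not None and holder != vm:
            if not force:
                # Default: refuse, preventing two VMs on one disk.
                raise RuntimeError(f"{disk} is already attached to {holder}")
            # force-attach: steal the disk; the control plane will
            # block the old holder from touching it afterward.
        self.attached[disk] = vm

cp = DiskControlPlane()
cp.attach("pd-1", "vm-a")
cp.attach("pd-1", "vm-b", force=True)  # zonal failure: steal for the backup
assert cp.attached["pd-1"] == "vm-b"
```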

MMP was designed for the use case where the shared SCSI or FC disk
didn't have this kind of control plane logic, but in fact, it's
actually *way* more efficient to implement this feature above the
layer of the guest VM.

Cheers,

                                        - Ted