On 12/17/2013 03:52 PM, Sten Wolf wrote:
> I have 2 more questions:
>
> 1. Is dual-mgs supported with zfs? My issue seems to be mgs and mdt on
> same node, when mgs is configured for 2 nodes
> 2. Which is recommended? ldiskfs w/ 2x mdt, or zfs w/ single mdt?
>
> I assumed the LLNL Sequoia implementation used zfs w/ HA (dual mgs/dual
> mdt active/passive) but I might be wrong on that account.

Yes, the MDS/MGS is a single node using ZFS, but we do not employ 
failover for the MDS/MGS.

We have always found that Lustre software failures on the MDS/MGS are 
many, many times more common than hardware failures.  When the MDS/MGS 
crashes from a Lustre bug, we want to take the extra time to complete a 
full kernel crash dump so we have a chance of debugging the problem. 
Since we need to spend that extra time on the crash dump, there is little 
advantage to moving the service to a failover partner; we just allow the 
current node to reboot.

It is also important to understand that we do not yet have multi-mount 
protection (MMP) in ZFS, so you need to take great care with your HA 
solution.  You need an extremely reliable STONITH mechanism.  If your 
power control is unreliable, you can easily wind up with multiple nodes 
importing the same storage pool at the same time.  That would be very bad.
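Without MMP, the only software-level guard is to never force an import.  A 
minimal sketch of that discipline (this is illustrative, not our production 
tooling; the "may be in use" message match is an assumption about zpool's 
error text, and the pool name is hypothetical):

```python
#!/usr/bin/env python3
"""Sketch: attempt a plain `zpool import` and refuse to escalate to -f
when ZFS reports the pool as possibly active on another host."""
import subprocess


def safe_import(pool, run=subprocess.run):
    """Try `zpool import <pool>` with NO -f flag.

    Returns True on a successful import.  If zpool refuses because the
    pool looks active elsewhere (assumed error text: "may be in use"),
    raise instead of forcing -- forcing here is how you get two nodes
    writing to one pool.  `run` is injectable so the logic is testable
    without real ZFS hardware.
    """
    r = run(["zpool", "import", pool], capture_output=True, text=True)
    if r.returncode == 0:
        return True
    if "may be in use" in (r.stderr or ""):
        raise RuntimeError(
            f"refusing to import {pool}: pool may still be imported "
            "on the partner node; verify STONITH first")
    r.check_returncode()  # some other, unrelated failure
```

The point of the sketch is the policy, not the parsing: `-f` never appears, so a split-brain requires a human to type it deliberately.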

That said, we do employ failover for our OSS nodes.  Our power control 
for the OSS nodes was not as reliable as needed, so we added extra 
checks in our HA scripts to double-check whether STONITH really worked 
and retry the power-off command as necessary.
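The verify-and-retry pattern looks roughly like the following sketch.  The 
`power_off` and `confirmed_off` callables are stand-ins for whatever your 
fencing agent and PDU/BMC status query actually are (our real scripts are 
site-specific shell; this is just the control flow):

```python
#!/usr/bin/env python3
"""Sketch: issue STONITH, then independently confirm the node is dead
before allowing failover; retry the power-off if confirmation times out."""
import time


def fence_with_verification(power_off, confirmed_off,
                            attempts=3, wait=5.0):
    """Run `power_off()`, then poll `confirmed_off()` until it reports
    the node is really down or `wait` seconds elapse; repeat up to
    `attempts` times.  Returns True only on positive confirmation --
    the caller must never start the failover import on False."""
    for _ in range(attempts):
        power_off()
        deadline = time.monotonic() + wait
        while time.monotonic() < deadline:
            if confirmed_off():
                return True
            time.sleep(0.5)
    return False
```

The key design point is that success is defined by the independent status check, never by the exit code of the power-off command itself, since that is exactly the part that proved unreliable for us.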

We will probably reexamine our stance on MDS failover once DNE2 is 
complete and stable.  When there are multiple active MDS nodes, why not? 
Then again, unless the software becomes a great deal more stable, 
we'll still be dependent on those slow crash dumps that we would not 
want to interrupt with a STONITH.  It is also not particularly uncommon 
for a software bug to result in a continuous crash-reboot loop; having 
HA in that case would just spread the problem to the failover partner node.

For now we are sticking with simple.

Chris

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
