[Gluster-users] Just coming out of a nightmare scenario

Gerald Brandt Fri, 16 Nov 2012 10:23:55 -0800

Wow.  I have 2 replica servers that host VMs via GlusterNFS.  uCARP handles 
the IP failover if one system dies.


Both systems were running fine.  In fact, I was logged into the backup NFS 
server watching a large file get created on a RAID-6 array exported via Gluster.

NAS-1 - primary GlusterNFS server
NAS-2 - backup GlusterNFS server


Suddenly, all VMs stopped responding.  NAS-1 showed 400% CPU usage (4 cores at 
100%).  I waited about 30 seconds to see if things would come back to normal, 
but no go.  I shut down NAS-1 to let the failover take place and NAS-2 come 
online.

NAS-2 grabbed the IP address, but my Citrix XenServers were not reconnecting.  
Uh oh.  I rebooted NAS-1 to bring it back up, and it booted into initramfs.  Crap.

I kept monitoring NAS-2, trying to figure out what was going on.  Ten minutes 
later, I realized NAS-2 had lost 4 of the 6 drives in its RAID-6 array.  Double 
crap.  The GlusterNFS server kept returning errors, since the RAID device was 
really no longer there.

Did some Google searching, and ended up typing 'exit' at the initramfs prompt 
on NAS-1.  The system came up fine.  I killed NAS-2 so the IP would fail 
over/back.  XenServer reconnected, and all the VMs needed rebooting.  15 
minutes with no disk is too long.

I fired up NAS-2 to see what the heck was going on.  I lost a motherboard SATA 
controller.  The timing could not have been worse.  I moved the drives to a 
PCIe SATA card, booted the system, rebuilt the RAID, and voila, everything is 
back up and syncing.

Just goes to show, even tested failovers can fail.

Not a fun evening.

Gerald