Hi, we have a two-node NFS server setup. Each node has a RAID 6 array behind an Adaptec hardware controller. DRBD synchronizes the block device between the nodes, and an NFS server runs on top of it.
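For context, our setup corresponds roughly to a DRBD resource like the following (hostnames, IP addresses, and the resource name are illustrative, not our exact configuration; the backing disk is the RAID 6 volume):

```
resource r0 {
  protocol C;                     # synchronous replication between the two nodes
  disk {
    on-io-error detach;           # on local I/O errors, go diskless instead of blocking
  }
  on nodeA {
    device    /dev/drbd1;
    disk      /dev/sdb1;          # RAID 6 volume from the Adaptec controller
    address   10.0.0.1:7789;
    meta-disk internal;
  }
  on nodeB {
    device    /dev/drbd1;
    disk      /dev/sdb1;
    address   10.0.0.2:7789;
    meta-disk internal;
  }
}
```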
Today the RAID controller on the master failed to rebuild after one hard disk had crashed, and the device /dev/sdb1 became unavailable temporarily. I assume this is the case because of the following messages:

Mar 20 04:01:58 laplace kernel: [1786373.892141] sd 0:0:1:0: [sdb] Very big device. Trying to use READ CAPACITY(16).
Mar 20 04:05:47 laplace kernel: [1786602.053040] block drbd1: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> Outdated )

The cluster then detected the failure and tried to promote the slave and demote the master. This failed because LVM timed out while being stopped on the master. I assume it tried to write something to the DRBD device and failed, resulting in the timeout.

So my question is: what are we doing wrong? And how can we prevent the failure of the whole cluster in such a situation?

Thanks
Christoph
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
