Hi, we have a two-node NFS server setup. Each node has a RAID 6 array behind an Adaptec hardware controller. DRBD synchronizes the block device between the nodes, and an NFS server runs on top of it.
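For context, our setup corresponds roughly to a DRBD resource like the following (hostnames, IP addresses, and the resource name are illustrative, not our exact configuration; the backing disk is the RAID 6 volume):

```
resource r0 {
  protocol C;                     # synchronous replication between the two nodes
  disk {
    on-io-error detach;           # on local I/O errors, go diskless instead of blocking
  }
  on nodeA {
    device    /dev/drbd1;
    disk      /dev/sdb1;          # RAID 6 volume from the Adaptec controller
    address   10.0.0.1:7789;
    meta-disk internal;
  }
  on nodeB {
    device    /dev/drbd1;
    disk      /dev/sdb1;
    address   10.0.0.2:7789;
    meta-disk internal;
  }
}
```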
Today the RAID controller on the master failed to rebuild after one hard disk had crashed, and the device /dev/sdb1 became unavailable temporarily. I assume this is the case because of the following messages:

Mar 20 04:01:58 laplace kernel: [1786373.892141] sd 0:0:1:0: [sdb] Very big device. Trying to use READ CAPACITY(16).
Mar 20 04:05:47 laplace kernel: [1786602.053040] block drbd1: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> Outdated )

The cluster then detected the failure and tried to promote the slave and demote the master. This failed because LVM timed out while being stopped on the master. I assume it tried to write something to the DRBD device and failed, resulting in the timeout.

So my question is: what are we doing wrong? And how can we prevent the failure of the whole cluster in such a situation?

Thanks
Christoph
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
