On 03/20/2012 10:41 AM, Christoph Bartoschek wrote:
> Hi,
>
> we have a two-node NFS server setup. Each node has a RAID 6 with an
> Adaptec hardware controller. DRBD synchronizes the block device, and an
> NFS server runs on top of it.
>
> Today the RAID controller on the master failed to rebuild after one
> hard disk had crashed, and the device /dev/sdb1 became unavailable
> temporarily. I assume this is the case because of the following messages:
>
> Mar 20 04:01:58 laplace kernel: [1786373.892141] sd 0:0:1:0: [sdb] Very
> big device. Trying to use READ CAPACITY(16).
> Mar 20 04:05:47 laplace kernel: [1786602.053040] block drbd1: peer(
> Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate ->
> Outdated )
>
> The cluster then detected the failure and tried to promote the slave
> and demote the master. This failed because LVM timed out while being
> stopped on the master. I assume it tried to write something to the DRBD
> device and failed, resulting in the timeout.
>
> So my question is: what are we doing wrong? And how can we prevent the
> failure of the whole cluster in such a situation?
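For reference, one way to keep a failing backing device from stalling the whole cluster is to let DRBD detach from the broken local disk and fall back to the peer, combined with resource-level fencing. A rough sketch of the relevant drbd.conf sections (assumptions: DRBD 8.3.x with the shipped Pacemaker fence scripts; the resource name "r0" and the device paths are illustrative, not taken from the poster's setup):

```
resource r0 {
  disk {
    # On a lower-level I/O error (e.g. the RAID controller dropping
    # /dev/sdb1), detach the backing disk and continue diskless,
    # serving all I/O through the peer instead of blocking.
    on-io-error      detach;
    # Refuse to run without a confirmed-outdated peer.
    fencing          resource-and-stonith;
  }
  handlers {
    # Fence/unfence the peer via Pacemaker location constraints.
    fence-peer         "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
  device /dev/drbd1;
  disk   /dev/sdb1;
}
```

Whether this applies here depends on the actual configuration, which is why the reply below asks to see it.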
Please share your drbd and cluster configuration ... two lines from the
log are not really enough to make suggestions based on facts.

Best Regards,
Andreas

--
Need help with Pacemaker?
http://www.hastexo.com/now

> Thanks
> Christoph
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
