On 03/20/2012 10:41 AM, Christoph Bartoschek wrote:
> Hi,
>
> we have a two-node NFS server setup. Each node has a RAID 6 with an
> Adaptec hardware controller. DRBD synchronizes the block device, and an
> NFS server runs on top of it.
>
> Today the RAID controller on the master failed to rebuild after one
> hard disk had crashed, and the device /dev/sdb1 became unavailable
> temporarily. I assume this is the case because of the following messages:
>
> Mar 20 04:01:58 laplace kernel: [1786373.892141] sd 0:0:1:0: [sdb] Very
> big device. Trying to use READ CAPACITY(16).
> Mar 20 04:05:47 laplace kernel: [1786602.053040] block drbd1: peer(
> Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate ->
> Outdated )
>
> The cluster then detected the failure and tried to promote the slave
> and demote the master. This failed because LVM timed out while being
> stopped on the master. I assume it tried to write something to the DRBD
> device and failed, resulting in the timeout.
>
> So my question is: what are we doing wrong? And how can we prevent the
> failure of the whole cluster in such a situation?
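For reference, one way to keep a failing backing device from stalling the whole cluster is to let DRBD detach from the broken local disk and fall back to the peer, combined with resource-level fencing. A rough sketch of the relevant drbd.conf sections (assumptions: DRBD 8.3.x with the shipped Pacemaker fence scripts; the resource name "r0" and the device paths are illustrative, not taken from the poster's setup):

```
resource r0 {
  disk {
    # On a lower-level I/O error (e.g. the RAID controller dropping
    # /dev/sdb1), detach the backing disk and continue diskless,
    # serving all I/O through the peer instead of blocking.
    on-io-error      detach;
    # Refuse to run without a confirmed-outdated peer.
    fencing          resource-and-stonith;
  }
  handlers {
    # Fence/unfence the peer via Pacemaker location constraints.
    fence-peer         "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
  device /dev/drbd1;
  disk   /dev/sdb1;
}
```

Whether this applies here depends on the actual configuration, which is why the reply below asks to see it.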
Please share your drbd and cluster configuration ... two lines from the
log are not really enough to make suggestions based on facts.

Best Regards,
Andreas

--
Need help with Pacemaker?
http://www.hastexo.com/now

> Thanks
> Christoph
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
