Ubuntu 10.10)

Sunil Mushran Fri, 01 Apr 2011 12:02:39 -0700

I believe this is a Pacemaker issue. There was a time when it required a
qdisk to keep working as a single node in a two-node cluster after
one node died. If the Pacemaker people don't jump in, you may want to
try your luck on the linux-cluster mailing list.


On 04/01/2011 11:44 AM, Mike Reid wrote:

I am running a two-node web cluster on OCFS2 via DRBD Primary/Primary (v8.3.8)
and Pacemaker. Everything seems to be working great, except during testing of
hard-boot scenarios.

Whenever I hard-boot one of the nodes, the other node is successfully fenced and
marked "Outdated":

    * <resource minor="0" cs="WFConnection" ro1="Primary" ro2="*Unknown*" ds1="UpToDate" ds2="*Outdated*" />


However, this locks up I/O on the still active node and prevents any operations 
within the cluster :(
I have even forced DRBD into StandAlone mode while in this state, but that does 
not resolve the I/O lock either.

    * <resource minor="0" cs="*StandAlone*" ro1="*Primary*" ro2="Unknown" ds1="*UpToDate*" ds2="Outdated" />
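
For anyone who wants to script a check for this state, here is a minimal Python sketch that parses a status element like the two shown above. It assumes the asterisks in the pasted output are emphasis markers rather than part of the attribute values, and the helper names (parse_drbd_resource, is_isolated_primary) are hypothetical, not part of DRBD's tooling.

```python
import xml.etree.ElementTree as ET


def parse_drbd_resource(line):
    """Return the attributes of one <resource .../> status element as a dict."""
    # Assumption: '*' characters are emphasis markup from the post, so drop them.
    elem = ET.fromstring(line.replace("*", "").strip())
    return dict(elem.attrib)


def is_isolated_primary(state):
    """True if the local node is Primary and UpToDate but has no peer connection."""
    return (
        state.get("ro1") == "Primary"
        and state.get("ds1") == "UpToDate"
        and state.get("cs") in ("WFConnection", "StandAlone")
    )


status = ('<resource minor="0" cs="*StandAlone*" ro1="*Primary*" '
          'ro2="Unknown" ds1="*UpToDate*" ds2="Outdated" />')
state = parse_drbd_resource(status)
print(state["cs"], state["ro1"], state["ds1"], is_isolated_primary(state))
```

A check like this only reports what DRBD sees. It says nothing about whether OCFS2 or the cluster stack is the layer holding up I/O.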


The only way I've been able to regain I/O within the cluster is to
bring the other node back up. While monitoring the logs, it seems that it is
OCFS2 that's establishing the lock/unlock and /not/ DRBD at all.



    Apr  1 12:07:19 ubu10a kernel: [ 1352.739777] (ocfs2rec,3643,0):ocfs2_replay_journal:1605 Recovering node 1124116672 from slot 1 on device (147,0)
    Apr  1 12:07:19 ubu10a kernel: [ 1352.900874] (ocfs2rec,3643,0):ocfs2_begin_quota_recovery:407 Beginning quota recovery in slot 1
    Apr  1 12:07:19 ubu10a kernel: [ 1352.902509] (ocfs2_wq,1213,0):ocfs2_finish_quota_recovery:598 Finishing quota recovery in slot 1

    Apr  1 12:07:20 ubu10a kernel: [ 1354.423915] block drbd0: Handshake successful: Agreed network protocol version 94
    Apr  1 12:07:20 ubu10a kernel: [ 1354.433074] block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
    Apr  1 12:07:20 ubu10a kernel: [ 1354.433083] block drbd0: conn( WFConnection -> WFReportParams )
    Apr  1 12:07:20 ubu10a kernel: [ 1354.433097] block drbd0: Starting asender thread (from drbd0_receiver [2145])
    Apr  1 12:07:20 ubu10a kernel: [ 1354.433562] block drbd0: data-integrity-alg: <not-used>
    Apr  1 12:07:20 ubu10a kernel: [ 1354.434090] block drbd0: drbd_sync_handshake:
    Apr  1 12:07:20 ubu10a kernel: [ 1354.434094] block drbd0: self FBA98A2F89E05B83:EE17466F4DEC2F8B:6A4CD8FDD0562FA1:EC7831379B78B997 bits:4 flags:0
    Apr  1 12:07:20 ubu10a kernel: [ 1354.434097] block drbd0: peer EE17466F4DEC2F8A:0000000000000000:6A4CD8FDD0562FA0:EC7831379B78B997 bits:2048 flags:2
    Apr  1 12:07:20 ubu10a kernel: [ 1354.434099] block drbd0: uuid_compare()=1 by rule 70
    Apr  1 12:07:20 ubu10a kernel: [ 1354.434104] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
    Apr  1 12:07:21 ubu10a kernel: [ 1354.601353] block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Outdated -> Inconsistent )
    Apr  1 12:07:21 ubu10a kernel: [ 1354.601367] block drbd0: Began resync as SyncSource (will sync 8192 KB [2048 bits set]).
    Apr  1 12:07:21 ubu10a kernel: [ 1355.401912] block drbd0: Resync done (total 1 sec; paused 0 sec; 8192 K/sec)
    Apr  1 12:07:21 ubu10a kernel: [ 1355.401923] block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
    Apr  1 12:07:22 ubu10a kernel: [ 1355.612601] block drbd0: peer( Secondary -> Primary )
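
To follow the DRBD side of a log like the one above, the state transitions can be pulled out with a small script. This is only a sketch: it assumes each kernel message sits on a single line rather than being wrapped, and the function name drbd_transitions is made up for illustration.

```python
import re

# Matches DRBD state changes such as "conn( WFConnection -> WFReportParams )".
TRANSITION = re.compile(r"(\w+)\( (\w+) -> (\w+) \)")


def drbd_transitions(log_lines):
    """Return (field, old_state, new_state) tuples from DRBD kernel messages."""
    found = []
    for line in log_lines:
        if "block drbd" not in line:
            continue  # skip OCFS2 and other kernel messages
        found.extend(TRANSITION.findall(line))
    return found


sample = [
    "Apr  1 12:07:19 ubu10a kernel: [ 1352.902509] "
    "(ocfs2_wq,1213,0):ocfs2_finish_quota_recovery:598 Finishing quota recovery in slot 1",
    "Apr  1 12:07:20 ubu10a kernel: [ 1354.434104] block drbd0: "
    "peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )",
    "Apr  1 12:07:21 ubu10a kernel: [ 1355.401923] block drbd0: "
    "conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )",
]
transitions = drbd_transitions(sample)
for field, old, new in transitions:
    print(f"{field}: {old} -> {new}")
```

Listing the transitions next to the ocfs2rec timestamps makes it easier to see that OCFS2 recovery for slot 1 is logged before DRBD reconnects and resyncs.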


Therefore, my question is whether there is an option in OCFS2 to remove or prevent
this lock, especially since it's inside a DRBD configuration. I'm still new to
OCFS2, so I am definitely open to any criticism regarding my setup/approach, or
any recommendations related to keeping the cluster active when another node is
shut down during testing.


_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


Re: [Ocfs2-users] Node Recovery locks I/O in two-node OCFS2 cluster (DRBD 8.3.8 / Ubuntu 10.10)
