To elaborate on this: prior to 1.2.5, we used to hear complaints
about a frozen node causing processes on other, functioning nodes
to go into D state, presumably while they were accessing the fs.

There were two reasons for this. The first was the fencing method:
we used to call panic(), which at times would not reset the box. In
those cases the node would freeze, but its disk heartbeat thread would
keep chugging along, and the D state processes on the other nodes
would sit waiting for that node to stop heartbeating. A power off/on
would clear the issue. This was resolved in 1.2.5, when we changed the
fencing call from panic() to machine_restart().
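
If you are stuck on a pre-1.2.5 release, one possible mitigation
(assuming the stock kernel.panic=0 setting, which leaves the kernel
sitting in panic() forever) is to have the kernel auto-reboot a few
seconds after a panic:

# reboot 30 seconds after a panic instead of hanging
echo 30 > /proc/sys/kernel/panic
# make the setting persistent across reboots
echo "kernel.panic = 30" >> /etc/sysctl.conf

That way a fencing panic() at least ends in a reset, and the other
nodes can proceed with recovery once the heartbeat stops.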

The second reason was the insanely low default cluster timeouts,
which led to unnecessary fencing. This was partially resolved in
1.2.5, when we allowed custom values for all cluster timeouts. In
1.2.6/1.2.7, we raised the default timeouts to saner values.
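
These can all be set in /etc/sysconfig/o2cb. As an illustration only
(the exact values below are examples; check the defaults shipped with
your version, and run "service o2cb configure" to set them
interactively), a 1.2.6-era config could look like:

O2CB_ENABLED=true
O2CB_BOOTCLUSTER=ocfs2
# disk heartbeat: node is fenced after (threshold - 1) * 2 secs
O2CB_HEARTBEAT_THRESHOLD=31
# network timeouts, all in milliseconds
O2CB_IDLE_TIMEOUT_MS=30000
O2CB_KEEPALIVE_DELAY_MS=2000
O2CB_RECONNECT_DELAY_MS=2000

Whatever values you pick, they must be identical on all nodes in the
cluster.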

So, let's start with the kernel version, as that will at least narrow
down the known issues.
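
Something along these lines, run on both nodes, would be useful:

uname -r                  # running kernel release
rpm -qa | grep -i ocfs2   # installed ocfs2 tools/module packages
modinfo ocfs2             # version info for the loaded ocfs2 module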

Sunil Mushran wrote:
What's the kernel version number?

inode wrote:
Hi all,

I'm running OCFS2 on two systems with OpenSUSE 10.2, connected over
fibre channel to shared storage (HP MSA1500 + HP PROLIANT MSA20).

The cluster has two nodes (web-ha1 and web-ha2). Sometimes (once or
twice a month) OCFS2 stops working on both systems. On the first node
I get no errors in the log files; after a forced shutdown of the first
node, I see on the second node the logs at the bottom of this message.

I saw that some other people are running into similar trouble
(http://www.mail-archive.com/[email protected]/msg01135.html)
but that thread didn't help me...

Does anyone have any idea?

Thank you in advance.

Maurizio


web-ha1:~ # cat /etc/sysconfig/o2cb

O2CB_ENABLED=true
O2CB_BOOTCLUSTER=ocfs2
O2CB_HEARTBEAT_THRESHOLD=451

web-ha1:~ #
web-ha1:~ # cat /etc/ocfs2/cluster.conf
node:
        ip_port = 7777
        ip_address = 192.168.255.1
        number = 0
        name = web-ha1
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.255.2
        number = 1
        name = web-ha2
        cluster = ocfs2

cluster:
        node_count = 2
        name = ocfs2

web-ha1:~ #



Nov 28 15:28:59 web-ha2 kernel: o2net: connection to node web-ha1 (num
0) at 192.168.255.1:7777 has been idle for 10 seconds, shutting it down.
Nov 28 15:28:59 web-ha2 kernel: (23432,0):o2net_idle_timer:1297 here are
some times that might help debug the situation: (tmr 1196260129.36511
now 1196260139.34907 dr 1196260129.36503 adv
1196260129.36514:1196260129.36515 func (95bc84eb:504)
1196260129.36329:1196260129.36337)
Nov 28 15:28:59 web-ha2 kernel: o2net: no longer connected to node
web-ha1 (num 0) at 192.168.255.1:7777
Nov 28 15:28:59 web-ha2 kernel: (23315,0):dlm_do_master_request:1331
ERROR: link to 0 went down!
Nov 28 15:28:59 web-ha2 kernel: (23315,0):dlm_get_lock_resource:915
ERROR: status = -112
Nov 28 15:29:18 web-ha2 sshd[23503]: pam_unix2(sshd:auth): conversation
failed
Nov 28 15:29:18 web-ha2 sshd[23503]: error: ssh_msg_send: write
Nov 28 15:29:22 web-ha2 kernel: (23396,0):dlm_do_master_request:1331
ERROR: link to 0 went down!
Nov 28 15:29:22 web-ha2 kernel: (23396,0):dlm_get_lock_resource:915
ERROR: status = -107
Nov 28 15:29:29 web-ha2 kernel: (23450,0):dlm_do_master_request:1331
ERROR: link to 0 went down!
Nov 28 15:29:29 web-ha2 kernel: (23450,0):dlm_get_lock_resource:915
ERROR: status = -107
Nov 28 15:29:46 web-ha2 kernel: (23443,0):dlm_do_master_request:1331
ERROR: link to 0 went down!
ERROR: status = -107

[...]

Nov 22 18:14:50 web-ha2 kernel: (17634,0):dlm_restart_lock_mastery:1215
ERROR: node down! 0
Nov 22 18:14:50 web-ha2 kernel: (17634,0):dlm_wait_for_lock_mastery:1036
ERROR: status = -11
Nov 22 18:14:51 web-ha2 kernel: (17619,1):dlm_restart_lock_mastery:1215
ERROR: node down! 0
Nov 22 18:14:51 web-ha2 kernel: (17619,1):dlm_wait_for_lock_mastery:1036
ERROR: status = -11
Nov 22 18:14:51 web-ha2 kernel: (17798,1):dlm_restart_lock_mastery:1215
ERROR: node down! 0
Nov 22 18:14:51 web-ha2 kernel: (17798,1):dlm_wait_for_lock_mastery:1036
ERROR: status = -11
Nov 22 18:14:51 web-ha2 kernel: (17804,1):dlm_get_lock_resource:896
86472C5C33A54FF88030591B1210C560:M0000000000000009e7e54516dd16ec: at
least one node (0) to recover before lock mastery can begin
Nov 22 18:14:51 web-ha2 kernel: (17730,1):dlm_get_lock_resource:896
86472C5C33A54FF88030591B1210C560:M0000000000000009e76bf516dd144d: at
least one node (0) to recover before lock mastery can begin
Nov 22 18:14:51 web-ha2 kernel: (17634,0):dlm_get_lock_resource:896
86472C5C33A54FF88030591B1210C560:M000000000000000ac0d22b1f78e53c: at
least one node (0) to recover before lock mastery can begin
Nov 22 18:14:51 web-ha2 kernel: (17644,1):dlm_restart_lock_mastery:1215
ERROR: node down! 0
Nov 22 18:14:51 web-ha2 kernel: (17644,1):dlm_wait_for_lock_mastery:1036
ERROR: status = -11

[...]

Nov 22 18:14:54 web-ha2 kernel: (17702,1):dlm_get_lock_resource:896
86472C5C33A54FF88030591B1210C560:M0000000000000007a6dab9ef6eacbd: at
least one node (0) to recover before lock mastery can begin
Nov 22 18:14:54 web-ha2 kernel: (17701,1):dlm_get_lock_resource:896
86472C5C33A54FF88030591B1210C560:M000000000000000a06a13716de553e: at
least one node (0) to recover before lock mastery can begin
Nov 22 18:14:54 web-ha2 kernel: (3550,0):dlm_get_lock_resource:849
86472C5C33A54FF88030591B1210C560:$RECOVERY: at least one node (0)
to recover before lock mastery can begin
Nov 22 18:14:54 web-ha2 kernel: (3550,0):dlm_get_lock_resource:876
86472C5C33A54FF88030591B1210C560: recovery map is not empty, but must
master $RECOVERY lock now
Nov 22 18:14:54 web-ha2 kernel: (17893,0):ocfs2_replay_journal:1184
Recovering node 0 from slot 0 on device (8,17)
Nov 22 18:14:55 web-ha2 kernel: (17803,1):dlm_restart_lock_mastery:1215
ERROR: node down! 0
Nov 22 18:14:55 web-ha2 kernel: (17803,1):dlm_wait_for_lock_mastery:1036
ERROR: status = -11
Nov 22 18:14:55 web-ha2 kernel: (17602,0):dlm_restart_lock_mastery:1215
ERROR: node down! 0
Nov 22 18:14:55 web-ha2 kernel: (17602,0):dlm_wait_for_lock_mastery:1036
ERROR: status = -11



