As ocfs2 heartbeats on the same device, unplugging a different device on the
storage should not affect ocfs2 as long as the I/Os are completing. But the
logs indicate otherwise: the heartbeat I/Os are erroring out.
The o2net message concerns the tcp connection. We will be providing a way
to configure that timeout too.
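To separate the two timeouts being asked about: the disk heartbeat timeout
comes from O2CB_HEARTBEAT_THRESHOLD, while the 10 seconds in the o2net message
is the separate network idle timeout, which (per the above) is not yet
user-configurable. A rough sketch, assuming the (threshold - 1) * 2 formula
from the ocfs2 documentation and the values already quoted in this thread:

prompt># grep O2CB_HEARTBEAT_THRESHOLD /etc/sysconfig/o2cb
# disk heartbeat timeout = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds
#   threshold  7 -> (7-1)*2  = 12s  (the "12000 milliseconds" in the write-timeout logs below)
#   threshold 61 -> (61-1)*2 = 120s
# The "10 seconds" in the o2net idle message is the network timeout, not this value.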
Peter Santos wrote:
Sunil,
after trying to chase this down, I think one of our SAs might have restarted
the storage without notifying anyone.
Similarly, today a disk that was not in use was re-initialized and caused
everything to come down. I don't know if this is an issue with ocfs2 or
(old storage + our SA doing this incorrectly).
The idea was to re-initialize a disk that was not being used (sdc) and not
have it affect the ocfs2 storage (sdb).
After the re-initialization completed, I noticed that all 3 nodes weren't
working and this was what I found on dbo3:
=======================================================================================================================
Nov 21 11:40:36 dbo3 kernel: o2net: connection to node dbo2 (num 1) at
192.168.134.141:7777 has been idle for 10
seconds, shutting it down.
Nov 21 11:40:36 dbo3 kernel: (0,1):o2net_idle_timer:1310 here are some times
that might help debug the situation: (tmr
1164127226.293816 now 1164127236.291931 dr 1164127226.293797 adv
1164127226.293818:1164127226.293819 func (a77953f3:2)
1164124426.747626:1164124426.747628)
Nov 21 11:40:36 dbo3 kernel: o2net: no longer connected to node dbo2 (num 1) at
192.168.134.141:7777
Nov 21 11:41:11 dbo3 kernel: SCSI error : <1 0 0 0> return code = 0x10000
Nov 21 11:41:11 dbo3 kernel: end_request: I/O error, dev sdb, sector 591502543
Nov 21 11:41:11 dbo3 kernel: SCSI error : <1 0 0 0> return code = 0x10000
...
Nov 21 11:41:11 dbo3 kernel: SCSI error : <1 0 0 0> return code = 0x10000
Nov 21 11:41:11 dbo3 kernel: end_request: I/O error, dev sdb, sector 591502568
Nov 21 11:41:11 dbo3 kernel: (3711,0):o2hb_do_disk_heartbeat:954 ERROR: status
= -5
Nov 21 11:41:11 dbo3 kernel: (3789,0):o2hb_do_disk_heartbeat:954 ERROR: status
= -5
Nov 21 11:41:11 dbo3 kernel: SCSI error : <1 0 0 0> return code = 0x10000
Nov 21 11:41:11 dbo3 kernel: end_request: I/O error, dev sdb, sector 1983
Nov 21 11:41:11 dbo3 kernel: (6614,0):o2hb_bio_end_io:332 ERROR: IO Error -5
Nov 21 11:41:11 dbo3 kernel: SCSI error : <1 0 0 0> return code = 0x10000
Nov 21 11:41:11 dbo3 kernel: end_request: I/O error, dev sdb, sector 3921780
Nov 21 11:41:11 dbo3 kernel: (6614,0):o2hb_bio_end_io:332 ERROR: IO Error -5
Nov 21 11:41:11 dbo3 kernel: (3711,0):o2hb_do_disk_heartbeat:954 ERROR: status
= -5
Nov 21 11:41:11 dbo3 kernel: (3789,0):o2hb_do_disk_heartbeat:954 ERROR: status
= -5
...
Nov 21 11:41:11 dbo3 kernel: (3711,0):o2hb_do_disk_heartbeat:954 ERROR: status
= -5
Nov 21 11:41:11 dbo3 kernel: (3789,0):o2hb_do_disk_heartbeat:954 ERROR: status
= -5
Nov 21 11:41:11 dbo3 su: pam_unix2: session finished for user oracle, service su
Nov 21 11:41:11 dbo3 logger: Oracle CSSD failure 134.
Nov 21 11:45:07 dbo3 syslogd 1.4.1: restart.
I'm curious about the message
"o2net: connection to node dbo2 (num 1) at 192.168.134.141:7777 has been idle for 10
seconds, shutting it down."
I have increased my O2CB_HEARTBEAT_THRESHOLD to 61, but where is this message getting
"10 seconds" from?
Also, this message is displayed because dbo2 was not able to check into the
heartbeat filesystem, right?
-peter
Sunil Mushran wrote:
On nodes dbo1 and dbo3, the heartbeat timed out at 17:12:49. However, the
nodes did not fully panic: the network was shut down, but the heartbeat
thread was still going strong for some reason.
Within 10 seconds of that, by 17:12:59, dbo2 detected loss of network
connectivity with both dbo1 and dbo3. However, it was still seeing those
nodes heartbeat on disk and assumed they were alive. As per quorum rules,
it panicked.
So the question is: what was happening on nodes dbo1 and dbo3 after 17:12:49?
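As a minimal sketch of that quorum arithmetic (my simplification, not the
actual o2quo_make_decision code; compare with the fencing message in dbo2's
log below):

heartbeating=3                      # nodes still heartbeating on disk
connected=1                         # dbo2 could only count itself once both links idled out
quorum=$(( heartbeating / 2 + 1 ))  # majority of heartbeating nodes = 2
if [ "$connected" -lt "$quorum" ]; then
    echo "fence this node: connected to $connected, need $quorum of $heartbeating"
fi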
Peter Santos wrote:
Folks,
I'm trying to piece together what happened during a recent event where our
3-node RAC cluster had problems. It appears that all 3 nodes restarted,
which is likely to occur if all 3 nodes cannot communicate with the shared
ocfs2 storage.
I did find out from our SA that this happened while he was replacing a
failed drive on the storage and the storage was in a degraded mode. I'm
trying to understand whether the 3 nodes had a difficult time accessing the
shared ocfs2 volume or whether it was a tcp connectivity issue. There is
nobody currently using the cluster, so it should have been idle from a user
perspective.
prompt># cat /etc/fstab | grep ocfs2
/dev/sdb1 /ocfs2 ocfs2 _netdev,datavolume,nointr 0 0
/dev/sdb2 /backups ocfs2 _netdev,datavolume,nointr 0 0
We have 2 ocfs2 volumes: one is for the voting and OCR files, while the
other is to be used as shared storage for backups of archivelog files etc.
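If it helps to confirm which devices carry the heartbeat regions and which
nodes are heartbeating on them, ocfs2-tools ships a mounted.ocfs2 utility
(generic commands; the output is not from this cluster):

prompt># mounted.ocfs2 -d    # list ocfs2 devices with their label and UUID
prompt># mounted.ocfs2 -f    # list the nodes currently mounting each device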
/var/log/messages
NODE1 (dbo1)
========================================================================================================
Nov 15 17:12:49 dbo1 kernel: (13,3):o2hb_write_timeout:270 ERROR:
Heartbeat write timeout to device sdb2
after 12000 milliseconds
Nov 15 17:12:49 dbo1 kernel: Heartbeat thread (13) printing last 24
blocking operations (cur = 13):
Nov 16 05:44:58 dbo1 syslogd 1.4.1: restart.
NODE2 (dbo2)
========================================================================================================
Nov 15 17:12:57 dbo2 kernel: o2net: connection to node dbo1 (num 0) at
192.168.134.140:7777 has been idle for 10
seconds, shutting it down.
Nov 15 17:12:57 dbo2 kernel: (0,1):o2net_idle_timer:1310 here are some
times that might help debug the situation: (tmr
1163628767.826089 now 1163628777.825614 dr 1163628767.826070 adv
1163628767.826104:1163628767.826105 func (f0735f96
:506) 1163454320.893701:1163454320.893708)
Nov 15 17:12:57 dbo2 kernel: o2net: no longer connected to node dbo1
(num 0) at 192.168.134.140:7777
Nov 15 17:12:59 dbo2 kernel: o2net: connection to node dbo3 (num 2) at
192.168.134.142:7777 has been idle for 10
seconds, shutting it down.
Nov 15 17:12:59 dbo2 kernel: (0,1):o2net_idle_timer:1310 here are some
times that might help debug the situation: (tmr
1163628769.44144 now 1163628779.43640 dr 1163628769.44123 adv
1163628769.44159:1163628769.44160 func (f7e0383f:504)
1163540424.444236:1163540424.444248)
Nov 15 17:12:59 dbo2 kernel: o2net: no longer connected to node dbo3
(num 2) at 192.168.134.142:7777
Nov 15 17:32:37 dbo2 -- MARK --
Nov 15 17:33:03 dbo2 kernel: (11,1):o2quo_make_decision:121 ERROR:
fencing this node because it is only connected to 1
nodes and 2 is needed to make a quorum out of 3 heartbeating nodes
Nov 15 17:33:03 dbo2 kernel: (11,1):o2hb_stop_all_regions:1889 ERROR:
stopping heartbeat on all active regions.
Nov 15 17:33:03 dbo2 kernel: Kernel panic: ocfs2 is very sorry to be
fencing this system by panicing
Nov 15 17:33:03 dbo2 kernel:
NODE3 (dbo3)
========================================================================================================
Nov 15 17:12:49 dbo3 kernel: (13,3):o2hb_write_timeout:270 ERROR:
Heartbeat write timeout to device sdb2
after 12000 milliseconds
Nov 15 17:12:49 dbo3 kernel: Heartbeat thread (13) printing last 24
blocking operations (cur = 11):
Nov 16 10:45:32 dbo3 syslogd 1.4.1: restart.
Any help is greatly appreciated (BTW, I've read the ocfs2 user guide).
Thanks,
-peter
_______________________________________________
Ocfs2-users mailing list
[email protected]
http://oss.oracle.com/mailman/listinfo/ocfs2-users