Additional info: the node had no active OCFSv2 operations at all (OCFSv2 is 
used for backups only, and only from the other node). So if the system had 
simply SUSPENDED all FS operations and tried to rejoin the cluster, it could 
all have worked (moreover, the connection to the disk system was intact, so it 
could have closed the file system gracefully).

This reveals 3 problems at once:
- a single heartbeat link (instead of multiple links);
- a timeout that is too short (Ethernet can't guarantee recovery within 10 
seconds; 1 minute is the minimum it can guarantee);
- fencing even when the system is passive and could remount / reconnect 
instead of rebooting.

All we did in the lab was _disconnect 1 of the trunks between the switches for 
a few seconds, then plug it back into the socket_. No other application failed 
(including the heartbeat clusters). The database cluster was not doing anything 
on OCFS at the time of the failure (not even backups).

I will try running the heartbeat (and the OCFS protocol) between loopback 
interfaces next time (I am just curious whether that can provide the 10 seconds 
during network reconfiguration).

...
Feb  1 12:19:13 testrac12 kernel: o2net: connection to node testrac11 (num 0) 
at 10.254.32.111:7777 has been idle for 10 seconds, shutting it down. 
Feb  1 12:19:13 testrac12 kernel: (13,3):o2net_idle_timer:1310 here are some 
times that might help debug the situation: (tmr 1170361135.521061 now 
1170361145.520476 dr 1170361141.852795 adv 1170361135.521063:1170361135.521064 
func (c4378452:505) 1170361067.762941:1170361067.762967) 
Feb  1 12:19:13 testrac12 kernel: o2net: no longer connected to node testrac11 
(num 0) at 10.254.32.111:7777 
Feb  1 12:19:13 testrac12 kernel: (1855,3):dlm_send_remote_convert_request:398 
ERROR: status = -107 
Feb  1 12:19:13 testrac12 kernel: (1855,3):dlm_wait_for_node_death:371 
5AECFF0BBCF74F069A3B8FF79F09FB5A: waiting 5000ms for notification of death of 
node 0 
Feb  1 12:19:13 testrac12 kernel: (1855,1):dlm_send_remote_convert_request:398 
ERROR: status = -107 
Feb  1 12:19:13 testrac12 kernel: (1855,1):dlm_wait_for_node_death:371 
5AECFF0BBCF74F069A3B8FF79F09FB5A: waiting 5000ms for notification of death of 
node 0 
Feb  1 12:22:22 testrac12 kernel: (1855,2):dlm_send_remote_convert_request:398 
ERROR: status = -107 
Feb  1 12:22:22 testrac12 kernel: (1855,2):dlm_wait_for_node_death:371 
5AECFF0BBCF74F069A3B8FF79F09FB5A: waiting 5000ms for notification of death of 
node 0 
Feb  1 12:22:27 testrac12 kernel: (13,3):o2quo_make_decision:144 ERROR: fencing 
this node because it is connected to a half-quorum of 1 out of 2 nodes which 
doesn't include the lowest active node 0 
Feb  1 12:22:27 testrac12 kernel: (13,3):o2hb_stop_all_regions:1889 ERROR: 
stopping heartbeat on all active regions. 
Feb  1 12:22:27 testrac12 kernel: Kernel panic: ocfs2 is very sorry to be 
fencing this system by panicing 
Feb  1 12:22:27 testrac12 kernel: 
Feb  1 12:22:28 testrac12 su: pam_unix2: session finished for user oracle, 
service su 
Feb  1 12:22:29 testrac12 logger: Oracle CSSD failure.  Rebooting for cluster 
integrity. 
Feb  1 12:22:32 testrac12 su: pam_unix2: session finished for user oracle, 
service su 
...
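
To make the two key log lines easier to read, here is a rough Python sketch. 
It is not the kernel code, only my reading of the messages above. It assumes 
that "tmr" is the time the idle timer was armed and "now" is the time it fired, 
that testrac12 is node number 1 (the log only names node 0), and that the 
quorum rule is: survive with a majority, or with exactly half of an even-sized 
cluster if that half includes the lowest-numbered node. The function names are 
made up for illustration.

```python
from decimal import Decimal


def idle_gap_seconds(tmr, now):
    """Seconds between the 'tmr' and 'now' values of an o2net_idle_timer line.

    Both arguments are the timestamp strings exactly as printed in the log.
    Decimal keeps the microsecond digits exact.
    """
    return Decimal(now) - Decimal(tmr)


def must_fence(heartbeating, connected):
    """Return True if this node should fence itself (assumed quorum rule).

    heartbeating: set of node numbers still seen on the disk heartbeat
    connected:    set of node numbers this node can reach over the network,
                  including itself
    """
    if not heartbeating:
        return False
    alive = len(heartbeating)
    reachable = len(connected & heartbeating)
    lowest_reachable = min(heartbeating) in connected

    if alive % 2 == 1:
        # Odd-sized cluster: anything less than a majority means fencing.
        return reachable < (alive + 1) // 2

    # Even-sized cluster: exactly half is enough only for the half that
    # contains the lowest-numbered node.
    half = alive // 2
    if reachable < half:
        return True
    return reachable == half and not lowest_reachable


if __name__ == "__main__":
    # Values copied from the 12:19:13 o2net_idle_timer line.
    print(idle_gap_seconds("1170361135.521061", "1170361145.520476"))
    # testrac12 (assumed node 1) sees only itself; node 0 is unreachable.
    print(must_fence(heartbeating={0, 1}, connected={1}))
```

With the values from the log, the idle gap comes out just under 10 seconds, 
and in a 2-node cluster the node that cannot reach node 0 is the one that 
fences, even if it was doing no I/O on the file system. Under the same rule 
node 0 would survive in the same situation.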