We just installed a new cluster with 6 HP DL380 G5 servers, each with dual single-port QLogic 24xx 
HBAs, connected via two HP 4/16 StorageWorks switches to a 3Par S400. We are 
using the 3Par-recommended config for the QLogic driver and 
device-mapper-multipath, giving us 4 paths to the SAN. We do see some SCSI 
errors where DM-MP fails a path after getting a 0x2000 error from the SAN 
controller, but the path is put back in service in less than 10 seconds.
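For reference, here is roughly how we watch the path states and failure events; this is a generic sketch (device names and log paths are examples, not our exact setup):

```shell
# Show the current multipath topology and per-path states
multipath -ll

# Query the running multipathd daemon for path status
echo 'show paths' | multipathd -k

# Count path-checker failures and SCSI errors logged so far
grep -cE 'checker failed|SCSI error' /var/log/messages
```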

This needs to be fixed, but I don't think it is what is causing our reboots. Two 
of the nodes each rebooted once while idle (ocfs2 and clusterware were 
running, no db). One node rebooted once while idle (another node was at that 
moment copying our 9i db from ocfs1 to the ocfs2 data volume using fscat), and 
once more while some load was put on it via the upgraded 10g database. In all 
cases it is as if someone pressed a hardware reset button: no kernel panic (at 
least not one leading to a stop with a visible message), and we do see a dirty 
write cache on the internal cciss controller afterwards.
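Since no panic message survives on the local console, one thing worth trying is sending the kernel console to another machine so any last messages are captured. A minimal netconsole sketch (the IPs, ports, interface, and MAC below are placeholders, substitute your own):

```shell
# Stream kernel console output over UDP to a listening host
# format: netconsole=<src-port>@<src-ip>/<dev>,<dst-port>@<dst-ip>/<dst-mac>
modprobe netconsole netconsole=6665@10.0.0.5/eth0,6666@10.0.0.1/00:16:3e:00:00:01

# On the receiving host, capture the stream with e.g.:
#   nc -u -l -p 6666 | tee netconsole.log
```

A serial console (console=ttyS0,... on the kernel command line) would serve the same purpose if the boxes have one wired up.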

The only messages we get appear on the surviving nodes once the crashed node is 
already resetting: it misses its ocfs2 heartbeat (threshold set to the default 
of 7), followed later by CRS moving the VIP.
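For completeness, this is how the heartbeat dead threshold can be inspected and raised; a sketch only, since the exact proc/config paths vary by ocfs2 version, and the value of 31 is just an illustrative example (a threshold of N tolerates roughly (N-1)*2 seconds of missed heartbeat writes):

```shell
# Current dead threshold (path may differ on newer ocfs2 releases)
cat /proc/fs/ocfs2_nodemanager/hb_dead_threshold

# Raise it persistently via the o2cb init config, e.g.:
#   O2CB_HEARTBEAT_THRESHOLD=31
vi /etc/sysconfig/o2cb

# Takes effect after the cluster stack is restarted
/etc/init.d/o2cb restart
```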

Any hints on troubleshooting this would be appreciated.

Regards, Ulf.


--------------------------
Sent from my BlackBerry Wireless Handheld
       
_______________________________________________
Ocfs2-users mailing list
[email protected]
http://oss.oracle.com/mailman/listinfo/ocfs2-users
