After confirming with Stephan, this problem appears to relate to the HEARTBEAT_THRESHOLD parameter as set in /etc/sysconfig/o2cb. I have run into this myself, and a couple of other people on the list have confirmed that it caused them problems too, so it seems that the default threshold of 7 may be too low, even for reasonably fast server-storage setups such as an HP DL380 Packaged Cluster.
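For anyone wanting to work out what a given threshold means in practice, here is a rough sketch in Python. It assumes a heartbeat write every 2 seconds and a timeout of (threshold - 1) intervals; that relation is my assumption, not something confirmed in this thread, but it does match the 12000 milliseconds reported in Stephan's log for the default of 7. The function names are made up for illustration.

```python
# Assumed interval between heartbeat writes, in milliseconds.
# This value is an assumption; check it against your own o2cb setup.
HEARTBEAT_INTERVAL_MS = 2000


def timeout_ms(threshold: int) -> int:
    """Return the write timeout implied by a heartbeat threshold.

    Assumes timeout = (threshold - 1) * heartbeat interval.
    """
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * HEARTBEAT_INTERVAL_MS


def threshold_for(timeout_seconds: int) -> int:
    """Return the smallest threshold whose timeout is at least timeout_seconds."""
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be positive")
    wanted_ms = timeout_seconds * 1000
    # Ceiling division: number of heartbeat intervals needed to cover the timeout.
    intervals = -(-wanted_ms // HEARTBEAT_INTERVAL_MS)
    return intervals + 1


if __name__ == "__main__":
    print(timeout_ms(7))      # timeout implied by the default threshold
    print(threshold_for(60))  # threshold needed to tolerate a 60 second stall
```

Under the same assumption, tolerating a longer storage stall is just a matter of picking the stall length in seconds and passing it to threshold_for().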
Does the OCFS2 development team also consider this to be too short, or is altering the parameter just a workaround that shouldn't be used? If it is only a workaround, how should we approach the problem of self-fencing nodes?

Also, can we expect this behaviour on some platforms but not others, or is the default too short for all platforms? If it is a blanket problem, should the default threshold be raised?

Finally, if altering the threshold is a valid solution, could it please be added to the FAQs and the user guide, so that people know to adjust it as a first step on encountering the problem rather than having to post to the list and wait for replies.

Regards,
Gavin

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Stephan A. Rickauer
Sent: Thursday, 30 March 2006 00:47
To: [email protected]
Subject: [Ocfs2-users] heartbeat write timeout

Dear list,

I am evaluating ocfs2 in a test environment that currently runs a "cluster" in one-node mode (AMD Opteron, 2GB RAM, RH AS4 (CentOS 4.3), 2.6.9-34.EL) connected to an iSCSI storage device. While running load tests with 'bonnie++' to measure the performance of the storage device together with the file system, I experience regular kernel panics related to ocfs2 (1.2.0 RPMs).

Here is the message I get (I did not want to file a bug yet, maybe it's just me missing something). sdb1 is the iSCSI device:

---snip---
(3,0):o2hb_write_timeout: 164 ERROR: Heartbeat write timeout to device sdb1 after 12000 milliseconds
(3,0):02hb_stop_all_regions: 1727 ERROR: stopping heartbeat on all active regions
Kernel panic - not syncing: ocfs2 is very sorry to be fencing this system by panicing
---snip---

I am tempted to rule out problems with the iSCSI storage device itself, since tests with GFS and ext3 did not reveal comparable problems, but I cannot be 100% sure.
On the bug page I spotted ID 565, which seems to fit my scenario, but the status of the bug is unclear to me (it refers to version 0.99):

http://oss.oracle.com/bugzilla/show_bug.cgi?id=565

Any help, comments, etc. are appreciated. Thanks.

--
Stephan A. Rickauer
-----------------------------------------------------------
Institut für Neuroinformatik     Tel: +41 44 635 30 50
Universität / ETH Zürich         Sek: +41 44 635 30 52
Winterthurerstrasse 190          Fax: +41 44 635 30 53
CH-8057 Zürich                   Web: www.ini.ethz.ch
RSA public key: https://www.ini.ethz.ch/~stephan/pubkey.asc
-----------------------------------------------------------
