Having confirmed with Stephan, I believe this problem relates to the 
HEARTBEAT_THRESHOLD parameter as set in /etc/sysconfig/o2cb. I have run into 
this myself, and a couple of other people on the list have confirmed that it 
caused problems for them too, so the default threshold of 7 seems to be too 
short, even on reasonably fast server-storage setups such as an HP DL380 
Packaged Cluster.
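
For reference, here is a rough sketch of how the threshold appears to map to 
the write timeout. It assumes a fixed 2-second heartbeat interval and a timeout 
of (threshold - 1) intervals, which matches the 12000 milliseconds reported 
with the default of 7 in Stephan's log below. The function names are my own, 
not part of any OCFS2 tool.

```python
import math

# Assumption: one heartbeat interval is 2 seconds, and a node fences itself
# after (threshold - 1) intervals without a successful heartbeat write.
HEARTBEAT_INTERVAL_MS = 2000


def write_timeout_ms(threshold: int) -> int:
    """Write timeout in milliseconds implied by a given threshold."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * HEARTBEAT_INTERVAL_MS


def threshold_for_timeout(timeout_s: float) -> int:
    """Smallest threshold whose timeout is at least timeout_s seconds."""
    return math.ceil(timeout_s * 1000 / HEARTBEAT_INTERVAL_MS) + 1


# Print the timeout for the default and for two larger example values.
for threshold in (7, 16, 31):
    print(threshold, write_timeout_ms(threshold))
```

Under these assumptions, tolerating a 60-second storage stall would need a 
threshold of 31. Whether such a value is appropriate is exactly the question 
for the developers.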

Does the OCFS2 development team also consider this to be too short, or is 
altering the parameter just a workaround that shouldn't be used? If it is only 
a workaround, how should we approach the problem of self-fencing nodes?

Also, can we expect this behaviour with some platforms but not others, or is it 
too short for all platforms? If it is a blanket problem, then should the 
default threshold be raised?

Finally, if altering the threshold is a valid solution, could it please be 
added to the FAQs and the user guide so that people know to adjust it as a 
first step on encountering the problem, rather than having to post to the list 
and wait for replies?

Regards,
Gavin
 

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Stephan A. Rickauer
Sent: Thursday, 30 March 2006 00:47
To: [email protected]
Subject: [Ocfs2-users] heartbeat write timeout

Dear list,

I am evaluating ocfs2 in a test environment that currently runs a "cluster" in 
single-node mode (AMD Opteron, 2GB RAM, RH AS4 (CentOS 4.3), 2.6.9-34.EL) 
connected to an iSCSI storage device. While running load tests with 'bonnie++' 
to test the performance of the storage device together with the file system, I 
experience regular kernel panics related to ocfs2 (1.2.0 RPMs).

Here is the message I get (I did not want to file a bug yet; maybe I am just 
missing something). sdb1 is the iSCSI device:

---snip---
(3,0):o2hb_write_timeout: 164 ERROR: Heartbeat write timeout to device sdb1 after 12000 milliseconds
(3,0):o2hb_stop_all_regions: 1727 ERROR: stopping heartbeat on all active regions
Kernel panic - not syncing: ocfs2 is very sorry to be fencing this system by panicing
---snip---
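
As an aside, a minimal sketch of pulling the device and timeout out of a 
message like the one above, for anyone scanning their logs for the same 
symptom. The regular expression is based only on this one message, and the 
threshold calculation assumes a 2-second heartbeat interval; both are 
assumptions, not documented behaviour.

```python
import re

# The first log line from the message above, joined onto one line.
LOG_LINE = (
    "(3,0):o2hb_write_timeout: 164 ERROR: Heartbeat write timeout "
    "to device sdb1 after 12000 milliseconds"
)

# Pattern derived from this single sample message (assumption).
PATTERN = re.compile(
    r"Heartbeat write timeout to device\s+(\S+)\s+after\s+(\d+)\s+milliseconds"
)


def parse_write_timeout(line):
    """Return (device, timeout_ms) if the line matches, else None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def implied_threshold(timeout_ms, interval_ms=2000):
    """Threshold implied by a timeout, assuming timeout = (threshold - 1) * interval."""
    return timeout_ms // interval_ms + 1


device, timeout_ms = parse_write_timeout(LOG_LINE)
print(device, timeout_ms, implied_threshold(timeout_ms))
```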

I am tempted to rule out problems related to the iSCSI storage device, since 
tests with GFS and ext3 did not reveal comparable problems, but I cannot be 
100% sure.

On the bug page I spotted bug ID 565, which seems to fit my scenario, but the 
status of the bug is unclear to me (references to version 0.99 are given): 
http://oss.oracle.com/bugzilla/show_bug.cgi?id=565

Any help / comments etc. are appreciated.
Thanks.

-- 

 Stephan A. Rickauer

 -----------------------------------------------------------
 Institut für Neuroinformatik          Tel: +41 44 635 30 50
 Universität / ETH Zürich              Sek: +41 44 635 30 52
 Winterthurerstrasse 190               Fax: +41 44 635 30 53
 CH-8057 Zürich                        Web:  www.ini.ethz.ch

 RSA public key: https://www.ini.ethz.ch/~stephan/pubkey.asc
 -----------------------------------------------------------

_______________________________________________
Ocfs2-users mailing list
[email protected]
http://oss.oracle.com/mailman/listinfo/ocfs2-users
