I have a suggestion about the heartbeat and the way that "downed node" 
detection works.

There are occasions where a node is up and, for whatever reason, I need to 
power cycle it (for instance, because of a frozen process). In these instances, 
my other nodes are unable to perform file system operations until the heartbeat 
period expires. In my setup this is somewhere around 30-60 seconds (the value 
that works best for me and does not cause self-fencing). It would be useful to 
be able to force the remaining nodes to accept that the node was taken down 
purposefully, and move on with their lives.

A real-world example:

OCFS2 hosts the files for a website served by Apache. If a node goes down, 
the load average on all remaining nodes skyrockets to 500 or more, as the 
Apache processes all enter a state of uninterruptible sleep. This triggers 
alerts, pages, and on occasion, application-specific triggers in the web app 
that show a "Too Busy" page when the load average is too high (as vBulletin 
does, for instance).

It would be magnificent to be able to instruct the remaining nodes that the 
node in question was taken down purposefully, and to go on about their lives 
immediately (beginning of course with the journal replay, etc).
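
To make the idea concrete, here is a toy sketch in Python of the behavior I am 
asking for. It is only a model of the concept: the class and method names 
(HeartbeatMonitor, declare_down, recovery_wait) are made up for illustration 
and are not part of OCFS2 or o2cb, and the 60 second timeout is just the figure 
from my own setup.

```python
# Toy model only: none of these names exist in OCFS2 or o2cb.
class HeartbeatMonitor:
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s   # heartbeat period before a node is presumed dead
        self.last_seen = {}          # node name -> time of its last heartbeat
        self.declared_down = set()   # nodes an admin has said were taken down on purpose

    def beat(self, node, now):
        """Record a heartbeat; a node that beats again is no longer considered down."""
        self.last_seen[node] = now
        self.declared_down.discard(node)

    def declare_down(self, node):
        """Hypothetical admin command: the node was powered off deliberately."""
        self.declared_down.add(node)

    def is_dead(self, node, now):
        """A node is dead if declared down, or if its heartbeat has timed out."""
        if node in self.declared_down:
            return True
        return now - self.last_seen[node] >= self.timeout_s

    def recovery_wait(self, node, now):
        """Seconds the surviving nodes still wait before recovery can begin."""
        if self.is_dead(node, now):
            return 0.0
        return self.last_seen[node] + self.timeout_s - now


mon = HeartbeatMonitor(timeout_s=60.0)
mon.beat("node2", now=0.0)

# Today: node2 is power cycled at t=5 and the survivors wait out the timeout.
print(mon.recovery_wait("node2", now=5.0))

# Requested: tell the survivors the node is down and recover immediately.
mon.declare_down("node2")
print(mon.recovery_wait("node2", now=5.0))
```

In this model the only change is the extra declared_down check before the 
timeout test, which is why I expect the real thing to be simple as well.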

It's very simple in concept, and probably also in execution. Could something 
like this be added? It would allow me to really do wonderful things from a 
STONITH perspective.

Thanks,
Michael
_______________________________________________
Ocfs2-users mailing list
[email protected]
http://oss.oracle.com/mailman/listinfo/ocfs2-users
