David Teigland wrote:
> On Thu, Nov 19, 2009 at 07:10:54PM +0100, Fabio M. Di Nitto wrote:
> > The error is detected in gfs.
>
> For every error in every bit of code, the developer needs to consider
> what the appropriate error handling should be: What are the consequences
> (with respect to availability and data integrity), both locally and
> remotely, of the error handling they choose? It's case by case.
>
> If the error could lead to data corruption, then the proper error
> handling is usually to fail fast and hard.
Of course, agreed.

> If the error can result in remote nodes being blocked, then the proper
> error handling is usually self-sacrifice to avoid blocking other nodes.

Ok, so this is the case we are seeing here: the cluster is half blocked,
but there is no self-sacrifice action happening.

> Self-sacrifice means forcibly removing the local node from the cluster so
> that others can recover for it and move on. There are different ways of
> doing self-sacrifice:
>
> - panic the local machine (kernel code usually uses this method)
> - killing corosync on the local machine (daemons usually do this)
> - calling reboot (I think rgmanager has used this method)

I don't really have an opinion on how it happens, as long as it works.

> > panic_on_oops is not cluster specific and not all OOPS are panic == not
> > a clean solution.
>
> So you want gfs oopses to result in a panic, and non-gfs oopses to *not*
> result in a panic?

Well, partially yes. We can't make decisions for OOPSes that are not
generated within our code. The user will have to configure that via
panic_on_oops or other means. Maybe our task is to make sure users are
aware of this situation/option (I didn't check if it is documented).

You have a point in saying that it varies from error to error, and this
is exactly where I'd like to head. Maybe it's time to review our error
paths and make better decisions on what to do, at least within our code.

> There's probably a combination of options that would produce this
> effect. Most people interested in HA will want all oopses to result in
> a panic and recovery, since an oops puts a node in a precarious
> position regardless of where it came from.

I agree, but I don't think we can kill the node on every OOPS by
default. We can agree that this has to be a user-configurable choice,
but we can improve our stuff to do the right thing (or do better what it
does now).

Fabio
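For readers following along: the panic_on_oops behaviour discussed above is a standard Linux sysctl, not anything cluster-specific. A minimal sketch of how a user could opt in to "every oops becomes a panic, then auto-reboot" (the kernel.panic delay value of 10 seconds is an arbitrary example, not a recommendation from this thread):

```shell
# Escalate any kernel oops to a full panic. With the default of 0, an
# oops only kills the offending task and can leave the node half-alive,
# which is exactly the "half blocked cluster" situation described above.
sysctl -w kernel.panic_on_oops=1

# Reboot automatically 10 seconds after a panic, so the node comes back
# and rejoins the cluster instead of sitting at a dead console.
sysctl -w kernel.panic=10

# To persist across reboots, add the equivalent lines to /etc/sysctl.conf:
#   kernel.panic_on_oops = 1
#   kernel.panic = 10
```

This only covers the kernel-oops case; the daemon-side self-sacrifice paths (killing corosync, calling reboot) are separate mechanisms inside the cluster software itself.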
