On 07/02/2013 04:02 AM, Dejan Muhamedagic wrote:
> On Mon, Jul 01, 2013 at 11:53:29AM -0400, Digimer wrote:
>> On 07/01/2013 04:52 AM, Dejan Muhamedagic wrote:
>>> Right. It is often missed that actually more than one failure is
>>> required for that setup to fail. In case of dual PDU/PSU/UPS an
>>> IPMI based fencing is sufficient.
>>
>> You are right, of course. Imagine though that the IPMI BMC's network
>> port or cable could have silently failed some time before the node
>> failed. Yes, this is two simultaneous failues so not an overall SPoF,
>> but likely enough that it should be addressed.
>>
>> If you've already setup redundant power, then it strikes me as fairly
>> easy to use your PDUs as a backup fence method.
>>
>> Now all this said, you'll note in the mailing lists and IRC that I don't
>> tell people they should have two methods. If people setup just IPMI
>> fencing, I am happy. It's a question of how careful do you want/need to
>> be, after that. For me, one fence method is not enough.
>
> I suppose that you're supporting a few clusters. How often does
> it happen that nodes get fenced? And why? And did you in those
> cases needed to use the backup fence device?
>
> Thanks,
>
> Dejan

They occasionally get fenced, but it's very rare. Most fences came from an earlier configuration I no longer offer, which was based on one switch (with redundant NICs in bond mode=1). The switch would hiccup and that would trigger fencing. Since I switched to dual switches, I've not had a network-triggered failure.

The most common problem I see, and that my clusters have saved people from, is power problems. These have never required fencing; rather, simply having two monitored UPSes has allowed us to detect pending catastrophic power failures (a transformer blew up three days after we started seeing alerts, a faulty regulator in a customer's neighborhood, etc). We've also saved a customer's entire (small) DC when they lost their air conditioning and their own alerts failed (we saw a sudden rise in inlet temp and alerted the client). One node at the top of the rack (out of four dual-node clusters) went into thermal shutdown and got fenced before we could shed enough load. They didn't lose any of their non-clustered servers though.

So, to your question: have we ever needed the backup fencing in production? Nope, but I see it as just a matter of time. One user error, one bad UPS/battery pack, one tripped breaker and it will save us.

When we demo our clusters to prospective customers, the most dramatic test we do is shutting down the primary UPS. This takes out one of the switches and one of the dashboard appliances, and forces the nodes to run on half their power. If this happened in production, then dual PDUs would certainly save us.

Not my personal experience, but a sysadmin friend of mine had a case where a server's 12 VDC wire was rubbing against a sharp piece of the chassis. Eventually it cut through the insulation and shorted out, taking out the node's power despite its redundant PSUs. Had this happened to our cluster, we'd have been saved by the backup fence device, because IPMI would have been lost along with the node's power.

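To make the shorted-wire case concrete, here is a minimal Python sketch of ordered fence methods with a fallback. It is only a toy model under stated assumptions: `fence_node`, `make_ipmi` and `make_pdu` are hypothetical names invented for illustration, not real fence agents or Pacemaker APIs, and each "method" simply reports whether it could confirm the node is off.

```python
from typing import Callable, List, Tuple

# A fence method takes a node name and returns True only when it can
# confirm that the node has been powered off.
FenceMethod = Callable[[str], bool]


def fence_node(node: str, methods: List[Tuple[str, FenceMethod]]) -> str:
    """Try each fence method in order; return the name of the first that succeeds."""
    for name, method in methods:
        if method(node):
            return name
    # No method could confirm the node is off, so the cluster must keep waiting.
    raise RuntimeError(f"all fence methods failed for {node}")


def make_ipmi(bmc_reachable: bool) -> FenceMethod:
    # Hypothetical stand-in for IPMI fencing: it can only confirm
    # power-off when the node's BMC still answers.
    return lambda node: bmc_reachable


def make_pdu(pdu_reachable: bool) -> FenceMethod:
    # Hypothetical stand-in for PDU fencing: it cuts the outlets feeding
    # the node, so it does not depend on the node's own power.
    return lambda node: pdu_reachable


if __name__ == "__main__":
    # Healthy hardware: the primary (IPMI) method is enough.
    print(fence_node("node1", [("ipmi", make_ipmi(True)), ("pdu", make_pdu(True))]))
    # Node lost all power, BMC included: only the backup method can succeed.
    print(fence_node("node1", [("ipmi", make_ipmi(False)), ("pdu", make_pdu(True))]))
```

With only the IPMI method in the list, the second call would raise instead of returning, which is the situation a backup fence device is meant to avoid.
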
I've got ten or so customers around North America and I've only been doing this for four years or so. That I have not *yet* been saved by backup fencing in no way means it is not needed. :)

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org