On 01/12/14 11:34 AM, Dan Riley wrote:
> Ha, I was unaware this was part of the folklore.  We have a couple of 9-node
> clusters, it did take some tuning to get them stable, and we are thinking
> about splitting one of them.  For our clusters, we found that a uniform
> configuration helped a lot, so a mix of physical hosts and VMs in the same
> (largish) cluster would make me a little nervous; I don't know about anyone
> else's feelings.

Personally, I only build 2-node clusters. When I need more resources, I drop in another pair. This allows all my clusters, going back to 2008/9, to have nearly identical configurations. In HA, I would argue, nothing is more useful than consistency and simplicity.

That said, I'd not fault anyone for going to 4 or 5 nodes. Beyond that, though, I would argue that the cluster should be broken up. In HA, uptime should always trump resource utilization efficiency.

Mixing real and virtual machines strikes me as an avoidable complexity, too.
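
As an aside, part of what keeps my 2-node clusters so consistent is how little non-default configuration they need. A minimal cluster.conf sketch, just to illustrate the shape (node names, addresses, and fence devices below are all invented):

    <?xml version="1.0"?>
    <cluster name="an-cluster-01" config_version="1">
      <!-- two_node="1" with expected_votes="1" lets either node hold
           quorum alone; this is only valid for exactly two nodes. -->
      <cman two_node="1" expected_votes="1"/>
      <clusternodes>
        <clusternode name="node1.example.com" nodeid="1">
          <fence>
            <method name="ipmi">
              <device name="ipmi_node1" action="reboot"/>
            </method>
          </fence>
        </clusternode>
        <clusternode name="node2.example.com" nodeid="2">
          <fence>
            <method name="ipmi">
              <device name="ipmi_node2" action="reboot"/>
            </method>
          </fence>
        </clusternode>
      </clusternodes>
      <fencedevices>
        <fencedevice name="ipmi_node1" agent="fence_ipmilan"
                     ipaddr="10.20.0.1" login="admin" passwd="secret"/>
        <fencedevice name="ipmi_node2" agent="fence_ipmilan"
                     ipaddr="10.20.0.2" login="admin" passwd="secret"/>
      </fencedevices>
    </cluster>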

>> Something fence-related is not working.

> We used to see something like this, and it usually traced back to "shouldn't
> be possible" inconsistencies in the fence group membership.  Once the fence
> group gets blocked by a mismatched membership list, everything above it
> breaks.  Sometimes a "fence_tool ls -n" on all the cluster members will
> reveal an inconsistency ("fence_tool dump" also, but that's hard to
> interpret without digging into the group membership protocols).  If you can
> find an inconsistency, manually fencing the nodes in the minority might
> repair it.

In all my years, I've never seen this happen in production. If you can create a reproducer, I would *love* to see/examine it!
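
For anyone who wants to try the check Dan describes on a live cluster, it translates to something like the following (node names are placeholders):

    # Compare fence domain membership from every cluster member; all
    # nodes should report the same member list:
    for node in node1 node2 node3; do
        echo "=== $node ==="
        ssh "$node" 'fence_tool ls -n'
    done

    # 'fence_tool dump' prints fenced's debug buffer; the membership
    # transitions are in there, but it takes patience to read:
    fence_tool dump | less

    # If one node disagrees with the majority, manually fencing the
    # minority node(s) may unblock the fence group:
    fence_node node3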

> At the time, I did quite a lot of staring at fence_tool dumps, but never
> figured out how the fence group was getting into "shouldn't be possible"
> inconsistencies.  This was also all late RHEL5 and early RHEL6, so it may
> not be applicable anymore.

HA in RHEL 6.2+ seems to be quite a bit more stable (I think for more reasons than just the HA stuff). For this reason, I am staying on RHEL 6 until at least 7.2 is out. :)

>> My recommendation would be to schedule a maintenance window and then stop
>> everything except cman (no rgmanager, no gfs2, etc). Then methodically test
>> crashing all nodes (I like 'echo c > /proc/sysrq-trigger') and verify that
>> they are fenced and then recover properly. It's worth disabling cman and
>> rgmanager from starting at boot (period, but particularly for this test).

>> If you can reliably (and repeatedly) crash -> fence -> rejoin, then I'd
>> start loading services back on and re-testing. If the problem reappears
>> only under load, that itself is a useful clue.

> I'd agree--start at the bottom of the stack and work your way up.

> -dan
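
To make that crash -> fence -> rejoin cycle concrete, here is roughly what I run during such a window (RHEL 6-style service names; adjust node names to your environment):

    # On every node: keep the stack from starting at boot, so a fenced
    # node comes back up quiet and is rejoined by hand.
    chkconfig rgmanager off
    chkconfig gfs2 off
    chkconfig cman off

    # Stop everything above cman; leave only cman running.
    service rgmanager stop
    service gfs2 stop

    # On the node under test, force a kernel panic (make sure sysrq is
    # enabled first). The surviving node(s) should fence it.
    echo 1 > /proc/sys/kernel/sysrq
    echo c > /proc/sysrq-trigger

    # Once the node has been fenced and rebooted, rejoin it manually
    # and confirm membership from a surviving node:
    service cman start
    cman_tool nodes
    fence_tool ls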

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
