> On Dec 1, 2014, at 10:57, Digimer <li...@alteeve.ca> wrote:
> 
> On 01/12/14 09:16 AM, Megan . wrote:
>> We decided to use this cluster solution in order to share GFS2 mounts
>> across servers.  We have a 7-node cluster that is newly set up, but
>> acting oddly.  It has 3 VMware guest hosts and 4 physical hosts (Dells
>> with iDRACs).  They are all running CentOS 6.6.  I have fencing
>> working (I'm able to do fence_node node and it will fence with
>> success).  I do not have the gfs2 mounts in the cluster yet.
> 
> Very glad you have fencing, that's a common early mistake.
> 
> A 7-node cluster is actually pretty large and is around the upper end of what 
> you can run before tuning starts to become fairly important.

Ha, I was unaware this was part of the folklore.  We have a couple of 9-node 
clusters; it did take some tuning to get them stable, and we are thinking about 
splitting one of them.  For our clusters we found that a uniform configuration 
helped a lot, so a mix of physical hosts and VMs in the same (largish) cluster 
would make me a little nervous; I don't know about anyone else's feelings.

>> I can manually fence it, and it still comes online with the same
>> issue.  I end up having to take the whole cluster down, sometimes
>> forcing reboot on some nodes, then bringing it back up.  It takes a
>> good part of the day just to bring the whole cluster online again.
> 
> Something fence related is not working.

We used to see something like this, and it usually traced back to "shouldn't be 
possible" inconsistencies in the fence group membership.  Once the fence group 
gets blocked by a mismatched membership list, everything above it breaks.  
Sometimes a "fence_tool ls -n" on all the cluster members will reveal an 
inconsistency ("fence_tool dump" also, but that's hard to interpret without 
digging into the group membership protocols).  If you can find an 
inconsistency, manually fencing the nodes in the minority might repair it.
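
To compare quickly across the members, a loop like the one below can help (a 
rough sketch only; it assumes you can ssh to every node as root, and node1 
through node7 stand in for your actual host names):

    # run from any box that can reach all members; node names are examples
    for n in node1 node2 node3 node4 node5 node6 node7; do
        echo "== $n =="
        ssh root@"$n" 'fence_tool ls -n'
    done

Every member should report the same membership; a node showing something 
different is in the minority I mentioned above and would be my first candidate 
for a manual fence.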

At the time, I did quite a lot of staring at fence_tool dumps, but never 
figured out how the fence group was getting into "shouldn't be possible" 
inconsistencies.  This was also all late RHEL5 and early RHEL6, so may not be 
applicable anymore.

> My recommendation would be to schedule a maintenance window and then stop 
> everything except cman (no rgmanager, no gfs2, etc). Then methodically test 
> crashing all nodes (I like 'echo c > /proc/sysrq-trigger') and verify they are 
> fenced and then recover properly. It's worth disabling cman and rgmanager 
> from starting at boot (period, but particularly for this test).
> 
> If you can reliably (and repeatedly) crash -> fence -> rejoin, then I'd start 
> loading back services and re-trying. If the problem reappears only under 
> load, then that's an indication of the problem, too.

I'd agree--start at the bottom of the stack and work your way up.
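
For what it's worth, the sequence we ran through for that kind of test looked 
roughly like this (a sketch only; the service names are the CentOS 6 / cman 
init scripts, so adjust for your setup):

    # on every node: keep the stack from starting automatically at boot
    chkconfig rgmanager off
    chkconfig gfs2 off
    chkconfig cman off

    # stop everything above cman, but leave cman itself running
    service rgmanager stop
    service gfs2 stop

    # on the node under test: force a kernel crash and watch it get fenced
    echo c > /proc/sysrq-trigger

    # once the node is back up, rejoin and check that membership is clean
    service cman start
    cman_tool nodes
    fence_tool ls -n

Repeat that for each node until crash -> fence -> rejoin is boring, then start 
layering rgmanager and the GFS2 mounts back on top as Digimer describes.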

-dan

