Hello,
I'm investigating the use of corosync and pacemaker to manage our file
system cluster, and I'm running into some not unexpected issues. For
many reasons, it makes sense to manage all of the nodes as a single
cluster, but it would appear that pacemaker is not currently suitable
for a ~200 node cluster, and that corosync will require some tuning to
get there. As I said, not unexpected.
To separate concerns, I'm focusing on getting corosync up and stable at
smaller scales first, and then plan to get pacemaker happy once there is
a solid foundation. To that end, I've started with smaller clusters, 12
to 48 nodes or so -- using GigE currently, though I would eventually
prefer to use a redundant ring over InfiniBand.
At the moment I'm using the following totem settings in the
configuration file:
join: 50
token: 2000
consensus: 5000
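For context, those values live in the totem section of corosync.conf; a
trimmed-down sketch of that section is below (the addresses are
placeholders, and the non-timing values shown are just the stock
defaults, not anything I've tuned):

totem {
        version: 2
        secauth: off
        threads: 0
        token: 2000
        consensus: 5000
        join: 50
        interface {
                ringnumber: 0
                bindnetaddr: 10.0.0.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }
}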
I've tried a few other settings as well, but the ring seems to become
unstable beyond 70 or so nodes, and it may also have some stability
issues at lower scales, especially around configuration changes, where
multiple rings are formed and dissolved in rapid succession. At smaller
scales it will often settle down and include all active nodes, but at
larger scales the churn continues indefinitely, and some nodes segfault
or get confused about the expected sequence number. Even when the
configuration does stabilize, I have seen it reach a state where it
seems to pass only 4 to 8 messages per second, as measured by the log
output from the SYNC service. Pacemaker has been disabled for this work.
Does anyone have some suggestions on good timing parameters to use for
rings of this size? I can probably work my way through the papers on
Totem to deduce some numbers, but perhaps the experienced hands here
have some idea of the ballpark I'm looking for.
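For what it's worth, working from the corosync.conf man page my own
uninformed ballpark for 100+ nodes would be something like the
following timing values -- purely a guess on my part, not something
I've tested yet:

totem {
        token: 10000
        # man page says consensus must be at least 1.2 * token
        consensus: 12000
        join: 100
        # man page suggests send_join for rings larger than ~32 nodes
        send_join: 100
}

Does that look like the right order of magnitude, or am I off by a lot?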
As for the segfault, it is the result of totempg_deliver_fn() being
handed an encapsulated packet and then misinterpreting it. This was
handed down from messages_deliver_to_app(), and based on the flow around
deliver_messages_from_recovery_to_regular() I expect that it should not
see encapsulated messages. Looking through the core dump, the
multi-encapsulated message is from somewhat ancient ring instances: the
current ringid seq is 38260, and the outer encapsulation is for seq
38204 with an inner encapsulation of seq 38124. It seems this node was
last operational in ring 38204, and had entered the recovery state a
number of
times without landing in operational again prior to the crash.
I have a core dump of this occurring in corosync 1.2.1, as well as the
logs from the node that crashed and one or two others in the cluster.
I've looked through the changes to 1.2.2 and 1.2.3, but nothing stands
out as likely to solve this. Building new versions is somewhat painful
on this diskless cluster, so I'll try to reproduce with 1.2.3 before
building custom versions. I can probably make the logs available to
interested parties as well.
While working with pacemaker prior to focusing on corosync, I noticed
on several occasions that corosync would get into a situation where all
nodes of the cluster were considered members of the ring, but some nodes
were working with sequence numbers that were several hundred behind
everyone else, and did not catch up. I have not seen this in a
corosync-only test, but I suspect it may be related to the segfault
above -- it only seemed to occur after a pass through the recovery state.
Any suggestions on how to proceed to put this bug to bed?
Thanks,
Dave