Re: [Linux-ha-dev] Re: confused about the bully election algorithm implementation in crmd

Andrew Beekhof Fri, 05 Jan 2007 04:31:27 -0800

Thank you for your knowledgeable input!

On 1/5/07, home_king <[EMAIL PROTECTED]> wrote:


Hi, Andrew.

Nice patch. :)

However, there exist other problems.

1. do_election_check() compares the size of the 'voted' hashtbl with the
size of the fsa_membership_copy->members hashtbl. However, if 'voted'
contains obsolete items (NOVOTEs with an old election_id), they will be
counted in error.


They won't ever have the old election_id, since the hash is destroyed
when the election is complete.

However, if we marked it complete based on old NOVOTEs, then it could
potentially contain the newer ones which would be a problem.

I believe the correct solution is to destroy "voted" after every vote
has been sent and I will be testing that later today.


That is, once a new election is launched, your patch can refuse old NOVOTEs,
but it has no chance to purge the old NOVOTEs already recorded in the 'voted'
hashtbl.

Suppose there are 3 nodes: A, B, C.
A_bornon > B_bornon > C_bornon
B launches the first election, and A sends NOVOTE to B.
C launches the second election, and both A & B make C master.
Then C goes down and B launches the third election. B immediately becomes
master, no matter whether A votes for it or not, because the 'voted' hashtbl
of B still contains the old NOVOTE from A!
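
The scenario above can be modelled in a few lines of C. This is only a
minimal sketch, not crmd code: vote_table, start_election(), record_novote()
and election_complete() are hypothetical names, and a fixed array stands in
for the 'voted' hashtbl. It assumes the approach discussed in this thread,
i.e. recorded votes are dropped whenever the node starts a new election, so a
NOVOTE from an earlier round cannot be counted.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_NODES 8

/* Hypothetical stand-in for the 'voted' hashtbl: one slot per peer. */
struct vote_table {
    bool voted[MAX_NODES];  /* true once a NOVOTE from that peer is recorded */
    int  election_id;       /* id of the election this node is running */
};

/* Start a new election: bump the id and drop every recorded vote. */
static void start_election(struct vote_table *t)
{
    t->election_id++;
    memset(t->voted, 0, sizeof t->voted);
}

/* Record a NOVOTE, ignoring replies that belong to an older election. */
static bool record_novote(struct vote_table *t, int peer, int election_id)
{
    if (peer < 0 || peer >= MAX_NODES || election_id != t->election_id) {
        return false;
    }
    t->voted[peer] = true;
    return true;
}

/* The election is complete only when every other member has answered. */
static bool election_complete(const struct vote_table *t, int self,
                              int n_members)
{
    for (int peer = 0; peer < n_members; peer++) {
        if (peer != self && !t->voted[peer]) {
            return false;
        }
    }
    return true;
}
```

With A as peer 0 and B as peer 1, B's third election no longer completes on
the strength of A's NOVOTE from the first one: the table is empty after
start_election(), and a late reply carrying the old election_id is rejected.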

2. The CIB will not work in some special scenarios.
What happens if the admin launches a CIB operation, for example add_src,
while an election is in progress? At that time no CIB master exists, so the
operation will never be handled or replied to! The admin tool, such as mgmtd,
will block forever.
The same problem occurs when the DC exits, or while the JOIN protocol is in
progress (the full sync has not been launched yet).
The CIB could provide a "lock" mechanism to its users, with which crmd can
freeze the CIB & keep data safe while crmd is in an unstable state.


Basically that's what you're already seeing.
Granted clients shouldn't be kept waiting forever (there was some
timeout code but I wasn't happy with it and have not had time to
return and write it properly).

The -l switch for cibadmin was created for this exact reason, so that
an admin that was really sure what they wanted could continue to
change the CIB during these times, and everyone else would wait for
the cluster to stabilize and try again.

Though, as above, this falls short when the tools don't return.

3. Incorrect use of the fsa "stall" will cause a deadlock.
R_CCM_DATA must be set before the fsa enters the PENDING state, or
do_started() will stall the fsa, prepending itself to the fsa queue. Once the
fsa is stalled, other fsa_jobs cannot be processed. However, the R_CCM_DATA
flag is only set in an fsa_job, do_ccm_update_cache(), which is registered by
the I_CCM_EVENT handler, crmd_ccm_msg_callback():
        register_fsa_input_adv(
            C_CCM_CALLBACK, I_CCM_EVENT, event_data,
            trigger_transition?A_TE_CANCEL:A_NOTHING,
            FALSE, __FUNCTION__);

You see, do_ccm_update_cache() has no chance to run because the fsa is
stalled, while the stalled function do_started() is waiting for the result of
do_ccm_update_cache().
This is the deadlock.


Have you ever experienced this?  (Just curious)
It should be unlikely, since A_STARTED can never be called until
A_CCM_CONNECT completes successfully (usually implying that R_CCM_DATA
will be set).

I think the answer is to change:
                register_fsa_input_adv(
                        C_CCM_CALLBACK, I_CCM_EVENT, event_data,
                        trigger_transition?A_TE_CANCEL:A_NOTHING,
                        FALSE, __FUNCTION__);
to:
                register_fsa_input_adv(
                        C_CCM_CALLBACK, I_CCM_EVENT, event_data,
                        trigger_transition?A_TE_CANCEL:A_NOTHING,
                        TRUE, __FUNCTION__);

I will also test that today to see if there are any negative side-effects.
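
To make the effect of that flag concrete, here is a toy model of the queue
behaviour described in point 3. It is a sketch under stated assumptions, not
the crmd implementation: fsa_queue, enqueue(), dequeue() and run() are
hypothetical, and the only things taken from the thread are that a stalled
action puts itself back at the head of the queue and that the changed
argument asks for the CCM event to be prepended rather than appended.

```c
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAP 16

enum job { JOB_STARTED, JOB_CCM_EVENT };

/* Hypothetical toy queue; index 0 is the head. */
struct fsa_queue {
    enum job jobs[QUEUE_CAP];
    size_t   len;
};

/* Add a job at the head (prepend == true) or at the tail. */
static bool enqueue(struct fsa_queue *q, enum job j, bool prepend)
{
    if (q->len == QUEUE_CAP) {
        return false;
    }
    if (prepend) {
        for (size_t i = q->len; i > 0; i--) {
            q->jobs[i] = q->jobs[i - 1];
        }
        q->jobs[0] = j;
    } else {
        q->jobs[q->len] = j;
    }
    q->len++;
    return true;
}

/* Remove and return the head job. The caller checks q->len > 0 first. */
static enum job dequeue(struct fsa_queue *q)
{
    enum job head = q->jobs[0];
    for (size_t i = 1; i < q->len; i++) {
        q->jobs[i - 1] = q->jobs[i];
    }
    q->len--;
    return head;
}

/*
 * Run at most max_steps jobs. JOB_STARTED stalls (goes back to the head)
 * until the flag standing in for R_CCM_DATA is set by JOB_CCM_EVENT.
 * Returns true if the started job eventually got to proceed.
 */
static bool run(struct fsa_queue *q, int max_steps)
{
    bool ccm_data = false;
    for (int step = 0; step < max_steps && q->len > 0; step++) {
        enum job j = dequeue(q);
        if (j == JOB_CCM_EVENT) {
            ccm_data = true;
        } else if (ccm_data) {
            return true;                    /* started job can proceed */
        } else {
            enqueue(q, JOB_STARTED, true);  /* stall: back to the head */
        }
    }
    return false;
}
```

In this model, appending the CCM event behind a stalled JOB_STARTED never
lets it run, while prepending it lets the flag be set before the stalled job
is retried.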

BTW, I have some questions about some code implementation:

1. Why not use libxml2 & a glib n-ary tree to construct the internal XML
representation?
libxml2 can be used to read the skeleton from the XML file, and then we can
convert this base into a glib n-ary tree whose nodes are our internal
structure. When sending the XML data, we just traverse this tree into a
ha_msg structure; when writing the XML file, we can just use libxml2
directly.


Originally the crm was written based on libxml2, until I got so
frustrated with it I ripped it out.

2. Can we slim the election & join protocol?


this may be possible, especially once running on openAIS is possible

Can we slim the state machine?


maybe

if election, integration and finalize were combined (which might be
possible given the above) that might help.

I've also toyed with the idea of using more register values (R_*) and
fewer inputs (but I've not had the time to investigate the feasibility
of this); that too might reduce the state machine's size.

I have always found them complex & hard to understand, and the code is huge.


I'd be lying if I said I didn't get lost in there occasionally too.
However I don't believe there is much "excess fat" available to be
trimmed (only rearranged).

But if you have some ideas for improvements (or even better patches
:-) they would be most welcome.

Maybe the full design & implementation of the crmd family would be meaningful
to linux-ha fans, or even to the developers. Thanks. :)

_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
