[Linux-HA] Re: Proper timing for 'cibadmin -R -x cib.xml'

Takenaka Kazuhiro Mon, 20 Aug 2007 00:03:45 -0700

Hi Andrew.

On 8/14/07, Takenaka Kazuhiro <[EMAIL PROTECTED]> wrote:

> Hi Andrew.
>
>  > On 8/9/07, Takenaka Kazuhiro <[EMAIL PROTECTED]> wrote:
>  >> > Hi Andrew, Thank you for your reply.
>  >> >
>  >> >  > On 8/8/07, Takenaka Kazuhiro <[EMAIL PROTECTED]> wrote:
>  >> >  >> > Hi All.
>  >> >  >> >
>  >> >  >> > I installed Heartbeat 2.1.2 on my cluster and tried
>  >> >  >> > the new way to start a cluster recommended at the following URL.
>  >> >  >> >
>  >> >  >> > http://www.linux-ha.org/v2/faq/cib_changes_detected?highlight=%28v2/faq/%2
>  >> >  >> >
>  >> >  >> > It works well, so I think I should adopt it as the
>  >> >  >> > formal procedure for starting the cluster that I am
>  >> >  >> > planning to test.
>  >> >  >> >
>  >> >  >> > To adopt the new way, I want to know the proper time
>  >> >  >> > to execute 'cibadmin -R -x cib.xml'.  In other words,
>  >> >  >> > I want to know how to detect that a cluster is ready to
>  >> >  >> > respond to requests from client commands.
>  >> >  >> >
>  >> >  >> > If there is a command that can detect this,
>  >> >  >> > that would be best.
>  >> >  >> >
>  >> >  >> > I think 'crm_mon -s' might be what I want.
>  >> >  >> >
>  >> >  >> > If 'crm_mon -s' shows 'Ok' in the 1st field of its report,
>  >> >  >> > I suppose that is the sign that the cluster is ready for
>  >> >  >> > operators' requests.
>  >> >  >> >
>  >> >  >> > Am I right?
>  >> >  >
>  >> >  > the best way, is to run:
>  >> >  >    crmadmin -D   # find out which node is the DC
>  >> >  >    crmadmin -S {uname_of_dc} # find out what status it's in
>  >> >  >
>  >> >  > if it says S_IDLE, then now is a good time to make changes
>  >> >
>  >> > I tried your method on my two-node cluster
>  >> > but found behavior that is a problem for me.
>  >> >
>  >> > First, I ran 'crmadmin -D' before starting
>  >> > my cluster, and the command exited immediately with
>  >> > exit code 254.
>  >> >
>  >> > # crmadmin -D
>  >> > # echo $?
>  >> > 254
>  >> >
>  >> > That went just the way I expected.
>  >> >
>  >> > Next, I started Heartbeat on both nodes of
>  >> > my cluster and ran the command before the DC node
>  >> > was elected.
>  >> >
>  >> > I expected the command to show a message saying
>  >> > that no DC node had been elected and to exit
>  >> > immediately.
>  >> >
>  >> > But 'crmadmin -D' actually paused for tens of seconds,
>  >> > then showed a message and exited with
>  >> > exit code 0.
>  >> >
>  >> > # crmadmin -D
>  >> > No messages received in 30 seconds.. aborting
>  >> > # echo $?
>  >> > 0
>  >
>  > I'll commit this patch shortly that should resolve this:
>  >
>  > diff -r 9355bd3d9af3 crm/admin/crmadmin.c
>  > --- a/crm/admin/crmadmin.c      Thu Aug 09 15:24:21 2007 +0200
>  > +++ b/crm/admin/crmadmin.c      Fri Aug 10 10:06:48 2007 +0200
>  > @@ -632,6 +632,7 @@ admin_message_timeout(gpointer data)
>  >                 (int)message_timeout_ms/1000);
>  >         crm_err("No messages received in %d seconds",
>  >                 (int)message_timeout_ms/1000);
>  > +       operation_status = -3;
>  >         g_main_quit(mainloop);
>  >         return FALSE;
>  >  }
>  >
>
> I read your patch and the source of crmadmin.
>
> I understood your patch, and I found that crmadmin's undocumented
> option '-t' is useful.
>
> If I run 'crmadmin -D -t TIMEOUT-msec', it is certain to
> finish within TIMEOUT-msec, so I can wait for the end of a DC election
> at whatever precision I like. If 'crmadmin' fails
> with exit code 253, I only have to retry until the command
> succeeds.
>
> But I found another problem.
>
> 'crmadmin -D' exits with exit code 1 even when it can
> get and show the node name of the DC.
>
> I found the following message in /var/log/messages after
> the command finished.
>
> Aug 14 14:51:43 it-gx2 crmadmin: [23056]: info: crmd_ipc_connection_destroy: Connection to CRMd was terminated
>
> I think this message is related to the problem.
>
> What do you think?


You're right.
Fixed in http://hg.beekhof.net/lha/crm-dev/rev/46f826ba9650


Thanks for your patches.

Now I can wait for the cluster to be ready
with the following Bash function.

wait_cluster_ready()
{
    typeset dc

    # Poll until a DC has been elected; -t is the timeout in msec.
    while ! dc=$(crmadmin -D -t 1000); do
        echo "DC is not elected yet" 1>&2
        sleep 1
    done

    # Keep only the node name that follows ": ".
    dc=${dc#*: }
    echo "DC is $dc" 1>&2

    typeset dc_status cmd_output cmd_status errcnt=0
    while true; do
        cmd_output=$(crmadmin -S "$dc" -t 1000)
        cmd_status=$?
        case $cmd_status in
            0)  # Got the DC status
                dc_status=${cmd_output#*: }
                dc_status=${dc_status% *}
                if [[ "$dc_status" = "S_IDLE" ]]; then
                    echo "Cluster is now up" 1>&2
                    return 0
                else
                    # Not idle yet; this also counts toward the retry limit.
                    echo "$cmd_output" 1>&2
                    (( errcnt = errcnt + 1 ))
                fi
                ;;
            253) # Connection timeout
                (( errcnt = errcnt + 1 ))
                ;;
            254) # Unable to connect to the DC
                (( errcnt = errcnt + 1 ))
                ;;
            *)   # Unexpected error
                echo "Unexpected error : $cmd_status" 1>&2
                return 1
                ;;
        esac

        # Give up once the retry limit is exceeded.
        if (( errcnt > 10 )); then
            echo "Too many errors occurred" 1>&2
            return 1
        else
            sleep 1
        fi
    done
}
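
The two parameter expansions in the function depend on the shape of
the crmadmin output, so here is a small sketch of what they do. The
sample lines below are only my assumption of the output format (they
are not captured from a real cluster), and the node name "node1" is
hypothetical.

```shell
# Sample lines; assumed format, not captured from a real cluster.
dc_line="Designated Controller is: node1"
status_line="Status of crmd@node1: S_IDLE (ok)"

# Strip everything up to and including the first ": ".
dc=${dc_line#*: }

# Same strip, then drop the last space-separated word.
dc_status=${status_line#*: }
dc_status=${dc_status% *}

echo "dc=$dc status=$dc_status"
```

If the real output differs from these samples, the patterns in the
function need to be adjusted to match.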

Sincerely.
--
Takenaka Kazuhiro <[EMAIL PROTECTED]>
