[Linux-HA] Antw: Re: Q: "lost vote" while network seems up

Ulrich Windl Wed, 12 Oct 2011 01:10:57 -0700

>>> Andrew Beekhof <[email protected]> wrote on 12.10.2011 at 04:42 in message
<CAEDLWG3p+=myur8a45cm4hfprmclvbekxbheiqkovhy++dk...@mail.gmail.com>:
> On Thu, Sep 29, 2011 at 6:09 PM, Ulrich Windl
> <[email protected]> wrote:
> > Hello!
> >
> > I'm examining a case where both nodes of a two node cluster were fenced at
> > the same time. The cluster is running SLES11 SP1 with a corosync 1.4.1 Update
> > to make the rrp stable. I found strange messages:
> >
> > 08:15:25 h02 cib: [10993]: WARN: cib_process_replace: Replacement 0.952.21 not applied to 0.952.23: current num_updates is greater than the replacement
> > 08:15:25 h02 cib: [10993]: WARN: cib_diff_notify: Update (client: crmd, call:13834): -1.-1.-1 -> 0.952.21 (Update was older than existing configuration)
> > 08:15:25 h02 crmd: [10997]: WARN: finalize_sync_callback: Sync from h06 resulted in an error: Update was older than existing configuration
> > 08:15:25 h02 crmd: [10997]: WARN: do_log: FSA: Input I_ELECTION_DC from finalize_sync_callback() received in state S_FINALIZE_JOIN
> 
> Was there a cluster partition at this time?


Hi!

Yes, I had shut down corosync on both nodes for a corosync update. Presumably
the node that terminates last has the latest CIB. Unfortunately you cannot
always start that node first, and even if you do, the second node will have
an obsolete CIB. If you start the wrong node first, that node's CIB may end up
newer (by version number) than the one that was more current (by content). How
does pacemaker handle these situations?
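
To make the "by version number" comparison concrete, here is a minimal sketch of how I read the log message above. It assumes the three dotted fields are admin_epoch, epoch and num_updates and that they are compared left to right; the class and function names are made up for illustration and are not pacemaker's code.

```python
from typing import NamedTuple


class CibVersion(NamedTuple):
    """Hypothetical model of a CIB version string such as '0.952.21'."""
    admin_epoch: int
    epoch: int
    num_updates: int

    @classmethod
    def parse(cls, text: str) -> "CibVersion":
        parts = text.split(".")
        if len(parts) != 3:
            raise ValueError(f"expected three dotted fields, got {text!r}")
        return cls(*(int(p) for p in parts))


def is_older(replacement: CibVersion, current: CibVersion) -> bool:
    # Tuples compare field by field, left to right, so admin_epoch
    # outranks epoch, which in turn outranks num_updates.
    return replacement < current


if __name__ == "__main__":
    current = CibVersion.parse("0.952.23")
    replacement = CibVersion.parse("0.952.21")
    print(is_older(replacement, current))
```

With the two versions from the log, the replacement differs only in num_updates and counts as older, which matches the "not applied" warning. The sketch says nothing about which CIB is more current by content.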

Most cluster software has to handle these problems, but most do so with less
confusing noise in the logs.

Specifically, when is a version considered to be "-1"?

> Looks like one got further ahead than the other, but since we
> regenerate the resource state after an election there is no harm here.

I hoped so ;-)

[...]
> > 08:23:02 h06 crmd: [10847]: debug: crm_compare_age: Loose: 18 vs 268 (seconds)
> > 08:23:02 h06 crmd: [10847]: debug: do_election_count_vote: Election 5 (owner: h02) lost: vote from h02 (Uptime)
> 
> The colon is important.  h06 lost the election because of the vote.
> There was no "lost vote".
> 
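
The reading described above can be sketched as a small parser that splits the message at the colon: the part before it gives the election outcome, the part after it names the vote that caused it. The regular expression is derived only from the single log line quoted here, so treat it as an assumption; other pacemaker versions may word the message differently, and the function name is hypothetical.

```python
import re
from typing import Optional

# Pattern built from the one quoted log line; not an official log format.
ELECTION_RE = re.compile(
    r"Election (?P<id>\d+) \(owner: (?P<owner>[^)]+)\) (?P<outcome>\w+): "
    r"vote from (?P<voter>\S+) \((?P<reason>[^)]+)\)"
)


def parse_election_line(line: str) -> Optional[dict]:
    """Return the fields of an election message, or None if it does not match."""
    match = ELECTION_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["id"] = int(fields["id"])
    return fields


if __name__ == "__main__":
    line = ("08:23:02 h06 crmd: [10847]: debug: do_election_count_vote: "
            "Election 5 (owner: h02) lost: vote from h02 (Uptime)")
    print(parse_election_line(line))
```

For the line above this yields the outcome "lost" with h02 as the voter and "Uptime" as the reason, that is, the election was lost because of h02's vote.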
> > 08:23:02 h06 crmd: [10847]: info: update_dc: Unset DC h02
> > 08:23:03 h06 crmd: [10847]: debug: do_cl_join_finalize_respond: join-6: Join complete. Sending local LRM status to h02
> > 08:23:04 h06 crmd: [10847]: debug: get_xpath_object: No match for //cib_update_result//diff-added//crm_config in /notify/cib_update_result/diff
> > 08:24:01 h06 crmd: [10847]: debug: get_xpath_object: No match for //cib_update_result//diff-added//crm_config in /notify/cib_update_result/diff
> >
> > Around that time I also had this strange message:
> > h02:~ # crm_resource -C -r prm_ocfs_fs_samba:0 -N h06
> > Cleaning up prm_ocfs_fs_samba:0 on h06
> > Waiting for 2 replies from the CRMd.
> >
> > No messages received in 60 seconds.. aborting
> >
> > Does anybody have an idea what could be wrong? I think the network was ok.

I'd like to have an explanation for this as well.

Thanks for explaining, anyway.

Regards,
Ulrich



_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
