On 19 Mar 2014, at 10:15 am, Andrew Beekhof <and...@beekhof.net> wrote:

> 
> On 18 Mar 2014, at 10:04 pm, Gabriel Gomiz <ggo...@cooperativaobrera.coop> 
> wrote:
> 
>> Maybe, this is significant : 'Our DC node 
>> (gandalf.san01.cooperativaobrera.coop) left the cluster' ... ?
> 
> Very. I hadn't noticed it was the DC at the time it died.
> 
>> 
>> Please tell me if you need more details:
> 
> Can I get the file logs from lorien from Mar 08 08:43:00 to 09:14:00 please?
> 

Riiiight, so this is the story:

Mar 08 08:43:22 [9934] lorien       crmd:     info: do_dc_takeover:     Taking over DC status for this partition
Mar 08 08:43:22 [9934] lorien       crmd:   notice: tengine_stonith_notify:     Peer gandalf was terminated (st_notify_fence) by mordor for gandalf: OK (ref=10d27664-33ed-43e0-a5bd-7d0ef850eb05) by client crmd.31561
Mar 08 08:43:22 [9934] lorien       crmd:   notice: tengine_stonith_notify:     Notified CMAN that 'gandalf' is now fenced
Mar 08 08:43:22 [9934] lorien       crmd:   notice: tengine_stonith_notify:     Target may have been our leader gandalf (recorded: <unset>)
Mar 08 09:13:52 [9934] lorien       crmd:     info: do_dc_takeover:     Taking over DC status for this partition
Mar 08 09:13:52 [9934] lorien       crmd:   notice: do_dc_takeover:     Marking gandalf, target of a previous stonith action, as clean

In tengine_stonith_notify() we potentially add things to stonith_cleanup_list 
and then in do_dc_takeover() we check the stonith_cleanup_list and mark any 
nodes in it as clean.

As you can see above, the stonith notification arrives just after the first 
call to do_dc_takeover(), so the list is not processed at that point.
In the version you have there is some dodgy code in tengine_stonith_notify() 
which incorrectly adds gandalf to stonith_cleanup_list. The entry sits there 
until the next election at 09:13:52, when do_dc_takeover() runs again and 
Pacemaker (incorrectly) erases gandalf's status section.
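
To make the ordering easier to follow, here is a toy Python model of the 
sequence in the logs. It is only a sketch and not Pacemaker's actual C code: 
the ToyCrmd class, the guard_when_dc flag and the "rejoined" status value are 
made up for illustration, and the guard is just a stand-in for whatever 
condition the real fix uses. It also assumes gandalf's status was recorded 
again at some point between the two elections.

```python
class ToyCrmd:
    """Toy model of the event ordering; not Pacemaker's real implementation."""

    def __init__(self, guard_when_dc):
        # Hypothetical stand-in for the fix: skip queuing when already DC.
        self.guard_when_dc = guard_when_dc
        self.is_dc = False
        self.stonith_cleanup_list = []
        self.status = {}  # node name -> recorded status section

    def do_dc_takeover(self):
        self.is_dc = True
        # Mark every queued fencing target as clean, erasing its status.
        for node in self.stonith_cleanup_list:
            self.status.pop(node, None)
        self.stonith_cleanup_list.clear()

    def tengine_stonith_notify(self, target):
        if self.guard_when_dc and self.is_dc:
            return  # assumed fixed behaviour: nothing left to clean up later
        self.stonith_cleanup_list.append(target)


def replay(guard_when_dc):
    """Replay the order of events seen in lorien's log."""
    crmd = ToyCrmd(guard_when_dc)
    crmd.do_dc_takeover()                    # 08:43:22, list is still empty
    crmd.tengine_stonith_notify("gandalf")   # 08:43:22, just after takeover
    crmd.status["gandalf"] = "rejoined"      # assumed: status recorded again
    crmd.is_dc = False                       # another election starts
    crmd.do_dc_takeover()                    # 09:13:52, stale entry processed
    return crmd.status


buggy_status = replay(guard_when_dc=False)
fixed_status = replay(guard_when_dc=True)
```

With the guard off, the stale entry survives until the second takeover and 
gandalf's status is dropped; with it on, the status is left alone.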

This was fixed during the RC phase of Pacemaker-1.1.10:

  https://github.com/beekhof/pacemaker/commit/f30e1e43

I don't believe I quite understood the severity of that fix at the time 
(otherwise I'd have made more noise about it).

Since you're on CentOS 6.4, there should already be updated packages that 
include this fix.


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org