Re: [Linux-HA] Help understand an incident

Andrew Beekhof Tue, 03 Jul 2007 08:15:28 -0700

On 7/3/07, Peter Kruse wrote:

Hello list!


today in one of our clusters a failover occurred.  Good news: it
succeeded.  But...  while looking through the logs we found
that messages are missing on one node, so we cannot say exactly
what happened.  Attached is the syslog from node-2 covering the
time for which there are no messages on node-1.  Is it possible
to say from that log what happened on node-1?


if it was just resource actions - then yes.  they'll all be recorded
in the CIB and produce updates like the one below.  look out for
failing monitors which probably triggered everything.

Especially there are messages like this:

Jul  3 11:22:59 beosrv-c-2 cibmon: [16501]: info: mask(cib_apply_diff):
+             <lrm_rsc_op id="nfs:maillastnfs_stop_0" operation="stop"
crm-debug-origin="do_update_resource"
transition_key="6:ad6f57b8-295b-4c20-8e0f-e01494577dfb"
transition_magic="2:152;6:ad6f57b8-295b-4c20-8e0f-e01494577dfb"
call_id="45" rc_code="152" op_status="2" interval="0"
__crm_diff_marker__="added:top"/>

Does that mean the action maillastnfs_stop_0 was run but returned
the status 2?


correct

Or is it possible that the action never was run
on node 1?


you'd have to match the uuid from the enclosing <node_state> object
to confirm which node it was, but yes, it did actually get run - and
according to the enum, op_status 2 := LRM_OP_TIMEOUT

typedef enum {
        LRM_OP_PENDING = -1,
        LRM_OP_DONE,            /*  0 */
        LRM_OP_CANCELLED,       /*  1 */
        LRM_OP_TIMEOUT,         /*  2 */
        LRM_OP_NOTSUPPORTED,    /*  3 */
        LRM_OP_ERROR            /*  4 */
} op_status_t;

for rc values, refer to ocf-returncodes somewhere under /usr/lib/ocf
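
to decode such entries without counting enum members by hand, here is a
minimal sketch in C.  it assumes, judging only from the log line quoted
above, that transition_magic has the form
"op_status:rc_code;transition_key".  the helper names op_status_name()
and parse_transition_magic() are made up for this example and are not
part of heartbeat.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Same enum as quoted in the mail. */
typedef enum {
        LRM_OP_PENDING = -1,
        LRM_OP_DONE,
        LRM_OP_CANCELLED,
        LRM_OP_TIMEOUT,
        LRM_OP_NOTSUPPORTED,
        LRM_OP_ERROR
} op_status_t;

/* Hypothetical helper: printable name for an op_status value. */
static const char *op_status_name(int status)
{
        switch (status) {
        case LRM_OP_PENDING:      return "LRM_OP_PENDING";
        case LRM_OP_DONE:         return "LRM_OP_DONE";
        case LRM_OP_CANCELLED:    return "LRM_OP_CANCELLED";
        case LRM_OP_TIMEOUT:      return "LRM_OP_TIMEOUT";
        case LRM_OP_NOTSUPPORTED: return "LRM_OP_NOTSUPPORTED";
        case LRM_OP_ERROR:        return "LRM_OP_ERROR";
        default:                  return "unknown";
        }
}

/*
 * Hypothetical helper: split a transition_magic string, assumed to look
 * like "op_status:rc_code;transition_key", into its three parts.
 * Returns 0 on success, -1 if the string does not match that layout or
 * the key does not fit into the caller's buffer.
 */
static int parse_transition_magic(const char *magic, int *status, int *rc,
                                  char *key, size_t keylen)
{
        int consumed = 0;

        assert(magic != NULL && status != NULL && rc != NULL && key != NULL);

        /* %n records how many characters were consumed up to the key. */
        if (sscanf(magic, "%d:%d;%n", status, rc, &consumed) != 2
            || consumed == 0)
                return -1;

        if (strlen(magic + consumed) >= keylen)
                return -1;

        snprintf(key, keylen, "%s", magic + consumed);
        return 0;
}
```

feeding it the transition_magic value from the log above should give
status 2 and rc 152, which op_status_name() maps to LRM_OP_TIMEOUT.  the
rc value still has to be looked up in ocf-returncodes as mentioned.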
