Hi,

On Mon, Aug 17, 2009 at 03:16:16PM +0200, Cristina Bulfon wrote:
> Ciao,
>
> during my holiday I had problem with heartbeat. Checking the log file I 
> find the following line
>
> - On ACTIVE machine
>
> heartbeat[27868]: 2009/07/19_08:33:43 info: Daily informational memory 
> statistics
> heartbeat[27868]: 2009/07/19_08:33:43 info: MSG stats: 7/663233 ms age 0 
> [pid27868/MST_CONTROL]
> heartbeat[27868]: 2009/07/19_08:33:43 info: cl_malloc stats: 0/0  0/0 
> [pid27868/MST_CONTROL]
> heartbeat[27868]: 2009/07/19_08:33:43 info: RealMalloc stats: 0 total 
> malloc bytes. pid [27868/MST_CONTROL]
> heartbeat[27868]: 2009/07/19_08:33:43 info: Current arena value: 0
> ......................
> heartbeat[27868]: 2009/07/25_08:34:02 info: MSG stats: 0/0 ms age 
> 5158152440 [pid27875/HBREAD]
> heartbeat[27868]: 2009/07/25_08:34:02 info: cl_malloc stats: 0/0  0/0 
> [pid27875/HBREAD]
> heartbeat[27868]: 2009/07/25_08:34:02 info: RealMalloc stats: 0 total 
> malloc bytes. pid [27875/HBREAD]
> heartbeat[27868]: 2009/07/25_08:34:02 info: Current arena value: 0
> heartbeat[27868]: 2009/07/25_08:34:02 info: These are nothing to worry 
> about.
> lrmd[27881]: 2009/07/26_02:29:11 WARN: G_SIG_dispatch: Dispatch function 
> for SIGCHLD was delayed 1000 ms (> 100 ms) before being called (GSource: 0x
> 519bc0)
> lrmd[27881]: 2009/07/26_02:29:11 info: G_SIG_dispatch: started at 522265947 
> should have started at 522265847
> ........
> heartbeat[27868]: 2009/08/01_08:34:23 info: MSG stats: 0/0 ms age 
> 5762952450 [pid27875/HBREAD]
> heartbeat[27868]: 2009/08/01_08:34:23 info: cl_malloc stats: 0/0  0/0 
> [pid27875/HBREAD]
> heartbeat[27868]: 2009/08/01_08:34:23 info: RealMalloc stats: 0 total 
> malloc bytes. pid [27875/HBREAD]
> heartbeat[27868]: 2009/08/01_08:34:23 info: Current arena value: 0
> heartbeat[27868]: 2009/08/01_08:34:23 info: These are nothing to worry 
> about.
>
> - On STANDBY machine
>
> heartbeat[27879]: 2009/07/19_08:34:13 info: Daily informational memory 
> statistics
> heartbeat[27879]: 2009/07/19_08:34:13 info: MSG stats: 9/663189 ms age 0 
> [pid27879/MST_CONTROL]
> heartbeat[27879]: 2009/07/19_08:34:13 info: cl_malloc stats: 0/0  0/0 
> [pid27879/MST_CONTROL]
> heartbeat[27879]: 2009/07/19_08:34:13 info: RealMalloc stats: 0 total 
> malloc bytes. pid [27879/MST_CONTROL]
> heartbeat[27879]: 2009/07/19_08:34:13 info: Current arena value: 0
> ............................
> heartbeat[27879]: 2009/07/25_08:34:21 info: MSG stats: 0/0 ms age 
> 5158086760 [pid27886/HBREAD]
> heartbeat[27879]: 2009/07/25_08:34:21 info: cl_malloc stats: 0/0  0/0 
> [pid27886/HBREAD]
> heartbeat[27879]: 2009/07/25_08:34:21 info: RealMalloc stats: 0 total 
> malloc bytes. pid [27886/HBREAD]
> heartbeat[27879]: 2009/07/25_08:34:21 info: Current arena value: 0
> heartbeat[27879]: 2009/07/25_08:34:21 info: These are nothing to worry 
> about.
> ............................
> heartbeat[27879]: 2009/08/01_09:11:07 WARN: node afsitfs3.roma1.infn.it: is 
> dead
> heartbeat[27879]: 2009/08/01_09:11:07 info: Link 
> afsitfs3.roma1.infn.it:eth1 dead.
> pingd[27889]: 2009/08/01_09:11:07 notice: pingd_nstatus_callback: Status update: Ping node afsitfs3.roma1.infn.it now has status [dead]
> heartbeat[27879]: 2009/08/01_09:11:07 info: Link afsitfs3.roma1.infn.it:eth0 dead.
> pingd[27889]: 2009/08/01_09:11:07 notice: pingd_lstatus_callback: Status update: Ping node afsitfs3.roma1.infn.it now has status [dead]
> pingd[27889]: 2009/08/01_09:11:07 notice: pingd_nstatus_callback: Status update: Ping node afsitfs3.roma1.infn.it now has status [dead]
> crmd[27895]: 2009/08/01_09:11:07 notice: crmd_ha_status_callback: Status update: Node afsitfs3.roma1.infn.it now has status [dead]
> pingd[27889]: 2009/08/01_09:11:07 notice: pingd_lstatus_callback: Status update: Ping node afsitfs3.roma1.infn.it now has status [dead]
> pingd[27889]: 2009/08/01_09:11:07 notice: pingd_nstatus_callback: Status update: Ping node afsitfs3.roma1.infn.it now has status [dead]
> cib[27891]: 2009/08/01_09:11:07 info: mem_handle_event: Got an event 
> OC_EV_MS_INVALID from ccm
> cib[27891]: 2009/08/01_09:11:07 info: mem_handle_event: no mbr_track info
> cib[27891]: 2009/08/01_09:11:07 info: mem_handle_event: Got an event 
> OC_EV_MS_NEW_MEMBERSHIP from ccm
> cib[27891]: 2009/08/01_09:11:07 info: mem_handle_event: instance=3, 
> nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
> ccm[27890]: 2009/08/01_09:11:07 info: Break tie for 2 nodes cluster
> crmd[27895]: 2009/08/01_09:11:07 info: mem_handle_event: Got an event 
> OC_EV_MS_INVALID from ccm
> cib[27891]: 2009/08/01_09:11:07 info: cib_ccm_msg_callback: LOST: 
> afsitfs3.roma1.infn.it
> crmd[27895]: 2009/08/01_09:11:07 info: mem_handle_event: no mbr_track info
> cib[27891]: 2009/08/01_09:11:07 info: cib_ccm_msg_callback: PEER: 
> afsitfs4.roma1.infn.it
> crmd[27895]: 2009/08/01_09:11:07 info: mem_handle_event: Got an event 
> OC_EV_MS_NEW_MEMBERSHIP from ccm
> crmd[27895]: 2009/08/01_09:11:07 info: mem_handle_event: instance=3, 
> nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
> crmd[27895]: 2009/08/01_09:11:07 info: crmd_ccm_msg_callback: Quorum 
> (re)attained after event=NEW MEMBERSHIP (id=3)
> crmd[27895]: 2009/08/01_09:11:07 info: erase_node_from_join: Removed dead node afsitfs3.roma1.infn.it from join calculations: welcomed=0 itegrated=0 finalized=0 confirmed=1
> crmd[27895]: 2009/08/01_09:11:07 info: ccm_event_detail: NEW MEMBERSHIP: trans=3, nodes=1, new=0, lost=1 n_idx=0, new_idx=1, old_idx=3
> crmd[27895]: 2009/08/01_09:11:07 info: ccm_event_detail:        CURRENT: 
> afsitfs4.roma1.infn.it [nodeid=1, born=3]
> crmd[27895]: 2009/08/01_09:11:07 info: ccm_event_detail:        LOST:    
> afsitfs3.roma1.infn.it [nodeid=0, born=2]
> tengine[27909]: 2009/08/01_09:11:07 WARN: match_down_event: No match for shutdown action on 586817af-703a-4eff-ac9b-b96de063493a
> tengine[27909]: 2009/08/01_09:11:07 info: extract_event: Stonith/shutdown of 586817af-703a-4eff-ac9b-b96de063493a not matched
> crmd[27895]: 2009/08/01_09:11:07 info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_IPC_MESSAGE origin=route_message ]
> tengine[27909]: 2009/08/01_09:11:07 info: update_abort_priority: Abort 
> priority upgraded to 1000000
> crmd[27895]: 2009/08/01_09:11:07 info: do_state_transition: All 1 cluster 
> nodes are eligible to run resources.
> pengine[27910]: 2009/08/01_09:11:07 info: determine_online_status: Node 
> afsitfs4.roma1.infn.it is online
> pengine[27910]: 2009/08/01_09:11:07 notice: group_print: Resource Group: 
> group_1
> pengine[27910]: 2009/08/01_09:11:07 notice: native_print:     
> IPaddr_141_108_26_27      (ocf::heartbeat:IPaddr):        Stopped
>
> pengine[27910]: 2009/08/01_09:11:07 notice: native_print:     Filesystem_2  
>     (ocf::heartbeat:Filesystem):    Stopped
> pengine[27910]: 2009/08/01_09:11:07 notice: native_print:     Filesystem_3  
>     (ocf::heartbeat:Filesystem):    Stopped
> pengine[27910]: 2009/08/01_09:11:07 notice: native_print:     Filesystem_4  
>     (ocf::heartbeat:Filesystem):    Stopped
> pengine[27910]: 2009/08/01_09:11:07 notice: native_print:     Filesystem_5  
>     (ocf::heartbeat:Filesystem):    Stopped
> pengine[27910]: 2009/08/01_09:11:07 notice: native_print:     Filesystem_6  
>     (ocf::heartbeat:Filesystem):    Stopped
> pengine[27910]: 2009/08/01_09:11:07 notice: native_print:     Filesystem_7  
>     (ocf::heartbeat:Filesystem):    Stopped
> pengine[27910]: 2009/08/01_09:11:07 notice: native_print:     afs_8     
> (lsb:afs):      Stopped
> pengine[27910]: 2009/08/01_09:11:07 notice: native_print:     CheckSun_9    
>     (stonith:external/sbd): Stopped
> pengine[27910]: 2009/08/01_09:11:07 WARN: native_color: Resource 
> IPaddr_141_108_26_27 cannot run anywhere
> pengine[27910]: 2009/08/01_09:11:07 WARN: native_color: Resource 
> Filesystem_2 cannot run anywhere
> pengine[27910]: 2009/08/01_09:11:07 WARN: native_color: Resource 
> Filesystem_3 cannot run anywhere
> pengine[27910]: 2009/08/01_09:11:07 WARN: native_color: Resource 
> Filesystem_4 cannot run anywhere
> pengine[27910]: 2009/08/01_09:11:07 WARN: native_color: Resource 
> Filesystem_5 cannot run anywhere
> pengine[27910]: 2009/08/01_09:11:07 WARN: native_color: Resource 
> Filesystem_6 cannot run anywhere
> pengine[27910]: 2009/08/01_09:11:07 WARN: native_color: Resource 
> Filesystem_7 cannot run anywhere
> pengine[27910]: 2009/08/01_09:11:07 WARN: native_color: Resource afs_8 
> cannot run anywhere
> pengine[27910]: 2009/08/01_09:11:07 WARN: native_color: Resource CheckSun_9 
> cannot run anywhere
> crmd[27895]: 2009/08/01_09:11:07 info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]
> tengine[27909]: 2009/08/01_09:11:07 info: process_te_message: Processing graph derived from /var/lib/heartbeat/pengine/pe-warn-3617.bz2
> tengine[27909]: 2009/08/01_09:11:07 info: unpack_graph: Unpacked transition 
> 3: 0 actions in 0 synapses
> tengine[27909]: 2009/08/01_09:11:07 info: run_graph: Transition 3: 
> (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0)
> crmd[27895]: 2009/08/01_09:11:07 info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]
> tengine[27909]: 2009/08/01_09:11:07 info: notify_crmd: Transition 3 status: te_complete - <null>
> pengine[27910]: 2009/08/01_09:11:07 WARN: process_pe_message: Transition 3: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/heartbeat/pengine/pe-warn-3617.bz2
> pengine[27910]: 2009/08/01_09:11:07 info: process_pe_message: Configuration WARNINGs found during PE processing. Please run "crm_verify -L" to identify issues.
> cib[27891]: 2009/08/01_09:14:36 info: cib_stats: Processed 3 operations 
> (13333.00us average, 0% utilization) in the last 10min
>
>
> I don't understand which was the problem on ACTIVE machine and  I don't 

The active machine was not reachable. Why exactly is not clear,
and this node's logs can't tell you that.

> understand either the reason why the service didn't start on STANDBY 
> machine.

Not sure. It's as if there was no stonith resource defined. It
looks like you have shared storage, so you should have fencing
set up, but one can't say without looking at the configuration.
At any rate, the CRM didn't even try to fence the other node.
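For reference, a stonith resource for a shared-storage setup is usually
defined through the crm shell along these lines (a sketch only; the
resource name and the sbd device path are placeholders, and your logs
already show a stonith:external/sbd resource, so compare with what you
actually have configured):

```shell
# Hypothetical sketch: define an sbd-based stonith resource and make
# sure fencing is actually enabled cluster-wide. Adjust the device
# path to your shared-storage partition before using anything like this.
crm configure primitive fencing-sbd stonith:external/sbd \
    params sbd_device="/dev/your-shared-disk-partition" \
    op monitor interval="15s" timeout="60s"
crm configure property stonith-enabled="true"

# Then check what the cluster actually has:
crm configure show
```

If stonith-enabled is off, or the stonith resource can't start (yours
shows as Stopped in the pengine output), the CRM won't fence anything.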

> Anybody can help to debug and solve the problem ?

You have to prepare a report with hb_report. If it's too big to
post to the list, don't post it; open a new bugzilla and attach
it there.
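Something along these lines, run on one of the nodes (the time window
and node names here are just placeholders matching your logs; pick a
range that covers the failure):

```shell
# Hypothetical invocation: collect logs and configuration from both
# nodes for the period around the failure. -f/-t set the from/to
# times, -n lists the nodes, and the last argument names the report.
hb_report -f "2009-08-01 08:00" -t "2009-08-01 10:00" \
    -n "afsitfs3.roma1.infn.it afsitfs4.roma1.infn.it" \
    /tmp/hbreport-20090801
```

It produces a tarball with everything we need to look at this properly.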

Thanks,

Dejan

> thanks
>
> cristina
>
>



> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
