Ciao,

I've just opened a bugzilla entry as you suggested; I hope everything is fine.

thanks

cristina

On Aug 17, 2009, at 11:47 PM, Dejan Muhamedagic wrote:

Hi,

On Mon, Aug 17, 2009 at 03:16:16PM +0200, Cristina Bulfon wrote:
Ciao,

during my holiday I had a problem with heartbeat. Checking the log
files I found the following lines:

- On ACTIVE machine

heartbeat[27868]: 2009/07/19_08:33:43 info: Daily informational memory statistics
heartbeat[27868]: 2009/07/19_08:33:43 info: MSG stats: 7/663233 ms age 0 [pid27868/MST_CONTROL]
heartbeat[27868]: 2009/07/19_08:33:43 info: cl_malloc stats: 0/0  0/0 [pid27868/MST_CONTROL]
heartbeat[27868]: 2009/07/19_08:33:43 info: RealMalloc stats: 0 total malloc bytes. pid [27868/MST_CONTROL]
heartbeat[27868]: 2009/07/19_08:33:43 info: Current arena value: 0
......................
heartbeat[27868]: 2009/07/25_08:34:02 info: MSG stats: 0/0 ms age 5158152440 [pid27875/HBREAD]
heartbeat[27868]: 2009/07/25_08:34:02 info: cl_malloc stats: 0/0  0/0 [pid27875/HBREAD]
heartbeat[27868]: 2009/07/25_08:34:02 info: RealMalloc stats: 0 total malloc bytes. pid [27875/HBREAD]
heartbeat[27868]: 2009/07/25_08:34:02 info: Current arena value: 0
heartbeat[27868]: 2009/07/25_08:34:02 info: These are nothing to worry about.
lrmd[27881]: 2009/07/26_02:29:11 WARN: G_SIG_dispatch: Dispatch function for SIGCHLD was delayed 1000 ms (> 100 ms) before being called (GSource: 0x519bc0)
lrmd[27881]: 2009/07/26_02:29:11 info: G_SIG_dispatch: started at 522265947 should have started at 522265847
........
heartbeat[27868]: 2009/08/01_08:34:23 info: MSG stats: 0/0 ms age 5762952450 [pid27875/HBREAD]
heartbeat[27868]: 2009/08/01_08:34:23 info: cl_malloc stats: 0/0  0/0 [pid27875/HBREAD]
heartbeat[27868]: 2009/08/01_08:34:23 info: RealMalloc stats: 0 total malloc bytes. pid [27875/HBREAD]
heartbeat[27868]: 2009/08/01_08:34:23 info: Current arena value: 0
heartbeat[27868]: 2009/08/01_08:34:23 info: These are nothing to worry about.

- On STANDBY machine

heartbeat[27879]: 2009/07/19_08:34:13 info: Daily informational memory statistics
heartbeat[27879]: 2009/07/19_08:34:13 info: MSG stats: 9/663189 ms age 0 [pid27879/MST_CONTROL]
heartbeat[27879]: 2009/07/19_08:34:13 info: cl_malloc stats: 0/0  0/0 [pid27879/MST_CONTROL]
heartbeat[27879]: 2009/07/19_08:34:13 info: RealMalloc stats: 0 total malloc bytes. pid [27879/MST_CONTROL]
heartbeat[27879]: 2009/07/19_08:34:13 info: Current arena value: 0
............................
heartbeat[27879]: 2009/07/25_08:34:21 info: MSG stats: 0/0 ms age 5158086760 [pid27886/HBREAD]
heartbeat[27879]: 2009/07/25_08:34:21 info: cl_malloc stats: 0/0  0/0 [pid27886/HBREAD]
heartbeat[27879]: 2009/07/25_08:34:21 info: RealMalloc stats: 0 total malloc bytes. pid [27886/HBREAD]
heartbeat[27879]: 2009/07/25_08:34:21 info: Current arena value: 0
heartbeat[27879]: 2009/07/25_08:34:21 info: These are nothing to worry about.
............................
heartbeat[27879]: 2009/08/01_09:11:07 WARN: node afsitfs3.roma1.infn.it: is dead
heartbeat[27879]: 2009/08/01_09:11:07 info: Link afsitfs3.roma1.infn.it:eth1 dead.
pingd[27889]: 2009/08/01_09:11:07 notice: pingd_nstatus_callback: Status update: Ping node afsitfs3.roma1.infn.it now has status [dead]
heartbeat[27879]: 2009/08/01_09:11:07 info: Link afsitfs3.roma1.infn.it:eth0 dead.
pingd[27889]: 2009/08/01_09:11:07 notice: pingd_lstatus_callback: Status update: Ping node afsitfs3.roma1.infn.it now has status [dead]
pingd[27889]: 2009/08/01_09:11:07 notice: pingd_nstatus_callback: Status update: Ping node afsitfs3.roma1.infn.it now has status [dead]
crmd[27895]: 2009/08/01_09:11:07 notice: crmd_ha_status_callback: Status update: Node afsitfs3.roma1.infn.it now has status [dead]
pingd[27889]: 2009/08/01_09:11:07 notice: pingd_lstatus_callback: Status update: Ping node afsitfs3.roma1.infn.it now has status [dead]
pingd[27889]: 2009/08/01_09:11:07 notice: pingd_nstatus_callback: Status update: Ping node afsitfs3.roma1.infn.it now has status [dead]
cib[27891]: 2009/08/01_09:11:07 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
cib[27891]: 2009/08/01_09:11:07 info: mem_handle_event: no mbr_track info
cib[27891]: 2009/08/01_09:11:07 info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
cib[27891]: 2009/08/01_09:11:07 info: mem_handle_event: instance=3, nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
ccm[27890]: 2009/08/01_09:11:07 info: Break tie for 2 nodes cluster
crmd[27895]: 2009/08/01_09:11:07 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
cib[27891]: 2009/08/01_09:11:07 info: cib_ccm_msg_callback: LOST: afsitfs3.roma1.infn.it
crmd[27895]: 2009/08/01_09:11:07 info: mem_handle_event: no mbr_track info
cib[27891]: 2009/08/01_09:11:07 info: cib_ccm_msg_callback: PEER: afsitfs4.roma1.infn.it
crmd[27895]: 2009/08/01_09:11:07 info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
crmd[27895]: 2009/08/01_09:11:07 info: mem_handle_event: instance=3, nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
crmd[27895]: 2009/08/01_09:11:07 info: crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP (id=3)
crmd[27895]: 2009/08/01_09:11:07 info: erase_node_from_join: Removed dead node afsitfs3.roma1.infn.it from join calculations: welcomed=0 itegrated=0 finalized=0 confirmed=1
crmd[27895]: 2009/08/01_09:11:07 info: ccm_event_detail: NEW MEMBERSHIP: trans=3, nodes=1, new=0, lost=1 n_idx=0, new_idx=1, old_idx=3
crmd[27895]: 2009/08/01_09:11:07 info: ccm_event_detail: CURRENT: afsitfs4.roma1.infn.it [nodeid=1, born=3]
crmd[27895]: 2009/08/01_09:11:07 info: ccm_event_detail:        LOST: afsitfs3.roma1.infn.it [nodeid=0, born=2]
tengine[27909]: 2009/08/01_09:11:07 WARN: match_down_event: No match for shutdown action on 586817af-703a-4eff-ac9b-b96de063493a
tengine[27909]: 2009/08/01_09:11:07 info: extract_event: Stonith/shutdown of 586817af-703a-4eff-ac9b-b96de063493a not matched
crmd[27895]: 2009/08/01_09:11:07 info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_IPC_MESSAGE origin=route_message ]
tengine[27909]: 2009/08/01_09:11:07 info: update_abort_priority: Abort priority upgraded to 1000000
crmd[27895]: 2009/08/01_09:11:07 info: do_state_transition: All 1 cluster nodes are eligible to run resources.
pengine[27910]: 2009/08/01_09:11:07 info: determine_online_status: Node afsitfs4.roma1.infn.it is online
pengine[27910]: 2009/08/01_09:11:07 notice: group_print: Resource Group: group_1
pengine[27910]: 2009/08/01_09:11:07 notice: native_print:     IPaddr_141_108_26_27      (ocf::heartbeat:IPaddr):        Stopped
pengine[27910]: 2009/08/01_09:11:07 notice: native_print:     Filesystem_2    (ocf::heartbeat:Filesystem):    Stopped
pengine[27910]: 2009/08/01_09:11:07 notice: native_print:     Filesystem_3    (ocf::heartbeat:Filesystem):    Stopped
pengine[27910]: 2009/08/01_09:11:07 notice: native_print:     Filesystem_4    (ocf::heartbeat:Filesystem):    Stopped
pengine[27910]: 2009/08/01_09:11:07 notice: native_print:     Filesystem_5    (ocf::heartbeat:Filesystem):    Stopped
pengine[27910]: 2009/08/01_09:11:07 notice: native_print:     Filesystem_6    (ocf::heartbeat:Filesystem):    Stopped
pengine[27910]: 2009/08/01_09:11:07 notice: native_print:     Filesystem_7    (ocf::heartbeat:Filesystem):    Stopped
pengine[27910]: 2009/08/01_09:11:07 notice: native_print:     afs_8   (lsb:afs):      Stopped
pengine[27910]: 2009/08/01_09:11:07 notice: native_print:     CheckSun_9      (stonith:external/sbd): Stopped
pengine[27910]: 2009/08/01_09:11:07 WARN: native_color: Resource IPaddr_141_108_26_27 cannot run anywhere
pengine[27910]: 2009/08/01_09:11:07 WARN: native_color: Resource Filesystem_2 cannot run anywhere
pengine[27910]: 2009/08/01_09:11:07 WARN: native_color: Resource Filesystem_3 cannot run anywhere
pengine[27910]: 2009/08/01_09:11:07 WARN: native_color: Resource Filesystem_4 cannot run anywhere
pengine[27910]: 2009/08/01_09:11:07 WARN: native_color: Resource Filesystem_5 cannot run anywhere
pengine[27910]: 2009/08/01_09:11:07 WARN: native_color: Resource Filesystem_6 cannot run anywhere
pengine[27910]: 2009/08/01_09:11:07 WARN: native_color: Resource Filesystem_7 cannot run anywhere
pengine[27910]: 2009/08/01_09:11:07 WARN: native_color: Resource afs_8 cannot run anywhere
pengine[27910]: 2009/08/01_09:11:07 WARN: native_color: Resource CheckSun_9 cannot run anywhere
crmd[27895]: 2009/08/01_09:11:07 info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]
tengine[27909]: 2009/08/01_09:11:07 info: process_te_message: Processing graph derived from /var/lib/heartbeat/pengine/pe-warn-3617.bz2
tengine[27909]: 2009/08/01_09:11:07 info: unpack_graph: Unpacked transition 3: 0 actions in 0 synapses
tengine[27909]: 2009/08/01_09:11:07 info: run_graph: Transition 3: (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0)
crmd[27895]: 2009/08/01_09:11:07 info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_IPC_MESSAGE origin=route_message ]
tengine[27909]: 2009/08/01_09:11:07 info: notify_crmd: Transition 3 status: te_complete - <null>
pengine[27910]: 2009/08/01_09:11:07 WARN: process_pe_message: Transition 3: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/heartbeat/pengine/pe-warn-3617.bz2
pengine[27910]: 2009/08/01_09:11:07 info: process_pe_message: Configuration WARNINGs found during PE processing.  Please run "crm_verify -L" to identify issues.
cib[27891]: 2009/08/01_09:14:36 info: cib_stats: Processed 3 operations (13333.00us average, 0% utilization) in the last 10min


I don't understand which was the problem on ACTIVE machine and I don't

It was not reachable. It's not clear why, though, and this node's
logs can't tell us the reason.

understand either the reason why the service didn't start on STANDBY
machine.

Not sure. It's as if no stonith resource were defined. It looks
like you have shared storage, so you should have fencing set up,
but one can't say without seeing the configuration. At any rate,
the CRM didn't even try to fence the other node.
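For reference, a fencing setup in the crm shell would look roughly
like this. This is only a sketch: the resource name, device path, and
timings below are placeholders, not taken from your cluster (your log
does show a stonith:external/sbd resource named CheckSun_9, so check
how that one is configured and whether stonith is enabled):

```
# Hypothetical example -- adjust names, the sbd_device path,
# and operation timings to match your environment.
primitive sbd-fencing stonith:external/sbd \
        params sbd_device="/dev/sdX1" \
        op monitor interval="15s" timeout="60s"
property stonith-enabled="true"
```

Without stonith-enabled (and a startable stonith resource), the CRM
will not attempt to fence a failed peer.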

Anybody can help to debug and solve the problem ?

You have to prepare a report with hb_report. If it's too big, don't
post it to the list; open a new bugzilla and attach it there.
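Something along these lines should do (the time and destination path
are just examples; pick a from-time that covers the 2009/08/01 event
on both nodes):

```
# Collect logs, CIB, and config from all nodes starting at the
# given time; produces a tarball you can attach to the bugzilla.
hb_report -f "2009/08/01 08:00" /tmp/hb_report_afsitfs
```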

Thanks,

Dejan

thanks

cristina





_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems



