Hi,
On Thu, Jul 09, 2009 at 03:55:09PM -0400, Ken P Maier wrote:
> Hi, Sorry if this is a repeat. I have checked the faq.
>
> I get this error 1 out of 20 times when starting heartbeat for the first
> time.
It always happens only on startup?
> I'm running version 2.99.3.
>
> Here's my log.
>
> Jul 9 14:28:45 (none) heartbeat: [760]: WARN: Core dumps could be lost if
> multiple dumps occur.
> Jul 9 14:28:45 (none) heartbeat: [760]: WARN: Consider setting
> non-default value in /proc/sys/kernel/core_pattern (or equivalent) f
> or maximum supportability
> Jul 9 14:28:45 (none) heartbeat: [760]: WARN: Consider setting
> /proc/sys/kernel/core_uses_pid (or equivalent) to 1 for maximum supp
> ortability
> Jul 9 14:28:45 (none) heartbeat: [760]: info: Version 2 support: false
> Jul 9 14:28:45 (none) heartbeat: [760]: WARN: Logging daemon is disabled
> --enabling logging daemon is recommended
> Jul 9 14:28:46 (none) heartbeat: [760]: info: **************************
> Jul 9 14:28:46 (none) heartbeat: [760]: info: Configuration validated.
> Starting heartbeat 2.99.3
> Jul 9 14:28:46 (none) heartbeat: [761]: info: heartbeat: version 2.99.3
> Jul 9 14:28:46 (none) heartbeat: [761]: info: Heartbeat generation:
> 1247147626
> Jul 9 14:50:08 (none) last message repeated 61 timesUDP Broadcast
> heartbeat started on port 694 (694) interface eth1
> Jul 9 14:28:46 (none) heartbeat: [761]: info: glib: UDP Broadcast
> heartbeat closed on port 694 interface eth1 - Status: 1
> Jul 9 14:28:46 (none) heartbeat: [761]: info: glib: ucast: write socket
> priority set to IPTOS_LOWDELAY on eth0
> Jul 9 14:28:46 (none) heartbeat: [761]: info: glib: ucast: bound send
> socket to device: eth0
> Jul 9 14:28:46 (none) heartbeat: [761]: info: glib: ucast: bound receive
> socket to device: eth0
> Jul 9 14:28:46 (none) heartbeat: [761]: info: glib: ucast: started on
> port 694 interface eth0 to 147.139.22.45
> Jul 9 14:28:46 (none) heartbeat: [761]: info: glib: ucast: write socket
> priority set to IPTOS_LOWDELAY on eth0
> Jul 9 14:28:46 (none) heartbeat: [761]: info: glib: ucast: bound send
> socket to device: eth0
> Jul 9 14:28:46 (none) heartbeat: [761]: info: glib: ucast: bound receive
> socket to device: eth0
> Jul 9 14:28:46 (none) heartbeat: [761]: info: glib: ucast: started on
> port 694 interface eth0 to 147.139.22.44
> Jul 9 14:28:46 (none) heartbeat: [761]: info: glib: ping heartbeat
> started.
> Jul 9 14:28:46 (none) heartbeat: [761]: info: G_main_add_TriggerHandler:
> Added signal manual handler
> Jul 9 14:28:46 (none) heartbeat: [761]: info: G_main_add_TriggerHandler:
> Added signal manual handler
> Jul 9 14:28:46 (none) heartbeat: [761]: info: G_main_add_SignalHandler:
> Added signal handler for signal 17
> Jul 9 14:28:46 (none) heartbeat: [761]: info: Local status now set to:
> 'up'
> Jul 9 14:28:47 (none) heartbeat: [761]: info: Link
> 147.139.16.250:147.139.16.250 up.
> Jul 9 14:28:47 (none) heartbeat: [761]: info: Status update for node
> 147.139.16.250: status ping
> Jul 9 14:28:47 (none) heartbeat: [761]: info: Link ems2:eth1 up.
> Jul 9 14:28:48 (none) heartbeat: [764]: ERROR: fifo_child: here???
> Jul 9 14:29:06 (none) last message repeated 18 times
> Jul 9 14:29:06 (none) heartbeat: [761]: WARN: node ems1: is dead
> Jul 9 14:29:06 (none) heartbeat: [761]: info: Comm_now_up(): updating
> status to active
> Jul 9 14:29:06 (none) heartbeat: [761]: info: Local status now set to:
> 'active'
> Jul 9 14:29:06 (none) heartbeat: [761]: info: Starting child client
> "/usr/lib/heartbeat/ipfail" (6343,65)
> Jul 9 14:29:06 (none) heartbeat: [761]: WARN: No STONITH device
> configured.
> Jul 9 14:29:06 (none) heartbeat: [761]: WARN: Shared disks are not
> protected.
> Jul 9 14:29:06 (none) heartbeat: [761]: info: Resources being acquired
> from ems1.
> Jul 9 14:29:06 (none) heartbeat: [775]: info: Starting
> "/usr/lib/heartbeat/ipfail" as uid 6343 gid 65 (pid 775)
> Jul 9 14:29:07 (none) heartbeat: [764]: WARN: ha_msg_add_nv_depth: line
> doesn't contain '='
> Jul 9 14:29:07 (none) heartbeat: [764]: info: >>>
> Jul 9 14:29:07 (none) heartbeat: [764]: ERROR: NV failure (msgfromsteam):
> [>>> ]
> Jul 9 14:29:07 (none) heartbeat: [764]: ERROR: fifo_child: here???
> Jul 9 14:29:07 (none) heartbeat: [777]: info: Local Resource acquisition
> completed.
> Jul 9 14:29:07 (none) heartbeat: [761]: info: Initial resource
> acquisition complete (T_RESOURCES(us))
> Jul 9 14:29:08 (none) heartbeat: [764]: ERROR: fifo_child: here???
> Jul 9 14:29:16 (none) last message repeated 8 times
> Jul 9 14:29:17 (none) heartbeat: [761]: info: Local Resource acquisition
> completed. (none)
> Jul 9 14:29:17 (none) heartbeat: [761]: info: local resource transition
> completed.
> Jul 9 14:29:17 (none) heartbeat: [764]: ERROR: fifo_child: here???
>
>
> Is this a know problem or is it me?
Don't know :) The message heartbeat gets is not formatted
correctly. Can't recall anybody reporting that myself, though I
can see one message with the same footprint in the archives. The
question is who's sending this message. Perhaps you can set debug
to 2 and see if there are any clues in the logs.
Thanks,
Dejan
> I did add some debug in function fifo_child(), as shown below.
>
> /* Make sure we check for death of parent every so often... */
> for (;;) {
>
> setmsalarm(1000L);
> msg = msgfromstream(fifo);
> setmsalarm(0L);
> hb_check_mcp_alive();
> hb_signal_process_pending();
>
> if (msg) {
> IPC_Message* m;
> if (DEBUGDETAILS) {
> cl_log(LOG_DEBUG, "fifo_child message:");
> cl_log_message(LOG_DEBUG, msg);
> }
> m = hamsg2ipcmsg(msg, chan);
> if (m) {
> /* Send frees "m" "at the right time" */
> chan->ops->send(chan, m);
> hb_check_mcp_alive();
> hb_signal_process_pending();
> chan->ops->waitout(chan);
> hb_check_mcp_alive();
> hb_signal_process_pending();
> }
> ha_msg_del(msg);
> msg = NULL;
>
> }else if (feof(fifo)) {
> cl_log(LOG_ERR, "fifo_child: EOF on FIFO");
> if (ANYDEBUG) {
> return_to_orig_privs();
> cl_log(LOG_DEBUG
> , "fifo_child: EOF on FIFO");
> }
> hb_check_mcp_alive();
> exit(2);
> } else {
> cl_log(LOG_ERR, "fifo_child: here???");
> }
> cl_cpu_limit_update();
> cl_realtime_malloc_check();
> hb_check_mcp_alive();
> hb_signal_process_pending();
> }
> /*notreached*/
>
>
> Many Thanks,
> Ken
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems