Hi, On Tue, Jun 17, 2008 at 02:29:49PM -0400, Greg Haase wrote: > Last week I emailed the list regarding a node failover that occurred > when IPAddr monitor timed out. At the same time, my log was showing > G_SIG_dispatch delays in lrmd. The thread ended in a petty discussion > over what was the proper time out value (although all the examples on > linux-ha.org show 3s here, it was suggested that I bump mine from 5s to > 15s).
Hmpf, those examples should be updated. The default action timeout is set to 20s. See /usr/lib/heartbeat/pengine metadata. > A few minutes ago, I experienced another failover, this one due to drbd > monitor failure. None of my other logs show any kind of disk error. In > fact, my MySQL error log (located on the very drbd disk that failed) > shows the shutdown messages subsequently issued by heartbeat. Again, the > monitor failure occurs at the same time that a G_SIG_display delay > occurs. > > Now does anyone have any idea why these errors may be occurring and is > there a way to resolve them. I can recall that this node is beefy (right?), but the only explanation I can offer is that the lrmd processing is delayed. Those warnings mean exactly what they say, i.e. that a signal handler was invoked later than what is normally expected. Note that they do usually mean that the node is overwhelmed and can't cope with the load. One possible exception is if there's a bug. There used to be one, but it has been fixed before the 2.1.3 release (I suppose that you're using that one). If the process (in this case an instance of the drbddisk resource agent running a monitor operation) times out, there's nothing one can do about it, but consider that the operation has failed. Why the process times out, I really can't say. That depends on what's happening on the computer. Perhaps the kernel hasn't scheduled it (it used to happen even with heartbeats with one kernel release). BTW, you never posted a full report including all logs, configuration, etc. You can use hb_report to create one and open a bugzilla. Thanks, Dejan > Please see attached log snippet. > > > lrmd[11173]: 2008/06/17_13:37:08 WARN: pingd_child:1:monitor process (PID > 1001) timed out (try 1). Killing with signal SIGTERM (15). > lrmd[11173]: 2008/06/17_13:37:09 WARN: G_SIG_dispatch: Dispatch function for > SIGCHLD was delayed 1210 ms (> 100 ms) before being called (GSource: > 0xc50e9b8) > lrmd[11173]: 2008/06/17_13:37:09 info: G_SIG_dispatch: started at 496169216 > should have started at 496169095 > lrmd[11173]: 2008/06/17_14:08:16 WARN: pingd_child:1:monitor process (PID > 5207) timed out (try 1). Killing with signal SIGTERM (15). > lrmd[11173]: 2008/06/17_14:08:16 WARN: drbddisk_mysql:monitor process (PID > 5205) timed out (try 1). Killing with signal SIGTERM (15). > lrmd[11173]: 2008/06/17_14:08:17 WARN: There is something wrong: the first > line isn't read in. Maybe the heartbeat does not ouput string correctly for > status operation. Or the code (myself) is wrong. > lrmd[11173]: 2008/06/17_14:08:17 WARN: operation monitor[35] on > heartbeat::drbddisk::drbddisk_mysql for client 11176, its parameters: > target_role=[started] CRM_meta_interval=[30000] 1=[mysql_data] > CRM_meta_id=[drbddisk_mysql_mon] CRM_meta_timeout=[10000] > crm_feature_set=[2.1] CRM_meta_name=[monitor] : pid [5205] timed out > lrmd[11173]: 2008/06/17_14:08:17 WARN: operation monitor[49] on > ocf::pingd::pingd_child:1 for client 11176, its parameters: multiplier=[100] > CRM_meta_interval=[15000] CRM_meta_prereq=[nothing] dampen=[5s] > CRM_meta_id=[pingd_child_mon] CRM_meta_timeout=[5000] crm_feature_set=[2.1] > CRM_meta_clone_max=[2] CRM_meta_name=[monitor] > CRM_meta_globally_unique=[false] CRM_meta_clone=[1] : pid [5207] timed out > crmd[11176]: 2008/06/17_14:08:17 ERROR: process_lrm_event: LRM operation > drbddisk_mysql_monitor_30000 (35) Timed Out (timeout=10000ms) > lrmd[11173]: 2008/06/17_14:08:17 WARN: G_SIG_dispatch: Dispatch function for > SIGCHLD took too long to execute: 930 ms (> 300 ms) (GSource: 0xc50e9b8) > crmd[11176]: 2008/06/17_14:08:17 ERROR: process_lrm_event: LRM operation > pingd_child:1_monitor_15000 (49) Timed Out (timeout=5000ms) > tengine[17203]: 2008/06/17_14:08:17 info: process_graph_event: Detected > action drbddisk_mysql_monitor_30000 from a different transition: 10 vs. 17 > tengine[17203]: 2008/06/17_14:08:17 info: update_abort_priority: Abort > priority upgraded to 1000000 > tengine[17203]: 2008/06/17_14:08:17 WARN: update_failcount: Updating > failcount for drbddisk_mysql on 88f6568d-e0ec-40a3-bed5-cd6b0762ef42 after > failed monitor: rc=-2 > crmd[11176]: 2008/06/17_14:08:17 info: do_state_transition: State transition > S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_IPC_MESSAGE > origin=route_message ] > crmd[11176]: 2008/06/17_14:08:17 info: do_state_transition: All 2 cluster > nodes are eligible to run resources. > _______________________________________________ > Linux-HA mailing list > [email protected] > http://lists.linux-ha.org/mailman/listinfo/linux-ha > See also: http://linux-ha.org/ReportingProblems _______________________________________________ Linux-HA mailing list [email protected] http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
