Last week I emailed the list about a node failover that occurred when the IPAddr monitor timed out. At the same time, my log was showing G_SIG_dispatch delays in lrmd. The thread ended in a petty discussion about the proper timeout value (although all the examples on linux-ha.org show 3s here, it was suggested that I bump mine from 5s to 15s).
A few minutes ago I experienced another failover, this one due to a drbd monitor failure. None of my other logs show any kind of disk error. In fact, my MySQL error log (located on the very drbd disk that failed) shows the shutdown messages subsequently issued by heartbeat. Again, the monitor failure occurs at the same time as a G_SIG_dispatch delay. Does anyone have any idea why these errors may be occurring, and is there a way to resolve them? Please see the attached log snippet.
lrmd[11173]: 2008/06/17_13:37:08 WARN: pingd_child:1:monitor process (PID 1001) timed out (try 1). Killing with signal SIGTERM (15).
lrmd[11173]: 2008/06/17_13:37:09 WARN: G_SIG_dispatch: Dispatch function for SIGCHLD was delayed 1210 ms (> 100 ms) before being called (GSource: 0xc50e9b8)
lrmd[11173]: 2008/06/17_13:37:09 info: G_SIG_dispatch: started at 496169216 should have started at 496169095
lrmd[11173]: 2008/06/17_14:08:16 WARN: pingd_child:1:monitor process (PID 5207) timed out (try 1). Killing with signal SIGTERM (15).
lrmd[11173]: 2008/06/17_14:08:16 WARN: drbddisk_mysql:monitor process (PID 5205) timed out (try 1). Killing with signal SIGTERM (15).
lrmd[11173]: 2008/06/17_14:08:17 WARN: There is something wrong: the first line isn't read in. Maybe the heartbeat does not ouput string correctly for status operation. Or the code (myself) is wrong.
lrmd[11173]: 2008/06/17_14:08:17 WARN: operation monitor[35] on heartbeat::drbddisk::drbddisk_mysql for client 11176, its parameters: target_role=[started] CRM_meta_interval=[30000] 1=[mysql_data] CRM_meta_id=[drbddisk_mysql_mon] CRM_meta_timeout=[10000] crm_feature_set=[2.1] CRM_meta_name=[monitor] : pid [5205] timed out
lrmd[11173]: 2008/06/17_14:08:17 WARN: operation monitor[49] on ocf::pingd::pingd_child:1 for client 11176, its parameters: multiplier=[100] CRM_meta_interval=[15000] CRM_meta_prereq=[nothing] dampen=[5s] CRM_meta_id=[pingd_child_mon] CRM_meta_timeout=[5000] crm_feature_set=[2.1] CRM_meta_clone_max=[2] CRM_meta_name=[monitor] CRM_meta_globally_unique=[false] CRM_meta_clone=[1] : pid [5207] timed out
crmd[11176]: 2008/06/17_14:08:17 ERROR: process_lrm_event: LRM operation drbddisk_mysql_monitor_30000 (35) Timed Out (timeout=10000ms)
lrmd[11173]: 2008/06/17_14:08:17 WARN: G_SIG_dispatch: Dispatch function for SIGCHLD took too long to execute: 930 ms (> 300 ms) (GSource: 0xc50e9b8)
crmd[11176]: 2008/06/17_14:08:17 ERROR: process_lrm_event: LRM operation pingd_child:1_monitor_15000 (49) Timed Out (timeout=5000ms)
tengine[17203]: 2008/06/17_14:08:17 info: process_graph_event: Detected action drbddisk_mysql_monitor_30000 from a different transition: 10 vs. 17
tengine[17203]: 2008/06/17_14:08:17 info: update_abort_priority: Abort priority upgraded to 1000000
tengine[17203]: 2008/06/17_14:08:17 WARN: update_failcount: Updating failcount for drbddisk_mysql on 88f6568d-e0ec-40a3-bed5-cd6b0762ef42 after failed monitor: rc=-2
crmd[11176]: 2008/06/17_14:08:17 info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_IPC_MESSAGE origin=route_message ]
crmd[11176]: 2008/06/17_14:08:17 info: do_state_transition: All 2 cluster nodes are eligible to run resources.
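In case it helps anyone check the same correlation in their own logs, here is a rough Python sketch that pairs each "monitor process ... timed out" warning with any G_SIG_dispatch warning logged within a couple of seconds of it. The function names (`parse`, `correlate`) and the 2-second window are my own assumptions for illustration, not anything heartbeat provides, and the regular expression only assumes the "daemon[pid]: date_time LEVEL: message" layout seen in the log snippet above.

```python
import re
from datetime import datetime, timedelta

# Matches the "daemon[pid]: YYYY/MM/DD_HH:MM:SS LEVEL: message" prefix
# seen in the log snippet. This layout is assumed, not guaranteed.
LINE_RE = re.compile(
    r"^(?P<daemon>\w+)\[(?P<pid>\d+)\]:\s+"
    r"(?P<ts>\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>\w+):\s+(?P<msg>.*)$"
)


def parse(lines):
    """Yield (timestamp, daemon, message) for each recognizable log line."""
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y/%m/%d_%H:%M:%S")
            yield ts, m.group("daemon"), m.group("msg")


def correlate(lines, window_seconds=2):
    """Pair each monitor timeout with G_SIG_dispatch warnings close in time.

    Returns a list of (timestamp, timeout_message, [dispatch_messages]).
    The window size is an arbitrary choice for illustration.
    """
    timeouts, delays = [], []
    for ts, _daemon, msg in parse(lines):
        if "monitor process" in msg and "timed out" in msg:
            timeouts.append((ts, msg))
        elif msg.startswith("G_SIG_dispatch: Dispatch function"):
            delays.append((ts, msg))

    window = timedelta(seconds=window_seconds)
    return [
        (t_ts, t_msg,
         [d_msg for d_ts, d_msg in delays if abs(d_ts - t_ts) <= window])
        for t_ts, t_msg in timeouts
    ]


if __name__ == "__main__":
    import sys

    # Usage: python correlate.py < ha-log
    for ts, timeout_msg, nearby in correlate(sys.stdin):
        print(ts, timeout_msg)
        for d in nearby:
            print("    near:", d)
```

A timeout that prints with no "near:" lines under it would be one that did not coincide with a dispatch delay, which would argue against the two being related.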
_______________________________________________
Linux-HA mailing list
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
