Last week I emailed the list regarding a node failover that occurred
when IPAddr monitor timed out. At the same time, my log was showing
G_SIG_dispatch delays in lrmd.  The thread ended in a petty discussion
over what was the proper time out value (although all the examples on
linux-ha.org show 3s here, it was suggested that I bump mine from 5s to
15s).

A few minutes ago, I experienced another failover, this one due to drbd
monitor failure. None of my other logs show any kind of disk error. In
fact, my MySQL error log (located on the very drbd disk that failed)
shows the shutdown messages subsequently issued by heartbeat. Again, the
monitor failure occurs at the same time that a G_SIG_display delay
occurs.

Now does anyone have any idea why these errors may be occurring and is
there a way to resolve them.

Please see attached log snippet.


lrmd[11173]: 2008/06/17_13:37:08 WARN: pingd_child:1:monitor process (PID 1001) timed out (try 1).  Killing with signal SIGTERM (15).
lrmd[11173]: 2008/06/17_13:37:09 WARN: G_SIG_dispatch: Dispatch function for SIGCHLD was delayed 1210 ms (> 100 ms) before being called (GSource: 0xc50e9b8)
lrmd[11173]: 2008/06/17_13:37:09 info: G_SIG_dispatch: started at 496169216 should have started at 496169095
lrmd[11173]: 2008/06/17_14:08:16 WARN: pingd_child:1:monitor process (PID 5207) timed out (try 1).  Killing with signal SIGTERM (15).
lrmd[11173]: 2008/06/17_14:08:16 WARN: drbddisk_mysql:monitor process (PID 5205) timed out (try 1).  Killing with signal SIGTERM (15).
lrmd[11173]: 2008/06/17_14:08:17 WARN: There is something wrong: the first line isn't read in. Maybe the heartbeat does not ouput string correctly for status operation. Or the code (myself) is wrong.
lrmd[11173]: 2008/06/17_14:08:17 WARN: operation monitor[35] on heartbeat::drbddisk::drbddisk_mysql for client 11176, its parameters: target_role=[started] CRM_meta_interval=[30000] 1=[mysql_data] CRM_meta_id=[drbddisk_mysql_mon] CRM_meta_timeout=[10000] crm_feature_set=[2.1] CRM_meta_name=[monitor] : pid [5205] timed out
lrmd[11173]: 2008/06/17_14:08:17 WARN: operation monitor[49] on ocf::pingd::pingd_child:1 for client 11176, its parameters: multiplier=[100] CRM_meta_interval=[15000] CRM_meta_prereq=[nothing] dampen=[5s] CRM_meta_id=[pingd_child_mon] CRM_meta_timeout=[5000] crm_feature_set=[2.1] CRM_meta_clone_max=[2] CRM_meta_name=[monitor] CRM_meta_globally_unique=[false] CRM_meta_clone=[1] : pid [5207] timed out
crmd[11176]: 2008/06/17_14:08:17 ERROR: process_lrm_event: LRM operation drbddisk_mysql_monitor_30000 (35) Timed Out (timeout=10000ms)
lrmd[11173]: 2008/06/17_14:08:17 WARN: G_SIG_dispatch: Dispatch function for SIGCHLD took too long to execute: 930 ms (> 300 ms) (GSource: 0xc50e9b8)
crmd[11176]: 2008/06/17_14:08:17 ERROR: process_lrm_event: LRM operation pingd_child:1_monitor_15000 (49) Timed Out (timeout=5000ms)
tengine[17203]: 2008/06/17_14:08:17 info: process_graph_event: Detected action drbddisk_mysql_monitor_30000 from a different transition: 10 vs. 17
tengine[17203]: 2008/06/17_14:08:17 info: update_abort_priority: Abort priority upgraded to 1000000
tengine[17203]: 2008/06/17_14:08:17 WARN: update_failcount: Updating failcount for drbddisk_mysql on 88f6568d-e0ec-40a3-bed5-cd6b0762ef42 after failed monitor: rc=-2
crmd[11176]: 2008/06/17_14:08:17 info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_IPC_MESSAGE origin=route_message ]
crmd[11176]: 2008/06/17_14:08:17 info: do_state_transition: All 2 cluster nodes are eligible to run resources.
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Reply via email to