Although Linux-HA for Solaris works OK on simple configurations, I've never yet had it pass BasicSanityCheck.
'heartbeat' is getting stuck (I think looping, judging by the rapidly growing logfile). Something (I know not what) is triggering it to produce the "Cannot write to media pipe" error (heartbeat.c around line 1608).

1. It would be nice to know what causes that initial failure (but I suspect you'll need more information, won't you... :-)

2. From that point, the log file starts growing rapidly:

heartbeat[26575]: 2007/07/10_15:53:30 ERROR: Cannot write to media pipe 2: Resource temporarily unavailable
heartbeat[26575]: 2007/07/10_15:53:30 ERROR: Shutting down.
heartbeat[26575]: 2007/07/10_15:53:30 debug: hb_initiate_shutdown() called.
heartbeat[26575]: 2007/07/10_15:53:30 ERROR: Cannot write to media pipe 2: Resource temporarily unavailable
heartbeat[26575]: 2007/07/10_15:53:30 ERROR: Shutting down.
heartbeat[26575]: 2007/07/10_15:53:30 debug: hb_initiate_shutdown() called.
heartbeat[26575]: 2007/07/10_15:53:30 debug: hb_initiate_shutdown(): shutdown already in progress
[...]

That second set, from "Cannot write" to "shutdown already in progress", keeps recurring, several times per second. So are "send_to_all_media()" and "hb_initiate_shutdown()" recursing without bound? I suspect so, which would be a bug.

Possible side-issue (or related?).
Right from the moment it starts until the above failure, there are messages of the form:

heartbeat[26575]: 2007/07/10_15:53:07 WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 20 ms (> 15 ms) before being called (GSource: 0xe9e10)
heartbeat[26575]: 2007/07/10_15:53:07 info: Gmain_timeout_dispatch: started at 33394964 should have started at 33394962
heartbeat[26575]: 2007/07/10_15:53:08 WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 20 ms (> 15 ms) before being called (GSource: 0xe9b10)
heartbeat[26575]: 2007/07/10_15:53:08 info: Gmain_timeout_dispatch: started at 33394971 should have started at 33394969
heartbeat[26575]: 2007/07/10_15:53:08 WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 20 ms (> 15 ms) before being called (GSource: 0xe9e10)
heartbeat[26575]: 2007/07/10_15:53:08 info: Gmain_timeout_dispatch: started at 33394971 should have started at 33394969
heartbeat[26575]: 2007/07/10_15:53:08 WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 20 ms (> 15 ms) before being called (GSource: 0xe9b10)

The functions and times vary.

This is an old, slow, small-memory machine. Are the checks too tight? Or is something more fundamentally wrong?

(Usual excuses: heartbeat is a spare-time (not much of it) activity; that time is mostly spent chasing aspects of portability rather than runtime.)

--
:  David Lee                                I.T. Service          :
:  Senior Systems Programmer                Computer Centre       :
:  UNIX Team Leader                         Durham University     :
:                                           South Road            :
:  http://www.dur.ac.uk/t.d.lee/            Durham DH1 3LE        :
:  Phone: +44 191 334 2752                  U.K.                  :
_______________________________________________________
Linux-HA-Dev mailing list:
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
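
On the recursion question in item 2, here is a minimal sketch of how a "shutdown already in progress" guard would bound the mutual calls. Everything in it is an assumption made for illustration: the `sketch_*` functions, the flag and the counters are hypothetical stand-ins and are not taken from heartbeat.c, and the pipe write is simulated so that it always fails with EAGAIN ("Resource temporarily unavailable").

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins; none of this is the real heartbeat code. */
static int shutdown_in_progress = 0; /* set once a shutdown has begun      */
static int send_attempts = 0;        /* number of simulated pipe writes    */
static int depth = 0;                /* current nesting of the send call   */
static int max_depth = 0;            /* deepest nesting seen so far        */

static void sketch_initiate_shutdown(void);

/* Simulated media-pipe write that always fails with EAGAIN. */
static int sketch_write_media_pipe(void)
{
    send_attempts++;
    errno = EAGAIN;
    return -1;
}

/* Try to send; on failure, report the error and ask for a shutdown. */
static void sketch_send_to_all_media(void)
{
    depth++;
    if (depth > max_depth) {
        max_depth = depth;
    }
    if (sketch_write_media_pipe() < 0) {
        fprintf(stderr, "ERROR: Cannot write to media pipe: %s\n",
                strerror(errno));
        sketch_initiate_shutdown();
    }
    assert(depth > 0); /* entries and exits must stay balanced */
    depth--;
}

/* Begin shutdown once; later calls return without sending again. */
static void sketch_initiate_shutdown(void)
{
    if (shutdown_in_progress) {
        fprintf(stderr, "debug: shutdown already in progress\n");
        return; /* the guard: this is what stops the recursion */
    }
    shutdown_in_progress = 1;
    /* Assumption: announcing the shutdown needs one more send. */
    sketch_send_to_all_media();
}
```

In this sketch the first failed send nests exactly two levels deep and then unwinds, and every later top-level call to `sketch_send_to_all_media()` fails once and hits the guard straight away. If the real code behaves similarly, the repeating four-line set might come from something re-invoking the send (a periodic timer, say) rather than from unbounded recursion, but that is a guess and would need checking against the source.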

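
On the "are the checks too tight?" question, the arithmetic in the quoted log lines can be sketched as follows. The assumption, inferred only from the log (a difference of 2 in the "started at" values is reported as 20 ms), is that the timestamps are clock ticks of 10 ms each. The names and constants are hypothetical and not taken from the Gmain code.

```c
/* Assumption: one tick is 10 ms, inferred from "2 ticks late = 20 ms". */
enum {
    SKETCH_MS_PER_TICK = 10,
    SKETCH_WARN_THRESHOLD_MS = 15 /* the "> 15 ms" limit in the log */
};

/* Delay in milliseconds between the scheduled and the actual start tick. */
static long sketch_delay_ms(long started, long should_have_started)
{
    return (started - should_have_started) * SKETCH_MS_PER_TICK;
}

/* Nonzero when the delay exceeds the warning threshold. */
static int sketch_should_warn(long delay_ms)
{
    return delay_ms > SKETCH_WARN_THRESHOLD_MS;
}
```

Under that assumption the measurable delays are multiples of 10 ms, so a 15 ms limit passes a one-tick delay and warns on anything of two ticks or more. On a slow machine a two-tick slip may be common enough to explain the steady stream of warnings without anything else being wrong.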