Although Linux-HA for Solaris works OK on simple configurations, I've
never yet had it pass BasicSanityCheck.

'heartbeat' is getting stuck (I think looping, judging by the rapidly
growing logfile).  Something (I know not what) is triggering it to
produce the "Cannot write to media pipe" error (heartbeat.c around line
1608).

1. It would be nice to know what causes that initial failure (but I
suspect you'll need more information, won't you... :-)

2. From that point, the log file starts growing rapidly:

   heartbeat[26575]: 2007/07/10_15:53:30 ERROR: Cannot write to media pipe 2: Resource temporarily unavailable
   heartbeat[26575]: 2007/07/10_15:53:30 ERROR: Shutting down.
   heartbeat[26575]: 2007/07/10_15:53:30 debug: hb_initiate_shutdown() called.

   heartbeat[26575]: 2007/07/10_15:53:30 ERROR: Cannot write to media pipe 2: Resource temporarily unavailable
   heartbeat[26575]: 2007/07/10_15:53:30 ERROR: Shutting down.
   heartbeat[26575]: 2007/07/10_15:53:30 debug: hb_initiate_shutdown() called.
   heartbeat[26575]: 2007/07/10_15:53:30 debug: hb_initiate_shutdown(): shutdown already in progress
   [...]

That second set, from "Cannot write" to "shutdown already in progress",
keeps recurring several times per second.

So are "send_to_all_media()" and "hb_initiate_shutdown()" recursing
without bound?  I suspect so, which would be a bug.
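
To make the suspicion concrete, here is a minimal sketch of one pattern
that would produce a log like the one above.  It is not the real
heartbeat code: demo_send(), demo_initiate_shutdown() and demo_run() are
hypothetical stand-ins, and the "quiet_after_shutdown" flag is my own
invention.  The sketch fills a non-blocking pipe that nobody reads, so
every later write() fails with EAGAIN ("Resource temporarily
unavailable"), and then counts how often the failure is reported.  In
this sketch the repetition comes from repeated send attempts, not from
true recursion; which of the two heartbeat is actually doing is exactly
the open question.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-ins for illustration; not the real heartbeat code. */
static int shutting_down;   /* set once shutdown has begun */
static int error_reports;   /* number of "cannot write" reports so far */

static void demo_initiate_shutdown(void)
{
    if (shutting_down) {
        fprintf(stderr, "demo: shutdown already in progress\n");
        return;
    }
    shutting_down = 1;
}

/* One send attempt on a non-blocking pipe.  If quiet_after_shutdown is
 * non-zero, failures are no longer reported once shutdown has begun. */
static int demo_send(int fd, const char *msg, int quiet_after_shutdown)
{
    if (write(fd, msg, strlen(msg)) >= 0)
        return 0;
    if (quiet_after_shutdown && shutting_down)
        return -1;
    error_reports++;
    fprintf(stderr, "demo: cannot write to pipe: %s\n", strerror(errno));
    demo_initiate_shutdown();
    return -1;
}

/* Fill a non-blocking pipe that nobody reads, then make `attempts`
 * further sends.  Returns the number of error reports, or -1 if the
 * pipe could not be set up. */
static int demo_run(int attempts, int quiet_after_shutdown)
{
    int fds[2], flags, i;
    char byte = 'x';

    shutting_down = 0;
    error_reports = 0;
    if (pipe(fds) < 0)
        return -1;
    flags = fcntl(fds[1], F_GETFL);
    if (flags < 0 || fcntl(fds[1], F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    /* Write single bytes until the kernel refuses: the pipe is now full. */
    while (write(fds[1], &byte, 1) == 1)
        ;
    for (i = 0; i < attempts; i++)
        demo_send(fds[1], "status", quiet_after_shutdown);
    close(fds[0]);
    close(fds[1]);
    return error_reports;
}
```

Calling demo_run(5, 0) reports the failure on every attempt, each one
followed by "shutdown already in progress" after the first; calling
demo_run(5, 1) reports it once.  If heartbeat's send path behaves like
the unguarded case, a check of that kind would at least stop the log
from growing, though it would not explain the initial failure.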


Possible side-issue (or related?).  Right from the moment it starts until
the above failure, there are messages of the form:

   heartbeat[26575]: 2007/07/10_15:53:07 WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 20 ms (> 15 ms) before being called (GSource: 0xe9e10)
   heartbeat[26575]: 2007/07/10_15:53:07 info: Gmain_timeout_dispatch: started at 33394964 should have started at 33394962
   heartbeat[26575]: 2007/07/10_15:53:08 WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 20 ms (> 15 ms) before being called (GSource: 0xe9b10)
   heartbeat[26575]: 2007/07/10_15:53:08 info: Gmain_timeout_dispatch: started at 33394971 should have started at 33394969
   heartbeat[26575]: 2007/07/10_15:53:08 WARN: Gmain_timeout_dispatch: Dispatch function for check for signals was delayed 20 ms (> 15 ms) before being called (GSource: 0xe9e10)
   heartbeat[26575]: 2007/07/10_15:53:08 info: Gmain_timeout_dispatch: started at 33394971 should have started at 33394969
   heartbeat[26575]: 2007/07/10_15:53:08 WARN: Gmain_timeout_dispatch: Dispatch function for send local status was delayed 20 ms (> 15 ms) before being called (GSource: 0xe9b10)

The functions and times vary.

This is an old, slow, small-memory machine.  Are the checks too tight?  Or
is something more fundamentally wrong?
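
One way to look at the "too tight?" question is to do the arithmetic on
a single log entry.  The sketch below assumes (the log does not say so)
that the "started at" and "should have started at" values are clock
ticks; the helper names are hypothetical.  Under that assumption, the
first entry (started 33394964, expected 33394962, delayed 20 ms) implies
a 10 ms tick, in which case any measurable delay is a multiple of 10 ms
and the smallest one that can exceed a 15 ms threshold is 20 ms.

```c
#include <stdio.h>

/* Milliseconds per clock tick implied by one log entry, or -1 if the
 * entry shows no positive tick difference.  Assumes the two "started"
 * values are tick counts. */
static long ms_per_tick(long started, long expected, long delay_ms)
{
    long ticks = started - expected;
    return ticks > 0 ? delay_ms / ticks : -1;
}

/* Smallest delay, in whole ticks, that is strictly greater than the
 * warning threshold. */
static long first_delay_over(long threshold_ms, long tick_ms)
{
    return (threshold_ms / tick_ms + 1) * tick_ms;
}

/* Print the figures for one log entry. */
static void report(long started, long expected, long delay_ms,
                   long threshold_ms)
{
    long tick = ms_per_tick(started, expected, delay_ms);
    if (tick <= 0) {
        printf("no usable tick difference in this entry\n");
        return;
    }
    printf("tick = %ld ms, first delay over %ld ms = %ld ms\n",
           tick, threshold_ms, first_delay_over(threshold_ms, tick));
}
```

For the first entry above, report(33394964, 33394962, 20, 15) gives a
10 ms tick and a first over-threshold delay of 20 ms.  If that reading
is right, a dispatch that runs just two ticks late already trips the
warning, which on a slow machine may say more about the threshold than
about heartbeat.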

(Usual excuses: heartbeat is a spare-time (not much of it) activity; that
time is mostly spent chasing aspects of portability rather than runtime.)


-- 

:  David Lee                                I.T. Service          :
:  Senior Systems Programmer                Computer Centre       :
:  UNIX Team Leader                         Durham University     :
:                                           South Road            :
:  http://www.dur.ac.uk/t.d.lee/            Durham DH1 3LE        :
:  Phone: +44 191 334 2752                  U.K.                  :