On Thu, 6 Apr 2006, Alan Robertson wrote:
> Andrew Beekhof wrote:
> > On 3/8/06, Joachim Banzhaf (compuserve) <[EMAIL PROTECTED]> wrote:
> >> On Tuesday, 7 March 2006 18:08, Andrew Beekhof wrote:
> >>> On 3/1/06, David Lee <[EMAIL PROTECTED]> wrote:
> >>>> On Wed, 1 Mar 2006, mkinikoglu wrote:
> >>>>> i setup linux-ha to two solaris boxes. (5.9 sparc). when i start
> >>>>> heartbeat i got these errors,
> >>>>> what does it mean return code 139?
> >>>> The meanings of such code, and the use of "crmd" are not my particular
> >>>> area.
> >>> Even with the crmd being my area, 139 still doesn't mean anything to me.
> >>> Were there any logs from the crmd and/or cib?
> >> I guess crmd received signal 11 (139 - 128).
> >
> > In which case there should definitely be a core file... but Heartbeat
> > will normally indicate that. Odd.
>
> This is Solaris. Maybe core files are disabled on his machine?
(Resurrecting a thread from three weeks ago)
The original user's problem (or something very like it) has now also
occurred for me, so I've taken a deeper look.
(BTW: yes, it did drop a core file... nice.)
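That core file fits Joachim's guess above. Under the usual shell
convention, a return code above 128 means "killed by signal (code - 128)",
so 139 would be signal 11, which is SIGSEGV on both Solaris and Linux.
I am assuming heartbeat reports the code using that convention. A small
sketch of the arithmetic (the helper names are mine, purely illustrative):

```c
#include <signal.h>

/* Hypothetical helper: if 'code' follows the 128 + N convention,
 * return the signal number N; otherwise return 0. */
static int
signal_from_return_code(int code)
{
    return (code > 128 && code < 256) ? code - 128 : 0;
}

/* Hypothetical helper: non-zero if 'code' looks like death by SIGSEGV. */
static int
return_code_is_segv(int code)
{
    return signal_from_return_code(code) == SIGSEGV;
}
```

With these, signal_from_return_code(139) gives 11.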
The gdb traceback is:
------------------------------
#0 0xfefb44e4 in strlen () from /usr/lib/libc.so.1
(gdb) where
#0 0xfefb44e4 in strlen () from /usr/lib/libc.so.1
#1 0xff006c30 in _doprnt () from /usr/lib/libc.so.1
#2 0xff008ca0 in vsnprintf () from /usr/lib/libc.so.1
#3 0xff36f954 in cl_log (priority=7,
fmt=0x1d848 "recv msg %s from %s, status:%s")
at ../../../lib/clplumbing/cl_log.c:584
#4 0x00013028 in ccm_control_process (info=0x3aaa0, hb=0x32b70)
at ../../../membership/ccm/ccm.c:133
#5 0xff3697e0 in G_CH_dispatch_int (source=0x371e8, callback=0, user_data=0x0)
at ../../../lib/clplumbing/GSource.c:610
#6 0xff244220 in g_main_dispatch () from /opt/csw/lib/libglib-2.0.so.0
#7 0xff245ad8 in g_main_context_dispatch () from /opt/csw/lib/libglib-2.0.so.0
#8 0xff246150 in g_main_context_iterate () from /opt/csw/lib/libglib-2.0.so.0
#9 0xff246ac8 in g_main_loop_run () from /opt/csw/lib/libglib-2.0.so.0
#10 0x00015d14 in main (argc=1, argv=0xffbff9bc)
at ../../../membership/ccm/ccmmain.c:287
(gdb)
------------------------------
In "membership/ccm/ccm.c" (gdb frame #4 above) the code is:
------------------------------
type = ha_msg_value(msg, F_TYPE);
orig = ha_msg_value(msg, F_ORIG);
status = ha_msg_value(msg, F_STATUS);
ccm_debug(LOG_DEBUG, "recv msg %s from %s, status:%s"
, type, orig, status);
------------------------------
Looking at the values:
------------------------------
(gdb) print type
$3 = 0x39da8 "resource"
(gdb) print orig
$4 = 0x36928 "shiel"
(gdb) print status
$5 = 0x0
(gdb)
------------------------------
So that's the problem: calling a "printf"-like routine with a null pointer
(variable "status") for a "%s" value. In C, passing a null pointer for "%s"
is undefined behaviour; here it crashes in "strlen()" (frame #0).
(Now it may be that some OS implementations of "vsnprintf()" etc. try to
be "helpful" and to tolerate this, but this simply masks a lurking
portability problem.)
A quick fix would be to adjust the calling code in "membership/ccm/ccm.c"
to convert a null pointer into a pointer to an empty string before logging
it. But is this the best solution?
Should "status" ever be a null pointer?
What other occurrences may lurk?
Etc.
Advice welcome!
--
: David Lee I.T. Service :
: Senior Systems Programmer Computer Centre :
: Durham University :
: http://www.dur.ac.uk/t.d.lee/ South Road :
: Durham DH1 3LE :
: Phone: +44 191 334 2752 U.K. :
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/