On a new cluster we are building right now, I moved
from Ganglia 3.6.1 to 3.7.2. 3.6.1 has been rock-solid on
previous clusters. After the 3.7.2 gmond has been up for a short
period of time, it begins emitting the error message:
    "Incorrect format for spoof argument"
If I enable debugging (e.g. -d 4), gmond prints the parsed contents
of the spoof string, and they are non-empty garbage strings.
Tracing in gdb with a breakpoint at that error message shows that
the metric_id passed to the function has a non-zero .spoof and a
garbage string in .host.
In one trace, the .host was an empty string (""); the code
in Ganglia_host_get() assumes that if .spoof is non-zero, then
.host is non-null and a string with length > 0. So the
subsequent code:
    spoof_info_len = strlen(metric_id->host);
    buff = malloc(spoof_info_len+1);
    strncpy(buff, metric_id->host, spoof_info_len + 1);
    spoofIP = buff;
    if( !(spoofName = strchr(buff+1,':')) ){
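can read past the end of the buffer when .host is a zero-length
string: buff is then a single byte holding only the terminator,
and strchr() starts scanning at buff+1, one byte beyond the
allocation.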
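
To make the failure mode concrete, here is a minimal sketch of a
parse that rejects NULL and empty strings before doing any pointer
arithmetic. It is not the Ganglia code: split_spoof() and
split_spoof_demo() are hypothetical names, and treating an empty
name after the colon as malformed is my own assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of Ganglia: splits "ip:name" into two
 * heap-allocated strings. Returns 0 on success, -1 on malformed input. */
static int split_spoof(const char *host, char **ip_out, char **name_out)
{
    const char *colon;
    size_t ip_len;

    *ip_out = NULL;
    *name_out = NULL;

    /* Reject NULL and "" first, so host+1 never points past the buffer. */
    if (host == NULL || host[0] == '\0')
        return -1;

    /* Search from host+1, as the original does: the IP part must be
     * at least one character long. */
    colon = strchr(host + 1, ':');
    if (colon == NULL || colon[1] == '\0')   /* assumption: empty name is an error */
        return -1;

    ip_len = (size_t)(colon - host);
    *ip_out = malloc(ip_len + 1);
    *name_out = malloc(strlen(colon + 1) + 1);
    if (*ip_out == NULL || *name_out == NULL) {
        free(*ip_out);
        free(*name_out);
        *ip_out = *name_out = NULL;
        return -1;
    }

    memcpy(*ip_out, host, ip_len);
    (*ip_out)[ip_len] = '\0';
    strcpy(*name_out, colon + 1);
    return 0;
}

/* Small self-check showing the intended behavior. */
static void split_spoof_demo(void)
{
    char *ip, *name;

    assert(split_spoof("10.0.0.1:node01", &ip, &name) == 0);
    assert(strcmp(ip, "10.0.0.1") == 0 && strcmp(name, "node01") == 0);
    free(ip);
    free(name);

    assert(split_spoof("", &ip, &name) == -1);   /* the case seen in gdb */
}
```

The point is only the ordering of the checks: the length test comes
before anything touches buff+1.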
To isolate possible reasons for the botched spoofing hostname,
I compared the gmond/gmond.c source between 3.6.1 and 3.7.2.
In Ganglia_collection_group_send() the following code
    name = cb->msg.Ganglia_value_msg_u.gstr.metric_id.name;
    if (override_hostname != NULL)
      {
        cb->msg.Ganglia_value_msg_u.gstr.metric_id.host =
            apr_pstrcat(gm_pool,
                        (char *)( override_ip != NULL ? override_ip : override_hostname ),
                        ":", (char *) override_hostname, NULL);
        cb->msg.Ganglia_value_msg_u.gstr.metric_id.spoof = TRUE;
      }
allocates the callback's .host field from the temporary metrics
APR pool, but the callback is external to this function and lives
on beyond the destruction of that pool. Eventually the memory
behind cb->msg.Ganglia_value_msg_u.gstr.metric_id.host is reused
and overwritten, yielding the "garbage string" condition being
observed. In 3.6.1, the .host field was allocated from
global_context. When I modify the code cited above to use
global_context rather than gm_pool, gmond runs without throwing
"Incorrect format for spoof argument" errors.
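
The lifetime problem can be reproduced in miniature without APR.
The sketch below is an illustration only: toy_pool, toy_strdup(),
toy_clear() and host_survives() are hypothetical stand-ins, and the
toy "destroy" merely rewinds a static buffer so that reading the
stale pointer stays well-defined. With a real APR pool the same
read would be undefined behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy bump allocator standing in for an APR pool (illustration only). */
typedef struct {
    char   buf[64];
    size_t used;
} toy_pool;

/* Copy s into the pool; returns NULL if the pool is full. */
static char *toy_strdup(toy_pool *p, const char *s)
{
    size_t n = strlen(s) + 1;
    char *dst;

    if (n > sizeof p->buf - p->used)
        return NULL;
    dst = p->buf + p->used;
    memcpy(dst, s, n);
    p->used += n;
    return dst;
}

/* Rewind the pool so its memory is handed out again. */
static void toy_clear(toy_pool *p)
{
    p->used = 0;
}

static toy_pool temp_pool;    /* plays the role of the temporary pool */
static toy_pool global_pool;  /* plays the role of the long-lived pool */

/* Returns 1 if a host string allocated from `owner` is still intact
 * after the temporary pool has been cleared and reused, 0 otherwise. */
static int host_survives(toy_pool *owner)
{
    const char *host;

    toy_clear(&temp_pool);
    toy_clear(&global_pool);

    host = toy_strdup(owner, "10.0.0.1:node01");  /* long-lived reference */
    toy_clear(&temp_pool);                         /* end of the pass */
    (void)toy_strdup(&temp_pool, "unrelated metric data"); /* memory reused */

    return strcmp(host, "10.0.0.1:node01") == 0;
}

/* Small self-check showing both outcomes. */
static void pool_lifetime_demo(void)
{
    assert(host_survives(&temp_pool) == 0);    /* overwritten: "garbage" */
    assert(host_survives(&global_pool) == 1);  /* intact */
}
```

The string allocated from the short-lived pool is overwritten as
soon as that pool is reused, while the one allocated from the
long-lived pool is not, which matches the behavior I see when
switching between gm_pool and global_context.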
Also, in lib/libgmond.c the static global "myhost"

    static char myhost[APRMAXHOSTLEN+1];

is assumed by the rest of the code to have been initialized by
the compiler to be a zero-length string:

    apr_gethostname( (char*)myhost, APRMAXHOSTLEN+1, gm_pool);

Probably best to be explicit about the initial value of myhost
and not assume an initial value?

    static char myhost[APRMAXHOSTLEN+1] = "";
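
For what it's worth, ISO C zero-initializes objects with static
storage duration that have no initializer, so the two declarations
should behave identically and the explicit form mainly documents
intent. A minimal sketch of that equivalence follows; HOSTLEN is a
stand-in for APRMAXHOSTLEN with an assumed value, and the array and
function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define HOSTLEN 256  /* stand-in for APRMAXHOSTLEN; the value is an assumption */

static char implicit_host[HOSTLEN + 1];       /* zero-filled per the C standard */
static char explicit_host[HOSTLEN + 1] = "";  /* same contents, intent spelled out */

/* Returns 1 if both buffers start out as zero-length strings. */
static int both_start_empty(void)
{
    return strlen(implicit_host) == 0 && strlen(explicit_host) == 0;
}

/* Small self-check. */
static void static_init_demo(void)
{
    assert(both_start_empty());
}
```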
Happy to contribute patch files, etc.
::::::::::::::::::::::::::::::::::::::::::::::::::::::
Jeffrey T. Frey, Ph.D.
Systems Programmer V / HPC Management
Network & Systems Services / College of Engineering
University of Delaware, Newark DE 19716
Office: (302) 831-6034 Mobile: (302) 419-4976
::::::::::::::::::::::::::::::::::::::::::::::::::::::