There is another way this failure can occur, although it is unlikely (it happened to me though).
gmond appears to do a reverse IP lookup of the UDP packets' source address to
generate the hostname in the XML. We had an error in the reverse DNS, and two
separate hosts in the cluster ended up having the same hostname. As soon as
the duplicate hostname was encountered (even though the IP differed), gmetad
tried to update the rrd with data from the same second, causing the failure
already described. So also check your XML for duplicate hostnames.

I fixed my DNS of course, but frankly I also just patched the RRD_update
function in "gmetad/rrd_helpers.c" to never return an error. Crude and wrong,
but it was a quick way to stop gmetad bombing on the rest of the data.

regards,
richard

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Ben Hartshorne
Sent: 25 January 2006 03:08
To: [email protected]
Subject: Re: [Ganglia-general] intermittent blanks in graphs

Everyone, thanks very much for your suggestions. I've replied to each below.

On Tue, Jan 24, 2006 at 04:16:08AM -0800, Martin Knoblauch wrote:
> just a thought - are your cluster nodes time-synched? Are they
> [still] in-synch?

To within a second or so. I also have several gmetrics that are running at a
2-min interval, and they exhibit the same behavior. I would be surprised to
see them reporting the same second, 2 minutes apart...

On Tue, Jan 24, 2006 at 07:45:31AM -0500, Woods, Jeff wrote:
> We had a similar problem a few weeks ago, except that our gmetad never
> seemed to recover. It was crashing, and had to be restarted manually
> almost daily. I enabled the debug output to syslog, but received no
> indication of what was failing -- it just quit!

Restarting the server doesn't seem to have any effect. :(

> At the time, we were in the process of consolidating our gmetad's to a
> single server (we have three clusters being monitored, and each had
> its own gmetad and web interface). Following the migration to the new
> server, the problem went away so we never followed up.
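[Editor's sketch, not part of the original thread: Richard's tip near the top
of this message -- check the XML for duplicate hostnames -- can be mechanized
against the XML that gmond serves. This assumes gmond answers on its default
TCP port 8649; the `printf` sample below stands in for `nc <gmond-host> 8649`.]

```shell
# Extract every HOST NAME="..." attribute from gmond's XML and report the
# names that appear more than once. Any output means two source IPs are
# resolving to the same hostname (the failure mode described above).
printf '%s\n' \
  '<HOST NAME="web1" IP="10.0.0.1">' \
  '<HOST NAME="db1" IP="10.0.0.2">' \
  '<HOST NAME="web1" IP="10.0.0.3">' \
  | grep -o 'HOST NAME="[^"]*"' | sort | uniq -d
# prints: HOST NAME="web1"
```

In practice you would replace the `printf` with `nc <gmond-host> 8649` (or
`telnet`) and expect no output at all from a healthy cluster.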
I intend to migrate to a new server soon as well... Of course, that's one of
those projects that's going to happen Real Soon Now(tm). I'm worried though,
because I realized today that a second instance of ganglia I've got running
on a completely separate network is also showing these symptoms. Different
hardware, different network, different switches, different load, same OS
(mostly; Fedora Core 3/4).

> The gmetad we had problems with worked reliably for nearly a year
> before having the problems. Once the problem started, it occurred
> reliably (nearly every night). I could reenable the interface if it
> might help to resolve a bigger problem.

Thanks for the offer, but I'll do some more poking before putting you to
that trouble. It's just such a weird problem...

On Tue, Jan 24, 2006 at 04:46:50PM -0500, Rick Mohr wrote:
> Also, you could use rrdtool to generate the exact same graph that is shown
> on the web page for one of these metrics and dump it straight into a file.
> Then you could compare that with the image seen on the web page (to check
> for the unlikely event that the generated image is fine, but the web
> server is messing something up).

Hmm... that's a good suggestion.
Here's an excerpt from 'rrdtool dump':

<!-- 2006-01-24 17:36:45 PST / 1138153005 --> <row><v> 9.3154666667e+00 </v></row>
<!-- 2006-01-24 17:37:00 PST / 1138153020 --> <row><v> 8.8000000000e+00 </v></row>
<!-- 2006-01-24 17:37:15 PST / 1138153035 --> <row><v> 8.8000000000e+00 </v></row>
<!-- 2006-01-24 17:37:30 PST / 1138153050 --> <row><v> 8.8000000000e+00 </v></row>
<!-- 2006-01-24 17:37:45 PST / 1138153065 --> <row><v> 8.8000000000e+00 </v></row>
<!-- 2006-01-24 17:38:00 PST / 1138153080 --> <row><v> NaN </v></row>
<!-- 2006-01-24 17:38:15 PST / 1138153095 --> <row><v> NaN </v></row>
<!-- 2006-01-24 17:38:30 PST / 1138153110 --> <row><v> NaN </v></row>
<!-- 2006-01-24 17:38:45 PST / 1138153125 --> <row><v> NaN </v></row>
<!-- 2006-01-24 17:39:00 PST / 1138153140 --> <row><v> NaN </v></row>

Correspondingly, in the graph seen through ganglia, the data ends at about
17:38. I'm surprised it's registering these things every 15 seconds! I
thought the period was slower than that (every minute). I checked a few
other rrds at different resolutions, and the NaN sections do correspond to
the blank parts.

So what does it mean? This tells us that the data is not getting put into
the rrds. We know that the values are getting to the collector host, because
clicking on the 'gmetric' portion of the website shows current data. But
that data is not making it into the RRD somehow...

I thought maybe the RRDs had become corrupted somehow, so I tried moving the
rrds out of place so ganglia would recreate them all. The symptom was still
in evidence.

On Tue, Jan 24, 2006 at 01:56:08PM -0800, steven wagner wrote:
> Running gmetad in the foreground with a very high debug level may offer
> additional clues. Also, keep an eye on the modification times on the
> RRD files that are gapping.

I can't see anything too interesting running gmetad in the foreground with
debugging set to '9'. :( The modification times of the rrd files seem to be
current.
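[Editor's sketch, not part of the original thread: the NaN rows in a dump can
be pulled out mechanically, which makes it easier to line gaps up against the
blank regions of the graphs. This assumes the layout shown in the excerpt
above, where each <row> shares a line with a "<!-- date time zone / epoch -->"
comment; the here-document sample stands in for a real `rrdtool dump file.rrd`.]

```shell
# Print the wall-clock time of every empty (NaN) row in an rrdtool dump.
# In practice the input would be:  rrdtool dump /var/lib/ganglia/rrds/.../metric.rrd
awk '/<row>/ && /NaN/ { print $2, $3 }' <<'EOF'
<!-- 2006-01-24 17:37:45 PST / 1138153065 --> <row><v> 8.8000000000e+00 </v></row>
<!-- 2006-01-24 17:38:00 PST / 1138153080 --> <row><v> NaN </v></row>
EOF
# prints: 2006-01-24 17:38:00
```

Run against a gapping .rrd, the printed times should match the blank spans
seen in the web graphs.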
This matches the rrd dump showing 'NaN' in all those fields instead of
something unmodified.

On Tue, Jan 24, 2006 at 05:06:48PM -0500, Jason A. Smith wrote:
> I have seen gaps sometimes. They almost always happen when gmetad
> gets data from a cluster that has the same exact timestamp as its last
> update. Look in your system logs for gmetad errors like:
>
> /usr/sbin/gmetad[7695]: RRD_update (/var/lib/ganglia/rrds/Cluster
> Name/hostname/metric_name.rrd): illegal attempt to update using time
> 1138138243 when last update time is 1138138243 (minimum one second
> step)

I don't see that error message, but while looking for it, I did see this one:

Jan 24 17:24:18 localhost /usr/sbin/gmetad[30443]: RRD_update
(/var/lib/ganglia/rrds/production/raiden-8-db1/users.rrd): conversion of
'min,' to float not complete: tail 'min,'

This seems to relate to a recent change I made that I had forgotten about. :)
I added the following line to my crontab:

*/2 * * * * /usr/bin/gmetric --name="users" --value=`w | head -1 | awk '{print $6}'` --type=int16

The purpose of this line is to create a graph representing the number of
users logged in to the host. It seems right to me - do any of you see a
problem with this line?

In the course of this investigation, I have come across another strange
happening. Some of the metrics seem to be ... off. I have no idea if these
things are related. I was surprised to notice that many of my servers show
excessive time in the cpu_report graph as having all their time spent in CPU
Wait. That didn't seem right and also didn't jibe with the output of vmstat.
Looking at the individual metrics that make up the cpu_report, I see:

* cpu_aidle: 1388
* cpu_idle: 66.00
* cpu_nice: 0.00
* cpu_system: 2.30
* cpu_user: 31.70
* cpu_wio: 1388

All six of these metrics are supposed to be percentages. What's up with
1,388? Both cpu_aidle and cpu_wio are linearly decreasing graphs with the
same slope (and same current value).
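[Editor's sketch, not part of the original thread: one plausible explanation
for the "tail 'min,'" error above is that w's header line does not have fixed
field positions. When the uptime happens to include a minutes component (e.g.
"up 5 days, 1 min"), field 6 is no longer the user count, which would feed
gmetric exactly the string gmetad choked on.]

```shell
# Two possible w/uptime header formats -- the field positions shift:
long=' 17:24:18 up 10 days, 22:15,  3 users,  load average: 0.01, 0.03, 0.00'
short=' 17:24:18 up 5 days,  1 min,  3 users,  load average: 0.01, 0.03, 0.00'

echo "$long"  | awk '{print $6}'   # prints: 3     (what the crontab expects)
echo "$short" | awk '{print $6}'   # prints: min,  (what gmetad choked on)

# A sturdier value for the cron job: count login sessions directly instead
# of parsing the header, e.g.
#   */2 * * * * /usr/bin/gmetric --name="users" --value=`who | wc -l` --type=int16
who | wc -l
```

`who | wc -l` sidesteps the formatting problem entirely, at the cost of
counting sessions rather than parsing w's summary line.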
They look to be the same back into the shown history, but it's hard to be
exact. This seems to be the case (with different current values) on a number
of hosts. Two .pngs of hosts exhibiting this behavior are at
http://cryptio.net/~ben/ganglia/host_report.png and
http://cryptio.net/~ben/ganglia/host_report2.png

Note that these stats have all been created since I moved the old files out
of place earlier today, so there is no chance of leftover corruption. Are my
hosts dying? Restarting gmond on the host seems to have no effect.

Would it be possible to create this kind of error by upgrading the server to
gmetad 3.0.2 but leaving the clients running gmond 3.0.1? Yes, that seems to
be the case: upgrading the reporting host to 3.0.2 fixes the strange cpu
report symptom. It's unfortunate that gmond and gmetad are not compatible
across a minor version like that. :(

OK, I think that's enough for this post. Again, thanks for all your help and
insight.

-ben

--
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net

