There is another way this failure can occur, although it is unlikely (it happened to me though).
gmond appears to do a reverse IP lookup of the UDP packets' source address to
generate the hostname in the XML. We had an error in the reverse DNS, and two
separate hosts in the cluster ended up having the same hostname. As soon as
the duplicate hostname was encountered (even though the IP differed), gmetad
tried to update the rrd with data from the same second, causing the failure
already described. So also check your XML for duplicate hostnames.

I fixed my DNS of course, but frankly I also just patched the RRD_update
function in "gmetad/rrd_helpers.c" to never return an error. Crude and wrong,
but it was a quick way to stop gmetad bombing on the rest of the data.

regards,
richard

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Ben Hartshorne
Sent: 25 January 2006 03:08
To: [email protected]
Subject: Re: [Ganglia-general] intermittent blanks in graphs

Everyone, thanks very much for your suggestions. I've replied to each below.

On Tue, Jan 24, 2006 at 04:16:08AM -0800, Martin Knoblauch wrote:
> just a thought - are your cluster nodes time-synched? Are they
> [still] in-synch?

To within a second or so. I also have several gmetrics that are running at a
2-min interval, and they exhibit the same behavior. I would be surprised to
see them reporting the same second, 2 minutes apart...

On Tue, Jan 24, 2006 at 07:45:31AM -0500, Woods, Jeff wrote:
> We had a similar problem a few weeks ago, except that our gmetad never
> seemed to recover. It was crashing, and had to be restarted manually
> almost daily. I enabled the debug output to syslog, but received no
> indication of what was failing -- it just quit!

Restarting the server doesn't seem to have any effect. :(

> At the time, we were in the process of consolidating our gmetad's to a
> single server (we have three clusters being monitored, and each had
> its own gmetad and web interface). Following the migration to the new
> server, the problem went away so we never followed up.
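[Editor's sketch, not part of the original thread: Richard's tip near the top
of this message -- check the XML for duplicate hostnames -- can be mechanized
against the XML that gmond serves. This assumes gmond answers on its default
TCP port 8649; the `printf` sample below stands in for `nc <gmond-host> 8649`.]

```shell
# Extract every HOST NAME="..." attribute from gmond's XML and report the
# names that appear more than once. Any output means two source IPs are
# resolving to the same hostname (the failure mode described above).
printf '%s\n' \
  '<HOST NAME="web1" IP="10.0.0.1">' \
  '<HOST NAME="db1" IP="10.0.0.2">' \
  '<HOST NAME="web1" IP="10.0.0.3">' \
  | grep -o 'HOST NAME="[^"]*"' | sort | uniq -d
# prints: HOST NAME="web1"
```

In practice you would replace the `printf` with `nc <gmond-host> 8649` (or
`telnet`) and expect no output at all from a healthy cluster.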
I intend to migrate to a new server soon as well... Of course, that's one of
those projects that's going to happen Real Soon Now(tm). I'm worried though,
because I realized today that a second instance of ganglia I've got running
on a completely separate network is also showing these symptoms. Different
hardware, different network, different switches, different load, same OS
(mostly; Fedora Core 3/4).

> The gmetad we had problems with worked reliably for nearly a year
> before having the problems. Once the problem started, it occurred
> reliably (nearly every night). I could reenable the interface if it
> might help to resolve a bigger problem.

Thanks for the offer, but I'll do some more poking before putting you to
that trouble. It's just such a weird problem...

On Tue, Jan 24, 2006 at 04:46:50PM -0500, Rick Mohr wrote:
> Also, you could use rrdtool to generate the exact same graph that is shown
> on the web page for one of these metrics and dump it straight into a file.
> Then you could compare that with the image seen on the web page (to check
> for the unlikely event that the generated image is fine, but the web
> server is messing something up).

Hmm... that's a good suggestion.
Here's an excerpt from 'rrdtool dump':

<!-- 2006-01-24 17:36:45 PST / 1138153005 --> <row><v> 9.3154666667e+00 </v></row>
<!-- 2006-01-24 17:37:00 PST / 1138153020 --> <row><v> 8.8000000000e+00 </v></row>
<!-- 2006-01-24 17:37:15 PST / 1138153035 --> <row><v> 8.8000000000e+00 </v></row>
<!-- 2006-01-24 17:37:30 PST / 1138153050 --> <row><v> 8.8000000000e+00 </v></row>
<!-- 2006-01-24 17:37:45 PST / 1138153065 --> <row><v> 8.8000000000e+00 </v></row>
<!-- 2006-01-24 17:38:00 PST / 1138153080 --> <row><v> NaN </v></row>
<!-- 2006-01-24 17:38:15 PST / 1138153095 --> <row><v> NaN </v></row>
<!-- 2006-01-24 17:38:30 PST / 1138153110 --> <row><v> NaN </v></row>
<!-- 2006-01-24 17:38:45 PST / 1138153125 --> <row><v> NaN </v></row>
<!-- 2006-01-24 17:39:00 PST / 1138153140 --> <row><v> NaN </v></row>

Correspondingly, in the graph seen through ganglia, the data ends at about
17:38. I'm surprised it's registering these things every 15 seconds! I
thought the period was slower than that (every minute). I checked a few
other rrds at different resolutions, and the NaN sections do correspond to
the blank parts.

So what does it mean? This tells us that the data is not getting put into
the rrds. We know that the values are getting to the collector host, because
clicking on the 'gmetric' portion of the website shows current data. But
that data is not making it into the RRD somehow...

I thought maybe the RRDs had become corrupted somehow, so I tried moving the
rrds out of place so ganglia would recreate them all. The symptom was still
in evidence.

On Tue, Jan 24, 2006 at 01:56:08PM -0800, steven wagner wrote:
> Running gmetad in the foreground with a very high debug level may offer
> additional clues. Also, keep an eye on the modification times on the
> RRD files that are gapping.

I can't see anything too interesting running gmetad in the foreground with
debugging set to '9'. :( The modification times of the rrd files seem to be
current.
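[Editor's sketch, not part of the original thread: the NaN rows in a dump can
be pulled out mechanically, which makes it easier to line gaps up against the
blank regions of the graphs. This assumes the layout shown in the excerpt
above, where each <row> shares a line with a "<!-- date time zone / epoch -->"
comment; the here-document sample stands in for a real `rrdtool dump file.rrd`.]

```shell
# Print the wall-clock time of every empty (NaN) row in an rrdtool dump.
# In practice the input would be:  rrdtool dump /var/lib/ganglia/rrds/.../metric.rrd
awk '/<row>/ && /NaN/ { print $2, $3 }' <<'EOF'
<!-- 2006-01-24 17:37:45 PST / 1138153065 --> <row><v> 8.8000000000e+00 </v></row>
<!-- 2006-01-24 17:38:00 PST / 1138153080 --> <row><v> NaN </v></row>
EOF
# prints: 2006-01-24 17:38:00
```

Run against a gapping .rrd, the printed times should match the blank spans
seen in the web graphs.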
This matches the rrd dump showing 'NaN' in all those fields instead of
something unmodified.

On Tue, Jan 24, 2006 at 05:06:48PM -0500, Jason A. Smith wrote:
> I have seen gaps sometimes. They almost always happen when gmetad
> gets data from a cluster that has the same exact timestamp as its last
> update. Look in your system logs for gmetad errors like:
>
> /usr/sbin/gmetad[7695]: RRD_update (/var/lib/ganglia/rrds/Cluster
> Name/hostname/metric_name.rrd): illegal attempt to update using time
> 1138138243 when last update time is 1138138243 (minimum one second
> step)

I don't see that error message, but while looking for it, I did see this one:

Jan 24 17:24:18 localhost /usr/sbin/gmetad[30443]: RRD_update
(/var/lib/ganglia/rrds/production/raiden-8-db1/users.rrd): conversion of
'min,' to float not complete: tail 'min,'

This seems to relate to a recent change I made that I had forgotten about. :)
I added the following line to my crontab:

*/2 * * * * /usr/bin/gmetric --name="users" --value=`w | head -1 | awk '{print $6}'` --type=int16

The purpose of this line is to create a graph representing the number of
users logged in to the host. It seems right to me - do any of you see a
problem with this line?

In the course of this investigation, I have come across another strange
happening. Some of the metrics seem to be ... off. I have no idea if these
things are related. I was surprised to notice that many of my servers show
excessive time in the cpu_report graph as having all their time spent in CPU
Wait. That didn't seem right and also didn't jibe with the output of vmstat.
Looking at the individual metrics that make up the cpu_report, I see:

* cpu_aidle: 1388
* cpu_idle: 66.00
* cpu_nice: 0.00
* cpu_system: 2.30
* cpu_user: 31.70
* cpu_wio: 1388

All six of these metrics are supposed to be percentages. What's up with
1,388? Both cpu_aidle and cpu_wio are linearly decreasing graphs with the
same slope (and same current value).
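[Editor's sketch, not part of the original thread: one plausible explanation
for the "tail 'min,'" error above is that w's header line does not have fixed
field positions. When the uptime happens to include a minutes component (e.g.
"up 5 days, 1 min"), field 6 is no longer the user count, which would feed
gmetric exactly the string gmetad choked on.]

```shell
# Two possible w/uptime header formats -- the field positions shift:
long=' 17:24:18 up 10 days, 22:15,  3 users,  load average: 0.01, 0.03, 0.00'
short=' 17:24:18 up 5 days,  1 min,  3 users,  load average: 0.01, 0.03, 0.00'

echo "$long"  | awk '{print $6}'   # prints: 3     (what the crontab expects)
echo "$short" | awk '{print $6}'   # prints: min,  (what gmetad choked on)

# A sturdier value for the cron job: count login sessions directly instead
# of parsing the header, e.g.
#   */2 * * * * /usr/bin/gmetric --name="users" --value=`who | wc -l` --type=int16
who | wc -l
```

`who | wc -l` sidesteps the formatting problem entirely, at the cost of
counting sessions rather than parsing w's summary line.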
They look to be the same back into the shown history, but it's hard to be
exact. This seems to be the case (with different current values) on a number
of hosts. Two .pngs of hosts exhibiting this behavior are at
http://cryptio.net/~ben/ganglia/host_report.png and
http://cryptio.net/~ben/ganglia/host_report2.png

Note that these stats have all been created since I moved the old files out
of place earlier today, so there is no chance of leftover corruption. Are my
hosts dying? Restarting gmond on the host seems to have no effect.

Would it be possible to create this kind of error by upgrading the server to
gmetad 3.0.2 but leaving the clients running gmond 3.0.1? Yes, that seems to
be the case: upgrading the reporting host to 3.0.2 fixes the strange cpu
report symptom. It's unfortunate that gmond and gmetad are not compatible
across a minor version like that. :(

OK, I think that's enough for this post. Again, thanks for all your help and
insight.

-ben

--
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net

