Hi Ben, see below. In any case, could you please open a case in Bugzilla and assign it to me?
Martin

--- Ben Hartshorne <[EMAIL PROTECTED]> wrote:

> Everyone,
>
> Thanks very much for your suggestions. I've replied to each below.
>
> On Tue, Jan 24, 2006 at 04:16:08AM -0800, Martin Knoblauch wrote:
> > just a thought - are your cluster nodes time-synched? Are they
> > [still] in-synch?
>
> To within a second or so. I also have several gmetrics that are
> running at a 2-min interval, and they exhibit the same behavior. I
> would be surprised to see them reporting the same second, 2 minutes
> apart...

OK. That seems clean.

[snip]

> On Tue, Jan 24, 2006 at 04:46:50PM -0500, Rick Mohr wrote:
> > Also, you could use rrdtool to generate the exact same graph that
> > is shown on the web page for one of these metrics and dump it
> > straight into a file. Then you could compare that with the image
> > seen on the web page (to check for the unlikely event that the
> > generated image is fine, but the web server is messing something
> > up).
>
> Hmm... that's a good suggestion.
>
> Here's an excerpt from 'rrdtool dump':
>
> <!-- 2006-01-24 17:36:45 PST / 1138153005 --> <row><v> 9.3154666667e+00 </v></row>
> <!-- 2006-01-24 17:37:00 PST / 1138153020 --> <row><v> 8.8000000000e+00 </v></row>
> <!-- 2006-01-24 17:37:15 PST / 1138153035 --> <row><v> 8.8000000000e+00 </v></row>
> <!-- 2006-01-24 17:37:30 PST / 1138153050 --> <row><v> 8.8000000000e+00 </v></row>
> <!-- 2006-01-24 17:37:45 PST / 1138153065 --> <row><v> 8.8000000000e+00 </v></row>
> <!-- 2006-01-24 17:38:00 PST / 1138153080 --> <row><v> NaN </v></row>
> <!-- 2006-01-24 17:38:15 PST / 1138153095 --> <row><v> NaN </v></row>
> <!-- 2006-01-24 17:38:30 PST / 1138153110 --> <row><v> NaN </v></row>
> <!-- 2006-01-24 17:38:45 PST / 1138153125 --> <row><v> NaN </v></row>
> <!-- 2006-01-24 17:39:00 PST / 1138153140 --> <row><v> NaN </v></row>
>
> Correspondingly, in the graph seen through ganglia, the data ends
> about 17:38. I'm surprised it's registering these things every 15
> seconds!
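That real-values-then-NaN pattern can be reproduced on a throwaway RRD, away from the production files, which makes it easy to see what a healthy vs. starved archive looks like. A minimal sketch, assuming rrdtool is installed; the 15 s step and the DS name "sum" mirror what gmetad normally writes, but treat all parameters here as illustrative:

```shell
#!/bin/sh
# Create a scratch RRD with a 15-second step (the interval seen in the
# dump above), feed it a single value, and dump it.  Every slot that
# never received an update shows up as a NaN row -- the same pattern
# as in the excerpt from users.rrd.
rrd=/tmp/probe.rrd
rrdtool create "$rrd" --step 15 \
    DS:sum:GAUGE:120:U:U \
    RRA:AVERAGE:0.5:1:240
rrdtool update "$rrd" N:8.8
# Count the empty (NaN) slots; a freshly created archive is mostly NaN.
rrdtool dump "$rrd" | grep -c NaN
```

So NaN rows by themselves are normal for slots that were never written; what matters in Ben's dump is that they begin at 17:38 and never stop, i.e. gmetad stopped calling RRD_update.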
> I thought the period was slower than that (every minute).
>
> I checked a few other rrds at different resolutions, and the NaN
> sections do correspond to the blank parts.
>
> So what does it mean? This tells us that the data is not getting put
> into the rrds. We know that the values are getting to the collector
> host, because clicking on the 'gmetric' portion of the website shows
> current data. But that data is not making it into the RRD somehow...
>
> I thought maybe the RRDs had become corrupted somehow, so I tried
> moving the rrds out of place so ganglia would recreate them all. The
> symptom was still in evidence.
>
> I don't see that error message, but while looking for it, I did see
> this error message:
>
> Jan 24 17:24:18 localhost /usr/sbin/gmetad[30443]: RRD_update
> (/var/lib/ganglia/rrds/production/raiden-8-db1/users.rrd): conversion
> of 'min,' to float not complete: tail 'min,'
>
> This seems to relate to a recent change I made that I had forgotten
> about. :) I added the following line to my crontab:
>
> */2 * * * * /usr/bin/gmetric --name="users" --value=`w | head -1 | awk '{print $6}'` --type=int16
>
> The purpose of this line is to create a graph representing the number
> of users logged in to the host. It seems right to me - do any of you
> see a problem with this line?

Not sure. What does the live "users" metric from gmond look like?
Definitely an interesting coincidence. In any case, we need to look
into how gmetad operates with rrdtool. Unfortunately, I am more the
"gmond" guy. Most important, we need to find out what triggers the
behaviour. Thanks for your patience.

> In the course of this investigation, I have come across another
> strange happening. Some of the metrics seem to be ... off. I have no
> idea if these things are related. I was surprised to notice that many
> of my servers show excessive time in the CPU_report graph as having
> all their time spent in CPU Wait.
> That didn't seem right and also didn't jibe with the output of
> vmstat. Looking at the individual metrics that make up the
> cpu_report, I see:
>
> * cpu_aidle: 1388
> * cpu_idle: 66.00
> * cpu_nice: 0.00
> * cpu_system: 2.30
> * cpu_user: 31.70
> * cpu_wio: 1388
>
> All 6 of these metrics are supposed to be percentages. What's up with
> 1,388? Both cpu_aidle and cpu_wio are linearly decreasing graphs with
> the same slope (and same current value). They look to be the same
> back into the shown history, but it's hard to be exact. This seems to
> be the case (with different current values) on a number of hosts.
>
> Two .pngs of hosts exhibiting this behavior are at
> http://cryptio.net/~ben/ganglia/host_report.png and
> http://cryptio.net/~ben/ganglia/host_report2.png
>
> Note that these stats were all created since I moved the old files
> out of place earlier today, so there is no chance of left-over
> corruption.
>
> Are my hosts dying? Restarting gmond on the host seems to have no
> effect.
>
> Would it be possible to create this kind of error by upgrading the
> server to gmetad 3.0.2 but leaving the clients running gmond 3.0.1?
> Yes, it seems to be the case. Upgrading the reporting host to 3.0.2
> fixes the strange cpu report symptom. That's kind of unfortunate,
> that gmond and gmetad are not compatible across a minor version like
> that. :(

Unfortunately, 3.0.0/3.0.1 were not the "most perfect releases in the
world" :-( We had some "accidents" with CVS and a few things got lost.
As a result, some metrics (like wio) were just completely broken. This
has been fixed in 3.0.2. So it is not a question of incompatibility,
but of gmond 3.0.1 reporting wrong numbers. My advice is definitely to
skip 3.0.0 and 3.0.1 for gmond. gmetad 3.0.1 should be OK with gmond
3.0.2, but I think just upgrade that too.

------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
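One more note on the crontab line and the "conversion of 'min,' to float" error: the field positions in the header line of `w` shift with the uptime format. With an uptime like "up 23 days, 4:10, 2 users, ..." the sixth awk field is indeed the user count, but with "up 2 days, 3 min, 2 users, ..." the sixth field is the literal token "min," - exactly the string gmetad failed to parse. A sketch of a more robust payload (the counting command is an assumption, not something tested on Ben's hosts):

```shell
#!/bin/sh
# Count login sessions directly instead of parsing `w`'s header line,
# whose field positions depend on how long the machine has been up
# ("up 2 days, 3 min," vs "up 23 days, 4:10,").  `who` prints one line
# per session, so its line count is the number of logged-in users.
users=$(who | wc -l | tr -d ' ')
echo "$users"
```

In the crontab itself this would become something like (path to gmetric assumed, as in the original line):

```shell
*/2 * * * * /usr/bin/gmetric --name="users" --value="$(who | wc -l)" --type=int16
```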