Hi Ben,

 see below. In any case, could you please open a case in Bugzilla and
assign it to me?

Martin

--- Ben Hartshorne <[EMAIL PROTECTED]> wrote:

> 
> Everyone,
> 
> thanks very much for your suggestions.  I've replied to each below.
> 
> 
> On Tue, Jan 24, 2006 at 04:16:08AM -0800, Martin Knoblauch wrote:
> >  just a thought - are your cluster nodes time-synched? Are they
> [still]
> > in-synch?
> 
> to within a second or so.  I also have several gmetrics running at a
> 2-min interval, and they exhibit the same behavior.  I would be
> surprised to see them reporting the same second, 2 minutes apart...
>

 OK. That seems clean.

 [snip]
 
> 
> On Tue, Jan 24, 2006 at 04:46:50PM -0500, Rick Mohr wrote:
> > Also, you could use rrdtool to generate the exact same graph that
> > is shown on the web page for one of these metrics and dump it
> > straight into a file.  Then you could compare that with the image
> > seen on the web page (to check for the unlikely event that the
> > generated image is fine, but the web server is messing something
> > up).
> 
> hmm... that's a good suggestion.  
> 
> Here's an excerpt from 'rrdtool dump':
> 
> <!-- 2006-01-24 17:36:45 PST / 1138153005 --> <row><v> 9.3154666667e+00 </v></row>
> <!-- 2006-01-24 17:37:00 PST / 1138153020 --> <row><v> 8.8000000000e+00 </v></row>
> <!-- 2006-01-24 17:37:15 PST / 1138153035 --> <row><v> 8.8000000000e+00 </v></row>
> <!-- 2006-01-24 17:37:30 PST / 1138153050 --> <row><v> 8.8000000000e+00 </v></row>
> <!-- 2006-01-24 17:37:45 PST / 1138153065 --> <row><v> 8.8000000000e+00 </v></row>
> <!-- 2006-01-24 17:38:00 PST / 1138153080 --> <row><v> NaN </v></row>
> <!-- 2006-01-24 17:38:15 PST / 1138153095 --> <row><v> NaN </v></row>
> <!-- 2006-01-24 17:38:30 PST / 1138153110 --> <row><v> NaN </v></row>
> <!-- 2006-01-24 17:38:45 PST / 1138153125 --> <row><v> NaN </v></row>
> <!-- 2006-01-24 17:39:00 PST / 1138153140 --> <row><v> NaN </v></row>
> 
> Correspondingly, in the graph seen through ganglia, the data ends
> about 17:38.  I'm surprised it's registering these things every 15
> seconds!  I thought the period was slower than that (every minute).
> 
> I checked a few other rrds at different resolutions, and the NaN
> sections do correspond to the blank parts.
> 
> So what does it mean?  This tells us that the data is not getting put
> into the rrds.  We know that the values are getting to the collector
> host, because clicking on the 'gmetric' portion of the website shows
> current data.  But that data is not making it into the RRD somehow...
> 
> I thought maybe the RRDs had become corrupted somehow, so I tried
> moving the rrds out of the way so ganglia would recreate them all.
> The symptom persisted.
> 
> 
> I don't see that error message, but while looking for it, I did see
> this one:
> 
> Jan 24 17:24:18 localhost /usr/sbin/gmetad[30443]: RRD_update
> (/var/lib/ganglia/rrds/production/raiden-8-db1/users.rrd): conversion
> of 'min,' to float not complete: tail 'min,'
> 
> This seems to relate to a recent change I made that I had forgotten
> about.  :)  I added the following line to my crontab:
> 
> */2 * * * * /usr/bin/gmetric --name="users" --value=`w | head -1 |
> awk '{print $6}'` --type=int16
> 
> The purpose of this line is to create a graph of the number of
> logged-in users on the host.  It seems right to me - do any of you
> see a problem with this line?
> 

 Not sure. What does the live "users" metric from gmond look like?
Definitely an interesting coincidence.
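
 One hedged guess about that crontab line, though: `w`'s header is
position-sensitive, so field 6 is only the user count for some uptime
formats. That would explain the 'min,' in the RRD_update error above.
A sketch of a more robust variant (untested on your setup):

```shell
# Field 6 of `w`'s header line moves with the uptime format, e.g.:
#   17:24:18 up 23 days,  1:02,  3 users, ...  -> field 6 is "3"
#   17:24:18 up  1 day, 12 min,  3 users, ...  -> field 6 is "min,"
# Counting login sessions with `who` avoids parsing the header at all:
who | wc -l

# Hypothetical replacement for the crontab entry:
# */2 * * * * /usr/bin/gmetric --name="users" --value=$(who | wc -l) --type=int16
```

That would still only count login sessions, not unique users, but it
should at least always hand gmetric a number.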

 In any case, we need to look into how "gmetad" operates with rrdtool.
Unfortunately, I am more the "gmond" guy.

 Most importantly, we need to find out what triggers the behaviour.
Thanks for your patience.
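
 As a first step, it might help to pin down the exact timestamp where
each RRD goes blank, so the gaps can be correlated with the gmetad
log. A rough sketch, assuming the dump format in your excerpt
(`first_nan` is just a name I made up):

```shell
# Print the timestamp of the first NaN row in an `rrdtool dump` read
# from stdin, i.e. the point where the stored data stops.
first_nan() {
    awk '/<v> NaN <\/v>/ {
        sub(/^<!-- /, "");   # strip the leading comment marker
        sub(/ \/ .*/, "");   # strip the epoch and the rest of the row
        print; exit
    }'
}

# Example, using two rows from the excerpt above:
printf '%s\n' \
  '<!-- 2006-01-24 17:37:45 PST / 1138153065 --> <row><v> 8.8000000000e+00 </v></row>' \
  '<!-- 2006-01-24 17:38:00 PST / 1138153080 --> <row><v> NaN </v></row>' \
  | first_nan
# prints: 2006-01-24 17:38:00 PST
```

On the collector you would run it as something like
`rrdtool dump /var/lib/ganglia/rrds/production/raiden-8-db1/users.rrd | first_nan`.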

> 
> 
> In the course of this investigation, I have come across another
> strange happening.  Some of the metrics seem to be ... off.  I have
> no idea if these things are related.  I was surprised to notice that
> many of my servers show excessive time in the cpu_report graph as
> having all their time spent in CPU Wait.  That didn't seem right and
> also didn't jibe with the output of vmstat.  Looking at the
> individual metrics that make up the cpu_report, I see:
> 
> * cpu_aidle: 1388
> * cpu_idle: 66.00
> * cpu_nice: 0.00
> * cpu_system: 2.30
> * cpu_user: 31.70
> * cpu_wio: 1388
> 
> All 6 of these metrics are supposed to be percentages.  What's up
> with 1,388?  Both cpu_aidle and cpu_wio are linearly decreasing
> graphs with the same slope (and same current value).  They look to be
> the same back into the shown history, but it's hard to be exact.
> This seems to be the case (with different current values) on a number
> of hosts.
> 
> Two .pngs of hosts exhibiting this behavior are at
> http://cryptio.net/~ben/ganglia/host_report.png and
> http://cryptio.net/~ben/ganglia/host_report2.png
> 
> Note that these stats have all been created since I moved the old
> files out of the way earlier today, so there is no chance of leftover
> corruption.
> 
> Are my hosts dying?  Restarting gmond on the host seems to have no
> effect.
> 
> Would it be possible to create this kind of error by upgrading the
> server to gmetad 3.0.2 but leaving the clients running gmond 3.0.1?
> Yes, that seems to be the case.  Upgrading the reporting host to
> 3.0.2 fixes the strange cpu report symptom.  It's kind of unfortunate
> that gmond and gmetad are not compatible across a minor version like
> that.  :(
> 

 Unfortunately, 3.0.0/3.0.1 were not the "most perfect releases in the
world" :-( We had some "accidents" with CVS and a few things got lost.
As a result, some metrics (like wio) were just completely broken. This
has been fixed in 3.0.2. So it is not a question of incompatibility,
but of gmond-3.0.1 reporting wrong values. My advice is definitely to
skip 3.0.0 and 3.0.1 for "gmond". gmetad-3.0.1 should be OK with
gmond-3.0.2, but I think you should just upgrade that too.
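
 For what it's worth, a trivial range check over the numbers you
posted flags exactly the two broken metrics (`check_pct` is just an
illustrative helper name, not anything shipped with ganglia):

```shell
# Flag any metric that is supposed to be a percentage but is out of
# the 0..100 range.
check_pct() {  # usage: check_pct <name> <value>
    awk -v n="$1" -v v="$2" \
        'BEGIN { if (v + 0 < 0 || v + 0 > 100) print n " out of range: " v }'
}

check_pct cpu_aidle 1388    # prints: cpu_aidle out of range: 1388
check_pct cpu_idle  66.00   # prints nothing
check_pct cpu_wio   1388    # prints: cpu_wio out of range: 1388
```

Something like that, run against the XML from "telnet gmetad-host",
could be a cheap sanity check after the upgrade.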



------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de
