Re: [Ganglia-general] zillions of loged ganglia messages.

Jose Antonio Jimenez Baena Mon, 30 Jul 2007 02:48:38 -0700

In case it helps: I'm also getting these errors, and I think I'm in the
scenario 2 case:


" - gmetad.conf contains multiple entries for the same cluster.
  In this case you will get the RRDs hit twice and thus errors for both
  host level and cluster level ('summary' pathnames). This may not be
  obvious, because the cluster name comes from the gmond headnode, and
  *not* from the string in gmetad.conf. Think in terms of gmetad polling
  the same cluster data twice - it may be a dup in gmetad.conf, or it
  may be DNS aliases resolving to the same headnode. "


I do have several entries in gmetad.conf for the same cluster, but it has
to be that way: I have different gmond headnodes reporting info from
different groups of servers, which are all part of the same machine (they
are partitions, IBM POWER5 LPARs). I need different headnodes because these
groups of servers are in different LANs behind a firewall.
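
To rule out a plain duplicate in gmetad.conf (as opposed to different
headnodes reporting the same cluster name), a check along these lines could
be used. It is only a sketch: it assumes the usual
`data_source "name" [interval] host[:port] ...` line format and that a
missing port means 8649, and the function name is made up for illustration.
It compares the addresses as written, so it will not notice DNS aliases.

```python
import shlex
from collections import defaultdict

DEFAULT_PORT = "8649"  # assumption: gmond's usual TCP port when none is given


def shared_addresses(conf_text):
    """Return {address: [data_source names]} for addresses that appear
    under more than one data_source line (or twice in total)."""
    owners = defaultdict(list)
    for line in conf_text.splitlines():
        tokens = shlex.split(line, comments=True)  # drops '#' comments
        if len(tokens) < 3 or tokens[0] != "data_source":
            continue
        name, rest = tokens[1], tokens[2:]
        if rest[0].isdigit():  # optional polling interval in seconds
            rest = rest[1:]
        for addr in rest:
            addr = addr.lower()
            if ":" not in addr:
                addr = addr + ":" + DEFAULT_PORT
            owners[addr].append(name)
    return {a: names for a, names in owners.items() if len(names) > 1}
```

An empty result only means no address is literally repeated; two headnodes
can still report the same cluster name, which is what I suspect in my case.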

Regards. José Antonio.








richard grevis <[EMAIL PROTECTED]> 
Sent by: [EMAIL PROTECTED]
28/07/2007 03:04

To
Andrea Capriotti <[EMAIL PROTECTED]>
cc
[email protected]
Subject
Re: [Ganglia-general] zillions of loged ganglia messages.






Andrea,

This kind of logged error is one that I *am* familiar with.

When you get RRD update errors where the logged times are the same, e.g. -
    Jul 27 15:40:43 tanabis /usr/sbin/gmetad[2941]: RRD_update
    (/dev/shm/ganglia/rrds/BCX_Linux_Cluster/__SummaryInfo__/mem_buffers.rrd):
    illegal attempt to update using time 1185543615 when last update time
    is 1185543615 (minimum one second step).

This is different to the time leaping backwards problem.
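
To tell the two kinds apart in a big log, the two timestamps in each message
can simply be compared. A rough sketch in Python follows; the regular
expression is inferred from the sample message above rather than from the
gmetad source, and the function name and the category labels are made up for
illustration.

```python
import re
from collections import Counter

# Pattern inferred from the logged message quoted above (an assumption).
RRD_ERR = re.compile(
    r"RRD_update\s*\((?P<path>[^)]+)\):\s*illegal\s+attempt\s+to\s+update\s+"
    r"using\s+time\s+(?P<new>\d+)\s+when\s+last\s+update\s+time\s+is\s+"
    r"(?P<last>\d+)"
)


def classify(log_text):
    """Count RRD_update errors keyed by (kind, level)."""
    counts = Counter()
    for m in RRD_ERR.finditer(log_text):
        new, last = int(m.group("new")), int(m.group("last"))
        if new == last:
            kind = "same_time"        # the double-update case discussed here
        elif new < last:
            kind = "time_went_back"   # the time leaping backwards problem
        else:
            kind = "other"
        # Summary RRDs live under __SummaryInfo__, host RRDs under a host dir.
        level = "summary" if "/__SummaryInfo__/" in m.group("path") else "host"
        counts[(kind, level)] += 1
    return counts
```

Feeding it the contents of the syslog file gives a quick idea of whether the
errors are "same_time" ones, and whether they hit summary or host RRDs.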

Scenario 1 (not your case), where the errors are on individual hosts' RRD
updates. Possible causes:

- There are 2 hosts with the same hostname inside a single cluster. This
  will cause gmetad to update the RRD files for that host twice with the
  same time in quick succession for the cluster.
- In our case it was *always* because of errors in the reverse DNS lookups
  of hosts on the gmond headnode. The headnode only sees the IP address of
  the UDP packet flung at it, and does a reverse DNS lookup to get the
  hostname. Do a netcat of the cluster headnode and see if a host is
  mentioned twice.
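
The "host mentioned twice" check can be done by script rather than by eye.
A minimal sketch in Python, assuming the headnode's XML has HOST elements
with NAME and IP attributes inside each CLUSTER, and that gmond listens on
TCP port 8649 (the usual default, adjust to your gmond.conf). The function
names are made up for illustration.

```python
import socket
import xml.etree.ElementTree as ET
from collections import defaultdict


def fetch_xml(host, port=8649, timeout=10.0):
    """Read the whole XML dump sent on connect, like `nc host port`."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def duplicate_hosts(xml_data):
    """Return {cluster: {hostname: [ip, ...]}} for hostnames listed twice."""
    root = ET.fromstring(xml_data)
    dups = {}
    for cluster in root.iter("CLUSTER"):
        ips_by_name = defaultdict(list)
        for host in cluster.findall("HOST"):
            ips_by_name[host.get("NAME")].append(host.get("IP"))
        repeated = {n: ips for n, ips in ips_by_name.items() if len(ips) > 1}
        if repeated:
            dups[cluster.get("NAME")] = repeated
    return dups
```

Something like `duplicate_hosts(fetch_xml("headnode"))`, with your own
headnode name, then lists each repeated hostname with the IP addresses
behind it, which are the ones to check in reverse DNS.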

Scenario 2 (maybe your case) -

- There were times when I saw this and couldn't track it down, and
  times when I did find the problem.  Possible causes:
- gmetad.conf contains multiple entries for the same cluster.
  In this case you will get the RRDs hit twice and thus errors for both
  host level and cluster level ('summary' pathnames). This may not be
  obvious, because the cluster name comes from the gmond headnode, and
  *not* from the string in gmetad.conf. Think in terms of gmetad polling
  the same cluster data twice - it may be a dup in gmetad.conf, or it
  may be DNS aliases resolving to the same headnode.
- Errors for 'summary' RRDs and not host RRDs? You almost certainly have
  two distinct clusters that have the same cluster name at the grid level.
  netcat the gmetad server on port 8651 and look for cluster dups. Also
  nc all clusters and grep for identical cluster names on distinct
  clusters. It will be the gmond.conf on a headnode that needs fixing.
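
The cluster-name checks can be scripted along the same lines. A small Python
sketch, assuming you have saved the XML from `nc` of the gmetad server (port
8651) and of each headnode; it only looks at the NAME attribute of CLUSTER
elements, and the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def cluster_names(xml_data):
    """All CLUSTER names found in one XML dump, in document order."""
    root = ET.fromstring(xml_data)
    return [c.get("NAME") for c in root.iter("CLUSTER")]


def duplicate_clusters(grid_xml):
    """Cluster names that appear more than once in a gmetad dump."""
    counts = Counter(cluster_names(grid_xml))
    return sorted(name for name, n in counts.items() if n > 1)


def shared_cluster_names(dumps):
    """dumps maps a headnode label to its XML dump.

    Returns {cluster_name: [labels]} for names reported by several headnodes.
    """
    sources = defaultdict(list)
    for label, xml_data in dumps.items():
        for name in set(cluster_names(xml_data)):
            sources[name].append(label)
    return {n: sorted(s) for n, s in sources.items() if len(s) > 1}
```

Any name returned by `shared_cluster_names` points at headnodes whose
gmond.conf gives the same cluster name.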

Use multicast? Too scary, you are on your own.

Restart all gmetads and gmonds too, if you have been moving stuff around.

I wrote some Perl scripts to munch through clusters and grids for errors.
Try mailing Matt Toy of Barcap and see if some of the scripts can help you.

Quoting Andrea Capriotti <[EMAIL PROTECTED]>:

> On Fri, 27/07/2007 at 07.20 +0100, richard grevis wrote:
> > This used to happen intermittently with us too. In our case
> > 
> > - It occurred with gmond data from Windows hosts.
> > - The data from the agent leapt back about a month and a half,
> >   but your leap seems to be 2 years.
> > - THE TIMES ON SERVER AND AGENT WERE FINE AND CORRECT.
> > - It was the gmond that reported wrong times.
> 
> Same problem here, after a migration to a new machine, OS (SUSE SLES 9
> from Fedora) and ganglia version (3.0.3 from 3.0.0):
> 
> Jul 27 15:40:43 tanabis /usr/sbin/gmetad[2941]: RRD_update
> (/dev/shm/ganglia/rrds/BCX_Linux_Cluster/__SummaryInfo__/mem_total.rrd):
> illegal attempt to update using time 1185543615 when last update time is
> 1185543615 (minimum one second step) 
> Jul 27 15:40:43 tanabis /usr/sbin/gmetad[2941]: RRD_update
> (/dev/shm/ganglia/rrds/BCX_Linux_Cluster/__SummaryInfo__/cpu_aidle.rrd):
> illegal attempt to update using time 1185543615 when last update time is
> 1185543615 (minimum one second step) 
> Jul 27 15:40:43 tanabis /usr/sbin/gmetad[2941]: RRD_update
> (/dev/shm/ganglia/rrds/BCX_Linux_Cluster/__SummaryInfo__/bytes_in.rrd):
> illegal attempt to update using time 1185543615 when last update time is
> 1185543615 (minimum one second step) 
> Jul 27 15:40:43 tanabis /usr/sbin/gmetad[2941]: RRD_update
> (/dev/shm/ganglia/rrds/BCX_Linux_Cluster/__SummaryInfo__/mem_buffers.rrd):
> illegal attempt to update using time 1185543615 when last update time is
> 1185543615 (minimum one second step)
> Jul 27 15:40:43 tanabis /usr/sbin/gmetad[2941]: RRD_update
> (/dev/shm/ganglia/rrds/BCX_Linux_Cluster/__SummaryInfo__/mem_shared.rrd):
> illegal attempt to update using time 1185543615 when last update time is
> 1185543615 (minimum one second step)
> Jul 27 15:40:43 tanabis /usr/sbin/gmetad[2941]: RRD_update
> (/dev/shm/ganglia/rrds/BCX_Linux_Cluster/__SummaryInfo__/swap_total.rrd):
> illegal attempt to update using time 1185543615 when last update time is
> 1185543615 (minimum one second step)
> Jul 27 15:40:43 tanabis /usr/sbin/gmetad[2941]: RRD_update
> (/dev/shm/ganglia/rrds/BCX_Linux_Cluster/__SummaryInfo__/part_max_used.rrd):
> illegal attempt to update using time 1185543615 when last update time is
> 1185543615 (minimum one second step)
> 
> Always with the same timestamp.
> 
> We have 6 data sources, a central gmetad in the web frontend machine and
> all the cluster nodes are synchronized with an ntp server.
> 
> For example:
> 
> # telnet xxx.xxx.xxx.xxx 8651 | grep LOCALTIME
> <GRID NAME="CINECA" AUTHORITY="http://xxxxxxx" LOCALTIME="1185534734">
> <CLUSTER NAME="BCC_Linux_Cluster" LOCALTIME="1185534726" OWNER="CINECA"
> LATLONG="unspecified" URL="http://xxxxxx">
> 
> Any idea?
> 
> Best Regards 
> -- 
> Andrea Capriotti
> System Management Group - Cineca - www.cineca.it
> [EMAIL PROTECTED] - Tel +39 051 6171890
> 
> 
> 
> -------------------------------------------------------------------------
> This SF.net email is sponsored by: Splunk Inc.
> Still grepping through log files to find problems?  Stop.
> Now Search log events and configuration files using AJAX and a browser.
> Download your FREE copy of Splunk now >>  http://get.splunk.com/
> _______________________________________________
> Ganglia-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/ganglia-general
> 


-- 
kind regards,
Richard

