It's the latter ... I have disjoint groups of hosts reporting to unique
headnodes, and the cluster name is the same across the headnodes, because
all the servers are partitions of the same machine. I'm using unicast.
The cluster view seems OK, although I haven't been working with ganglia
for very long yet ...
I continuously get the ganglia messages, but only for the
'__SummaryInfo__' category:
Aug 1 00:01:49 spdbinfpr1 user:info /opt/freeware/sbin/gmetad[745560]: RRD_update (/var/lib/ganglia/rrds/p570_spdbsctms1/__SummaryInfo__/cpu_system.rrd): illegal attempt to update using time 1185919308 when last update time is 1185919308 (minimum one second step)
Could it be because I in fact have several headnodes reporting info at
the same level ( __SummaryInfo__ )? If that is the case, how could it be
avoided?
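One common fix for colliding summaries is to give each headnode's gmond its own cluster name, so gmetad files each group's summary RRDs under a distinct directory. A minimal sketch of the relevant gmond.conf stanza, assuming a Ganglia 3.x-style gmond.conf; the names here are hypothetical:

```
# On headnode A (gmond.conf) - was the shared name on every headnode
cluster {
  name = "p570_lan_a"
}

# On headnode B (gmond.conf)
cluster {
  name = "p570_lan_b"
}
```

The web frontend will then show one cluster per headnode instead of several headnodes fighting over one __SummaryInfo__ directory.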
Thanks.
Regards. José Antonio.
richard grevis <[EMAIL PROTECTED]>
31/07/2007 16:48
To: Jose Antonio Jimenez Baena/Spain/[EMAIL PROTECTED]
cc: [EMAIL PROTECTED], Andrea Capriotti <[EMAIL PROTECTED]>, [email protected]
Subject: Re: [Ganglia-general] zillions of logged ganglia messages.
Jose,
I am a bit confused about your last paragraph. Do the hosts in your
cluster report to more than one headnode, or do you have disjoint groups
of hosts reporting to unique headnodes, except the cluster name is the
same across headnodes? It sounds like the latter.
But won't that mean the cluster level views will be messily wrong?
If you've done this because of firewalls and multicast, you could consider
unicast.
kind regards,
Richard
Quoting Jose Antonio Jimenez Baena <[EMAIL PROTECTED]>:
> in case it helps ... I'm also getting these errors, and I think that I'm
> in the scenario 2 case:
>
> " - gmetad.conf contains multiple entries for the same cluster.
> In this case you will get the RRDs hit twice and thus errors for both
> host level and cluster level ('summary' pathnames). This may not be
> obvious, because the cluster name comes from the gmond headnode, and
> *not* from the string in gmetad.conf. Think in terms of gmetad polling
> the same cluster data twice - it may be a dup in gmetad.conf, or it
> may be DNS aliases resolving to the same headnode. "
>
>
> I have several entries in gmetad.conf for the same cluster ... but
> it has to be that way, because I have different gmond headnodes
> reporting info from different groups of servers, which are all part of
> the same machine ( they are partitions (IBM POWER5 LPARs) ). I have
> different headnodes because these groups of servers are in different
> LANs behind a firewall.
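The setup described would correspond to a gmetad.conf along these lines (hostnames are hypothetical). Note that the quoted data_source names do not control the on-disk cluster name, which comes from each gmond:

```
# One data_source per firewalled LAN, each polling that LAN's gmond headnode.
data_source "lan_a" headnode-a.example.com:8649
data_source "lan_b" headnode-b.example.com:8649
```

If both headnodes announce the same cluster name, both sources feed the same __SummaryInfo__ RRDs and each polling cycle can hit them twice with the same timestamp.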
>
> Regards. José Antonio.
>
> richard grevis <[EMAIL PROTECTED]>
> Sent by: [EMAIL PROTECTED]
> 28/07/2007 03:04
>
> To: Andrea Capriotti <[EMAIL PROTECTED]>
> cc: [email protected]
> Subject: Re: [Ganglia-general] zillions of logged ganglia messages.
>
> Andrea,
>
> This kind of logged error is one that I *am* familiar with.
>
> When you get RRD update errors where the logged times are the same, e.g.
>
> Jul 27 15:40:43 tanabis /usr/sbin/gmetad[2941]: RRD_update (/dev/shm/ganglia/rrds/BCX_Linux_Cluster/__SummaryInfo__/mem_buffers.rrd): illegal attempt to update using time 1185543615 when last update time is 1185543615 (minimum one second step).
>
> This is different from the time-leaping-backwards problem.
>
> Scenario 1 (not your case), where the errors are on individual hosts'
> RRD updates. Possible errors:
>
> - There are 2 hosts with the same hostname inside a single cluster. This
>   will cause gmetad to update the RRD files for that host twice with the
>   same time in quick succession for the cluster.
> - In our case it was *always* because of errors in the reverse DNS
>   lookups of hosts on the gmond headnode. The headnode only sees the IP
>   address of the UDP packet flung at it, and does a reverse DNS lookup
>   to get the hostname. Do a netcat of the cluster headnode and see if a
>   host is mentioned twice.
>
> Scenario 2 (maybe your case) -
>
> - There were times when I saw this and couldn't track it down, and
>   times when I did find the problem. Possible causes:
> - gmetad.conf contains multiple entries for the same cluster.
>   In this case you will get the RRDs hit twice and thus errors for both
>   host level and cluster level ('summary' pathnames). This may not be
>   obvious, because the cluster name comes from the gmond headnode, and
>   *not* from the string in gmetad.conf. Think in terms of gmetad polling
>   the same cluster data twice - it may be a dup in gmetad.conf, or it
>   may be DNS aliases resolving to the same headnode.
> - Errors for 'summary' RRDs and not host RRDs? You almost certainly have
>   two distinct clusters that share the same cluster name at the grid
>   level. netcat the gmetad server on port 8651 and look for cluster
>   dups. Also nc all clusters and grep for identical cluster names on
>   distinct clusters. It will be a headnode's gmond.conf that needs
>   fixing.
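The port-8651 check can be sketched the same way, under the assumption that duplicate cluster names at the grid level show up as repeated CLUSTER elements in gmetad's XML (the hostname and demo XML are made up):

```shell
#!/bin/sh
# List cluster names that appear more than once in gmetad's grid XML.
# Real use (hypothetical hostname):  nc gmetad-host.example.com 8651 | dup_clusters
dup_clusters() {
  grep -o 'CLUSTER NAME="[^"]*"' | sort | uniq -d
}

# Demo: two headnodes both announcing the cluster name "p570" would
# collide in the __SummaryInfo__ RRDs.
printf '%s\n' \
  '<CLUSTER NAME="p570" LOCALTIME="1185534726">' \
  '<CLUSTER NAME="web_farm" LOCALTIME="1185534730">' \
  '<CLUSTER NAME="p570" LOCALTIME="1185534731">' \
  | dup_clusters
# prints: CLUSTER NAME="p570"
```

Any name the check prints belongs to a gmond.conf that should be given a unique cluster name.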
>
> Use multicast? Too scary, you are on your own.
>
> Restart all gmetads and gmonds too, if you have been moving stuff
> around.
>
> I wrote some perl scripts to munch through clusters and grids for
> errors. Try mailing Matt Toy of Barcap and see if his scripts can help
> you.
>
> Quoting Andrea Capriotti <[EMAIL PROTECTED]>:
>
> > On Fri, 27/07/2007 at 07.20 +0100, richard grevis wrote:
> > > This used to happen intermittently with us too. In our case
> > >
> > > - It occurred with gmond data from windows hosts.
> > > - The data from the agent leapt back about a month and a half,
> > >   but your leap seems to be 2 years.
> > > - THE TIMES ON SERVER AND AGENT WERE FINE AND CORRECT.
> > > - It was the gmond that reported wrong times.
> >
> > Same problem here, after a migration to a new machine, OS (SUSE SLES 9
> > from Fedora) and ganglia version (3.0.3 from 3.0.0):
> >
> > Jul 27 15:40:43 tanabis /usr/sbin/gmetad[2941]: RRD_update (/dev/shm/ganglia/rrds/BCX_Linux_Cluster/__SummaryInfo__/mem_total.rrd): illegal attempt to update using time 1185543615 when last update time is 1185543615 (minimum one second step)
> > Jul 27 15:40:43 tanabis /usr/sbin/gmetad[2941]: RRD_update (/dev/shm/ganglia/rrds/BCX_Linux_Cluster/__SummaryInfo__/cpu_aidle.rrd): illegal attempt to update using time 1185543615 when last update time is 1185543615 (minimum one second step)
> > Jul 27 15:40:43 tanabis /usr/sbin/gmetad[2941]: RRD_update (/dev/shm/ganglia/rrds/BCX_Linux_Cluster/__SummaryInfo__/bytes_in.rrd): illegal attempt to update using time 1185543615 when last update time is 1185543615 (minimum one second step)
> > Jul 27 15:40:43 tanabis /usr/sbin/gmetad[2941]: RRD_update (/dev/shm/ganglia/rrds/BCX_Linux_Cluster/__SummaryInfo__/mem_buffers.rrd): illegal attempt to update using time 1185543615 when last update time is 1185543615 (minimum one second step)
> > Jul 27 15:40:43 tanabis /usr/sbin/gmetad[2941]: RRD_update (/dev/shm/ganglia/rrds/BCX_Linux_Cluster/__SummaryInfo__/mem_shared.rrd): illegal attempt to update using time 1185543615 when last update time is 1185543615 (minimum one second step)
> > Jul 27 15:40:43 tanabis /usr/sbin/gmetad[2941]: RRD_update (/dev/shm/ganglia/rrds/BCX_Linux_Cluster/__SummaryInfo__/swap_total.rrd): illegal attempt to update using time 1185543615 when last update time is 1185543615 (minimum one second step)
> > Jul 27 15:40:43 tanabis /usr/sbin/gmetad[2941]: RRD_update (/dev/shm/ganglia/rrds/BCX_Linux_Cluster/__SummaryInfo__/part_max_used.rrd): illegal attempt to update using time 1185543615 when last update time is 1185543615 (minimum one second step)
> >
> > Always with the same timestamp.
> >
> > We have 6 data sources, a central gmetad on the web frontend machine,
> > and all the cluster nodes are synchronized with an NTP server.
> >
> > For example:
> >
> > # telnet xxx.xxx.xxx.xxx 8651 | grep LOCALTIME
> > <GRID NAME="CINECA" AUTHORITY="http://xxxxxxx" LOCALTIME="1185534734">
> > <CLUSTER NAME="BCC_Linux_Cluster" LOCALTIME="1185534726"
OWNER="CINECA"
> > LATLONG="unspecified" URL="http://xxxxxx">
> >
> > Any idea?
> >
> > Best Regards
> > --
> > Andrea Capriotti
> > System Management Group - Cineca - www.cineca.it
> > [EMAIL PROTECTED] - Tel +39 051 6171890
> >
> >
> >
>
-------------------------------------------------------------------------
> > This SF.net email is sponsored by: Splunk Inc.
> > Still grepping through log files to find problems? Stop.
> > Now Search log events and configuration files using AJAX and a
browser.
> > Download your FREE copy of Splunk now >> http://get.splunk.com/
> > _______________________________________________
> > Ganglia-general mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/ganglia-general
> >
>
>
> --
> kind regards,
> Richard
>
>
>
>