Re: [Ganglia-general] how to monitor switches?

2015-03-05 Thread Adam Tygart
Ganglia supports the *host* sflow counters. Not many switches export
those. I've only found Cumulus Linux switches that will do that.

--
Adam

On Thu, Mar 5, 2015 at 12:40 PM, Leslie geekg...@gmail.com wrote:
 Hi Aaron -

 I am not an expert on monitoring Cisco switches, but ganglia does
 support sflow counters, which many Cisco switches also support --
 http://blog.sflow.com/2010/10/ganglia.html
 http://www.cisco.com/c/en/us/td/docs/switches/datacenter/nexus3000/sw/system_mgmt/503_U4_1/b_3k_System_Mgmt_Config_503_u4_1/b_3k_System_Mgmt_Config_503_u4_1_chapter_010010.html#task_0AAC420D834048248B9407F9C8559918

 Hope this helps!
 Leslie

 On Thu, Mar 5, 2015 at 10:01 AM, Aaron hawaiiaa...@gmail.com wrote:
 Hi, I'd like to have ganglia monitor my switches, such as a Cisco SG300-20.
 What is the best method to do so?  I see that ganglia can be used
 with python modules, gmetric, some json...?  Any tips?   I see info on
 monitoring switches, but it seems to relate to nagios with snmp, not just
 ganglia?  Thanks, Aaron

 --
 Dive into the World of Parallel Programming The Go Parallel Website,
 sponsored
 by Intel and developed in partnership with Slashdot Media, is your hub for
 all
 things parallel software development, from weekly thought leadership blogs
 to
 news, videos, case studies, tutorials and more. Take a look and join the
 conversation now. http://goparallel.sourceforge.net/
 ___
 Ganglia-general mailing list
 Ganglia-general@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/ganglia-general




Re: [Ganglia-general] Ganglia for Windows

2012-03-28 Thread Adam Tygart
I'd recommend upgrading your Linux hosts to Ganglia >= 3.2, and using
Host sFlow on the Windows hosts.

Ganglia >= 3.2 can decode the host-sflow counters and incorporate
the hosts into the Ganglia databases.
http://host-sflow.sourceforge.net/

--
Adam

On Wed, Mar 28, 2012 at 13:35, Carlo Marcelo Arenas Belon
care...@sajinet.com.pe wrote:
 On Wed, Mar 28, 2012 at 12:49:00PM +0100, Burton, Steven wrote:

 Should I try those binaries or should I build a more recent version and
 if so, what version?

 3.0 and 3.1 are not compatible, so you can either:

 1) build new binaries for 3.1.7 or newer and deploy them on Windows
 2) downgrade your server to 3.0.7 (note that you need a couple of patches
   on top of it for security)

 Carlo



Re: [Ganglia-general] No folder for some hosts in /var/lib/ganglia/rrds/source, no graph on the web

2011-10-24 Thread Adam Tygart
There is an incompatibility between gmond 3.0.x and 3.1.x, as mentioned in
the release notes of 3.1.
http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_release_notes
You cannot mix these versions in the same cluster, otherwise you will see
the behavior you detailed below.
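
A quick way to spot a mixed cluster is to group hosts by gmond release series. The sketch below is only an illustration: the `inventory` mapping and both function names are assumptions, and the version strings would have to be collected by hand (for example from `gmond --version` on each host).

```python
def release_series(version):
    """Return the (major, minor) series of a version string such as '3.1.7'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def mixed_series(host_versions):
    """Group hosts by release series; more than one key means a mixed cluster."""
    groups = {}
    for host, version in host_versions.items():
        groups.setdefault(release_series(version), []).append(host)
    return groups


# Hypothetical inventory: host name -> gmond version collected on that host.
inventory = {"head": "3.1.7", "node01": "3.1.7", "node02": "3.0.7"}
groups = mixed_series(inventory)
if len(groups) > 1:
    for series, hosts in sorted(groups.items()):
        print("%d.%d.x: %s" % (series[0], series[1], ", ".join(sorted(hosts))))
```

Any host that lands in a different series than the collector is a candidate for the missing rrds folders and graphs.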

--
Adam
On Oct 24, 2011 10:49 PM, quanta quanta.li...@gmail.com wrote:


Re: [Ganglia-general] Node not visible on web-front end

2011-05-24 Thread Adam Tygart
Per http://ganglia.info/?p=269, the gmond communications between <=
3.0.x and >=3.1.x are incompatible. You will need to build ganglia
gmond 3.1.7 for your redhat 4 node.

--
Adam

On Tue, May 24, 2011 at 04:50, Govind govind.r...@gmail.com wrote:
 Hi,

 I have ganglia server running on version 3.1.7-1 (Redhat 5)

 While adding a new node with ganglia-gmond-3.0.7-1 (Redhat 4) from

 The node is not visible on web-front,


 from debug of gmetad daemon I can see that it is being monitored
 ==


 Source: [ScratchServer, step 15] has 2 sources
 134.xx.xx.xx
 134.abc.def.gh
 Data thread 1188870464 is monitoring [ScratchServer] data source
 134.xx.xx.xx


 134.abc.def.gh
 ==

 On node
 ==
 telnet localhost 8672, I can see xml data is being generated
 <?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
 <!DOCTYPE GANGLIA_XML [
    <!ELEMENT GANGLIA_XML (GRID|CLUSTER|HOST)*>

 --snip

 </CLUSTER>
 </GANGLIA_XML>
 Connection closed by foreign host.

 


 Can you please advise how to troubleshoot this.

 Thanks
 Govind



Re: [Ganglia-general] Node not visible on web-front end

2011-05-24 Thread Adam Tygart
Try restarting the gmond and gmetad on your collector node. In my
testing, they have gotten into a bad state after having a 3.0.x node
trying to report to a 3.1.x collector.

--
Adam

On Tue, May 24, 2011 at 10:26, Govind govind.r...@gmail.com wrote:
 Hi Adam,

 Thanks for pointing to the link.
 I have built a 3.1.7 rpm for the redhat 4 node, but the problem is still the same.
 The node is not visible on the web front end.

 Cheers
 Govind

 On Tue, May 24, 2011 at 1:26 PM, Adam Tygart adam.tyg...@gmail.com wrote:

 Per http://ganglia.info/?p=269, the gmond communications between <=
 3.0.x and >=3.1.x are incompatible. You will need to build ganglia
 gmond 3.1.7 for your redhat 4 node.

 --
 Adam

 On Tue, May 24, 2011 at 04:50, Govind govind.r...@gmail.com wrote:
  Hi,
 
  I have ganglia server running on version 3.1.7-1 (Redhat 5)
 
  While adding a new node with ganglia-gmond-3.0.7-1 (Redhat 4) from
 
  The node is not visible on web-front,
 
 
  from debug of gmetad daemon I can see that it is being monitored
  ==
 
 
  Source: [ScratchServer, step 15] has 2 sources
          134.xx.xx.xx
          134.abc.def.gh
  Data thread 1188870464 is monitoring [ScratchServer] data source
          134.xx.xx.xx
 
 
          134.abc.def.gh
  ==
 
  On node
  ==
  telnet localhost 8672, I can see xml data is being generated
  <?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
  <!DOCTYPE GANGLIA_XML [
     <!ELEMENT GANGLIA_XML (GRID|CLUSTER|HOST)*>

  --snip

  </CLUSTER>
  </GANGLIA_XML>
  Connection closed by foreign host.
 
  
 
 
  Can you please advise how to troubleshoot this.
 
  Thanks
  Govind
 
 


Re: [Ganglia-general] Gmetric spoof clarification

2009-10-16 Thread Adam Tygart
It should be --spoof 1.2.3.4:server1
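
To make the argument order concrete, here is a minimal sketch that builds such a gmetric command line without running it. The metric name, value, and units are made up for illustration, and both helper names are hypothetical; only the `--spoof IP:hostname` order comes from this thread.

```python
def spoof_arg(ip, hostname):
    """Format the gmetric --spoof value as IP:hostname (IP first)."""
    return "%s:%s" % (ip, hostname)


def gmetric_command(name, value, metric_type, ip, hostname, units=""):
    """Build (but do not run) a gmetric command line for a spoofed host."""
    cmd = ["gmetric", "--name", name, "--value", str(value),
           "--type", metric_type, "--spoof", spoof_arg(ip, hostname)]
    if units:
        cmd += ["--units", units]
    return cmd


# Hypothetical metric: HTTP response time measured from the head node for server1.
print(" ".join(gmetric_command("http_response_time", 0.42, "float",
                               "1.2.3.4", "server1", units="seconds")))
```

Passing the resulting list to `subprocess.run` would send the metric; that step is left out so the sketch has no side effects.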

--
Adam

On Fri, Oct 16, 2009 at 08:42, Vladimir Vuksan vli...@veus.hr wrote:
 I would like some clarification on how to use the --spoof
 option with gmetric. I am running HTTP response time checks from a
 head node that I would like to attach to a particular
 host, i.e. server1 - 1.2.3.4. If I run gmetric with the following argument

 --spoof server1:1.2.3.4

 a new server 1.2.3.4 shows up. Am I misunderstanding how this is supposed
 to work?

 Vladimir



Re: [Ganglia-general] Ganglia 3.1.2 -- Modules fail to load intermittently, Hosts disappear from the web interface, XML errors that can't be found

2009-05-15 Thread Adam Tygart
On Fri, May 15, 2009 at 04:32, Carlo Marcelo Arenas Belon
care...@sajinet.com.pe wrote:
 On Thu, May 14, 2009 at 11:47:41AM -0500, Adam Tygart wrote:

 I have been having a heck of a time diagnosing this problem.

 I suspect there are several problems here, which OS and architecture?

Gentoo Linux, webserver is x86, everything else is x86_64

 I recently updated to ganglia-3.1.2 from 3.0.7.

 3.1 and 3.0 are not compatible and can't be on the same cluster, so for
 this upgrade to be successful you should have done:

  1) upgrade your gmetad/web to 3.1.2
  2) upgrade all gmond to 3.1.2, cluster by cluster in batches

 more details to be found in :

  http://ganglia.wiki.sourceforge.net/ganglia_release_notes

I already updated the entire cluster. My webserver is running the
proper versions of gmetad/web and everything is running the new
version of gmond.


 Since then I have been
 plagued with (what looked like) data errors, mis-reporting swap usage
 was the easiest to see.

 Could you elaborate here? Is the value that gmond is collecting on each
 node incorrect? Is the aggregate in gmetad incorrect? Which one of the
 swap metrics is incorrect?

Aggregate swap data being incorrect is the easiest to see.
Here is the graph from a mis-reporting host (it doesn't always even
send this information): http://imgur.com/io8gu.png

Here is the resulting aggregate graph: http://imgur.com/trato.png
The beginning of this graph is showing the correct data, I simply
restarted gmond (on all non-webserver hosts), and the resulting swap
usage was from one of them failing to send the correct data.


 # uname -a
 Linux dell 2.6.28-gentoo-r5 #1 SMP Thu Apr 23 21:35:08 PDT 2009 x86_64 Intel(R) Core(TM)2 CPU 6320 @ 1.86GHz GenuineIntel GNU/Linux
 # gmond --version
 gmond 3.1.2
 # telnet 127.0.0.1 8649 | grep swap
 <METRIC NAME="swap_total" VAL="4008176" TYPE="float" UNITS="KB" TN="60" TMAX="1200" DMAX="0" SLOPE="zero">
 <EXTRA_ELEMENT NAME="DESC" VAL="Total amount of swap space displayed in KBs"/>
 Connection closed by foreign host.
 <METRIC NAME="swap_free" VAL="4008176" TYPE="float" UNITS="KB" TN="60" TMAX="180" DMAX="0" SLOPE="both">
 <EXTRA_ELEMENT NAME="DESC" VAL="Amount of available swap memory"/>
 # free | grep Swap
 Swap:      4008176          0    4008176

 This seems to be caused by some reporting
 modules failing to load. They fail silently, I don't see logs about it
 anywhere, and when I turn debugging on I still don't see anything.

 AFAIK if a module fails to load because of an error, it will just prevent
 gmond from starting at all (sometimes silently), as detailed in the Known Issues.

 If the module is not loaded but is still referenced by the configuration
 for collection, gmond will also be very noisy about it:

 # /etc/init.d/gmond start
  * Starting GANGLIA gmond:  ...
 Cannot locate internal module structure 'mem_module' in file (null): /usr/sbin/gmond: undefined symbol: mem_module
 Possibly an incorrect module language designation [(null)].
                                                                          [ ok ]
 # tail /var/log/syslog | grep gmond
 May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric information for 'mem_total'. Possible that the module has not been loaded.
 May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric information for 'swap_total'. Possible that the module has not been loaded.
 May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric information for 'mem_free'. Possible that the module has not been loaded.
 May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric information for 'mem_shared'. Possible that the module has not been loaded.
 May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric information for 'mem_buffers'. Possible that the module has not been loaded.
 May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric information for 'mem_cached'. Possible that the module has not been loaded.
 May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric information for 'swap_free'. Possible that the module has not been loaded.

 What makes you think the module is not being loaded, and that it is being
 silent about it? Does it show up in:

  # lsof -p `pidof gmond` | grep ganglia

I thought the module wasn't being loaded because the host was not
sending any data that would be gathered by that module to my reporting
host. I can now see that it is being loaded, just not sending all of
the data.

gmond   32678 nobody  mem    REG  8,3   22928   330627 /usr/lib64/ganglia/modpython.so
gmond   32678 nobody  mem    REG  8,3   97312   330621 /usr/lib64/ganglia/modsys.so
gmond   32678 nobody  mem    REG  8,3   96992   330624 /usr/lib64/ganglia/modproc.so
gmond   32678 nobody  mem    REG  8,3   97184   330630 /usr/lib64/ganglia/modnet.so
gmond   32678 nobody  mem    REG  8,3   97408   330613 /usr/lib64/ganglia/modmem.so
gmond   32678 nobody  mem    REG  8,3   97088   330636 /usr

Re: [Ganglia-general] Ganglia 3.1.2 -- Modules fail to load intermittently, Hosts disappear from the web interface, XML errors that can't be found

2009-05-15 Thread Adam Tygart
On Fri, May 15, 2009 at 10:28, Carlo Marcelo Arenas Belon
care...@sajinet.com.pe wrote:
 On Fri, May 15, 2009 at 08:42:33AM -0500, Adam Tygart wrote:
 On Fri, May 15, 2009 at 04:32, Carlo Marcelo Arenas Belon
 care...@sajinet.com.pe wrote:
  On Thu, May 14, 2009 at 11:47:41AM -0500, Adam Tygart wrote:
 
  Since then I have been
  plagued with (what looked like) data errors, mis-reporting swap usage
  was the easiest to see.
 
  could you elaborate here?, is the value that gmond is collecting on each
  node incorrect?, is the aggregated in gmetad incorrect?, which one of the
  swap metrics is incorrect?

 Aggregate swap data being incorrect is the easiest to see.
 Here is the graph from a mis-reporting host (it doesn't always even
 send this information): http://imgur.com/io8gu.png

 Here is the resulting aggregate graph: http://imgur.com/trato.png
 The beginning of this graph is showing the correct data, I simply
 restarted gmond (on all non-webserver hosts), and the resulting swap
 usage was from one of them failing to send the correct data.

 OK, the metric value is not incorrect, but is not being reported at all
 which is why you have dips on your graph that fix themselves after several
 minutes.

 This is sadly a known issue, because of the way that gmond registers metrics
 dynamically and the fact that some of those metrics aren't refreshed that
 frequently, as described in the Release Notes (which mention the very visible
 CPU count issue as an example). For more details, see the discussion
 at:

  http://www.mail-archive.com/ganglia-general@lists.sourceforge.net/msg04275.html

 And even though I agree it is a bug, it doesn't have a solution yet, and it is
 not seen unless a gmond (any of them) is restarted.

 A workaround is available: if you have to restart a gmond, first restart
 its collector (the one that gmetad is looking at, and that the rest point to
 when using unicast), and then restart ALL other gmond in the
 cluster.
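
The restart order described above could be sketched as follows. The host names and the `restart_order` helper are hypothetical, and the sketch only prints the order; it does not restart anything.

```python
def restart_order(collector, hosts):
    """Return hosts in restart order: the collector first, then every other gmond."""
    others = [h for h in hosts if h != collector]
    return [collector] + others


# Hypothetical cluster: 'head' is the gmond that gmetad polls.
for host in restart_order("head", ["rogue2", "head", "janus", "rogue8"]):
    # Printing only; how gmond is restarted (init script, ssh, ...) is site-specific.
    print("restart gmond on %s" % host)
```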

I should have specified how I got the graph. I had everything
working: modules were loaded, and everything was being reported. In an
attempt to reproduce the issue I was having, I restarted the first
collector. This caused the gmond for this host to stop reporting. I
let it sit; this is when the large break in the graph occurred. I then
restarted gmond on this host. It then only reported cached memory. I
restarted it again. It then started reporting all memory statistics
correctly.

During this period, the aggregate graph showed a drop in hosts, and
then a recovery. The recovery was when gmond on the reporting hosts
was restarted the first time. The graph then shows an unusual amount
of swap usage. This is not the real data. Once I restarted gmond on
the mis-reporting host again, the swap usage dropped.

  The question I have is this: is this a known bug?
 
  some are, like the unicast send_metadata_interval or the cpu_count
  inconsistency as shown by the Important Notes, some others might not be

 I haven't been able to find the Important Notes document, is there a
 link to this somewhere?

 sadly it is buried at the bottom of the Release Notes now :

  http://ganglia.wiki.sourceforge.net/ganglia_release_notes

 and yes I agree should be moved to a better place as well.

 Is the cpu_count inconsistency the piece I mentioned about hosts
 disappearing from the web interface?

 most likely the host disappearing from the web interface is because of
 the send_metadata_interval and you trying to restart the gmond to fix it.

 if it is not then we have a new bug ;)

The hosts are coming with every third or fourth manual refresh of the web-page.
Not all of them disappear, just some of them. Some hosts are more apt
to disappear than others: rogue2, janus, rogue10, rogue8, to name a few.
(I have re-added hosts to make this effect more obvious).
If you would like to look at the page for yourself, it is located
here: http://beocat.cis.ksu.edu/
For reference, 27 hosts should show up now.

  Is there something else I should try?
 
  roll back to 3.0, especially if you don't need the modules but want a more
  stable setup.

 This being Gentoo, I have no easy way of rolling back, as the 3.0.x
 builds have been removed from their tree.

 OK, IMHO having ganglia 3.0 in their tree as well, in a different slot,
 might be a good idea, but sadly I haven't filed it as a bug yet, nor can I
 provide a working ebuild in a public overlay as a solution.
 Of course, you can still build your own binaries/packages if needed.

 3.0 is still under development with 3.0.8 going to be released sometime soon
 and future releases focusing mainly on stability and compatibility with 3.1,
 as well as supporting all other architectures that are not yet working in
 3.1.

I have been tempted to roll-back, even if I have to roll my own build,
but I figured I would put in a real effort to make the new version
work for me.

 The whole reason I upgraded was because I wanted to make use

[Ganglia-general] Ganglia 3.1.2 -- Modules fail to load intermittently, Hosts disappear from the web interface, XML errors that can't be found

2009-05-14 Thread Adam Tygart
Hello everyone,

I have been having a heck of a time diagnosing this problem. I
recently updated to ganglia-3.1.2 from 3.0.7. Since then I have been
plagued with (what looked like) data errors; mis-reporting swap usage
was the easiest to see. This seems to be caused by some reporting
modules failing to load. They fail silently: I don't see logs about it
anywhere, and when I turn debugging on I still don't see anything.
Usually it is one of the modules, but I have occasionally had two
fail at the same time. modmem.so and modnet.so are the two that most
commonly fail.

I have restarted with a new gmond configuration, changing only the
configuration of multicast to unicast, and this problem persists. I
have wiped my old rrd data. I have tried everything I know that could
even remotely be to blame for this problem.

The question I have is this: is this a known bug? Is there something
else I should try? Can I force a module to be loaded?

When the modules do load, hosts report to gmond, and gmetad grabs that
data and logs it. My webserver then serves up the data through the
ganglia interface. The problem I am having here is that I get
intermittent xml errors, mostly saying that there is a missing  on
line $SomeLineNumber (always changes). Happens every 15 minutes or so.
I cannot reproduce any problems with the xml, however. I ran xmllint
on the xml once per second for an hour with no errors, during which time
the web interface failed to load twice.
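
A check along these lines could also be scripted. The sketch below is an assumption about how one might do it, not what was actually run: `fetch_xml` reads whatever the daemon sends on connect, and `xml_error` reports whether the result is well-formed. Both names are hypothetical, and the host and port to poll depend on the local configuration.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_xml(host, port, timeout=5.0):
    """Read everything the daemon sends until it closes the connection."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def xml_error(document):
    """Return None if the document is well-formed, else the parser's message."""
    try:
        ET.fromstring(document)
    except ET.ParseError as exc:
        return str(exc)
    return None


# Offline demonstration on two tiny documents; no network access is needed here.
print(xml_error(b"<GANGLIA_XML><CLUSTER/></GANGLIA_XML>"))
print(xml_error(b"<GANGLIA_XML><CLUSTER>"))
```

Calling `xml_error(fetch_xml(host, port))` in a loop with a one-second sleep would approximate the test described above, with the advantage that a truncated read shows up as a parse error with a line and column.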

I am also missing hosts from the web interface. The hosts (and
processors) get graphed properly on the composite graphs, but they
don't appear as down, or as up, they just disappear. I can enter
the hostname into the address bar, and get a current accurate graph
for it, though. Here is a screenshot of what I am talking about:
http://img.waffleimages.com/a47bc705ae3f5fd53a025e387ebbeb0c0841ad4a/Picture%2011.png

If you'll notice, processor count says 10, while the graph shows 14.
This is because the host (janus) is missing from the list. Once in a
while, it will show up correctly (for one refresh) then disappear
again.


I am sorry that I have written a daunting wall of text, but I am in
need of fixing these issues to properly roll-out the interface.

If it helps, ganglia was compiled on Gentoo through their build system
(portage).

Thanks,

Adam Tygart



Re: [Ganglia-general] Ganglia 3.1.2 -- Modules fail to load intermittently, Hosts disappear from the web interface, XML errors that can't be found

2009-05-14 Thread Adam Tygart
All of the XML is sent within the intranet. In fact, with this latest
test, all of the XML is being passed through one switch. This is a
1Gbps switch, with the switch itself being able to push 96Gbps split
across all ports. The network is currently pushing 1MBps, so I don't think
the network is maxed out. I have never had any packet loss within my
network, and large files are passed on this network daily.

I don't believe that the network is a factor for another simple reason: it
worked fine with ganglia 3.0.7 since it was installed, at least 6
months ago.

I also noted that I was running xmllint against the XML data both from
gmetad and gmond and it was unable to find any problem with the XML.

I did just have the web interface choke on the data again, the latest
error being: There was an error collecting ganglia data
(127.0.0.1:8655): XML error: Invalid document end at 1067

An immediate refresh (within seconds) and the interface was back.

Thanks for your quick response,
Adam

On Thu, May 14, 2009 at 11:58, Richard Edward Horner
r...@richhorner.com wrote:
 It may not be a problem with Ganglia. It may be a problem with your network.

 You're saying the line number in the error changes every time. That
 suggests to me that the transmission is getting fouled up at a
 different point each time which would be the expected behavior for an
 intermittent network problem. Is your network heavily taxed? Are all
 these machines local or do they talk over the WAN? Do you observe
 packet loss for anything? You may want to transfer some large files
 around and md5 them on the originating server and the destination
 server to see if they come across OK.
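
The checksum comparison suggested above could look roughly like this. It is a sketch, with `md5_of_file` being a hypothetical helper; run it on both the originating and the destination server and compare the two digests.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks keeps memory use flat, which matters when the test files are large enough to stress the network.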

 Rich(ard)

 On Thu, May 14, 2009 at 4:47 PM, Adam Tygart adam.tyg...@gmail.com wrote:
 Hello everyone,

 I have been having a heck of a time diagnosing this problem. I
 recently updated to ganglia-3.1.2 from 3.0.7. Since then I have been
 plagued with (what looked like) data errors; mis-reporting swap usage
 was the easiest to see. This seems to be caused by some reporting
 modules failing to load. They fail silently: I don't see logs about it
 anywhere, and when I turn debugging on I still don't see anything.
 Usually it is one of the modules, but I have occasionally had two
 fail at the same time. modmem.so and modnet.so are the two that most
 commonly fail.

 I have restarted with a new gmond configuration, changing only the
 configuration of multicast to unicast, and this problem persists. I
 have wiped my old rrd data. I have tried everything I know that could
 even remotely be to blame for this problem.

 The question I have is this: is this a known bug? Is there something
 else I should try? Can I force a module to be loaded?

 When the modules do load, hosts report to gmond, and gmetad grabs that
 data and logs it. My webserver then serves up the data through the
 ganglia interface. The problem I am having here is that I get
 intermittent xml errors, mostly saying that there is a missing  on
 line $SomeLineNumber (always changes). Happens every 15 minutes or so.
 I cannot reproduce any problems with the xml, however. I ran xmllint
 on the xml once per second for an hour with no errors, during which time
 the web interface failed to load twice.

 I am also missing hosts from the web interface. The hosts (and
 processors) get graphed properly on the composite graphs, but they
 don't appear as down, or as up, they just disappear. I can enter
 the hostname into the address bar, and get a current accurate graph
 for it, though. Here is a screenshot of what I am talking about:
 http://img.waffleimages.com/a47bc705ae3f5fd53a025e387ebbeb0c0841ad4a/Picture%2011.png

 If you'll notice, processor count says 10, while the graph shows 14.
 This is because the host (janus) is missing from the list. Once in a
 while, it will show up correctly (for one refresh) then disappear
 again.


 I am sorry that I have written a daunting wall of text, but I am in
 need of fixing these issues to properly roll-out the interface.

 If it helps, ganglia was compiled on Gentoo through their build system
 (portage).

 Thanks,

 Adam Tygart





 --
 Richard Edward Horner
 Engineer / Composer / Electric Guitar Virtuoso
 richhorner.com | rhosts.net | sabayonlinux.org

