I would definitely consider running rrdcached
backed by SSDs; that is what I use.
Ganglia 3.7.0, which is currently in testing, has some additional
performance enhancements, but I think your issue really is I/O.
Vladimir
On 05/19/2014 10:46 AM, Cristovao Jose Domingues Cordeiro wrote:
Hi,
I am using
Best regards,
Cristóvão José Domingues Cordeiro
The "Error 1 sending the modular data" messages are a
red herring.
If you are seeing gaps, it's most likely that the storage
system is not keeping up. What version of Ganglia are you
using, and are you using rrdcached?
Vladimir
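To put a rough number on "the storage system is not keeping up": gmetad keeps one RRD file per metric per host and updates it on every poll, so the write rate grows with hosts times metrics. The sketch below is only a back-of-the-envelope estimate. The 500 VMs come from the original message; the 30 metrics per host and the 15-second polling interval are assumed values, and the function name is hypothetical.

```python
def rrd_updates_per_second(hosts: int, metrics_per_host: int, poll_interval_s: float) -> float:
    """Rough number of RRD file updates issued per second.

    Ignores summary RRDs and any batching done by rrdcached.
    """
    return hosts * metrics_per_host / poll_interval_s


# 500 hosts is from the thread; 30 metrics per host and a 15 s
# polling interval are ASSUMED values for illustration only.
rate = rrd_updates_per_second(hosts=500, metrics_per_host=30, poll_interval_s=15)
print(f"{rate:.0f} RRD updates per second")  # prints "1000 RRD updates per second"
```

Each update is a small random write, which is the kind of load that spinning disks tend to handle poorly without some form of write caching.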
On 05/19/2014 10:20 AM, Cristovao Jose Domingues Cordeiro
wrote:
Hi,
This is happening on two completely separate Ganglia
headnodes (both deployed the same way).
I'm monitoring about 500 VMs on each headnode,
grouped into clusters of different sizes. From time to
time, the summary graphs for some clusters stop
reporting, showing zero activity, and then after a
while they suddenly come back again.
This is very undesirable since I end up with several
white "holes" per day on each cluster.
The information I can give you so far is the following:
- The attached image shows what happens
- I have a master-slave type of configuration, where
the collector gmonds sit on the same machine
(the headnode) as gmetad and ganglia-web, and where
all the gmond nodes report their metrics
over unicast to the headnode.
- I have the latest Ganglia versions running (both
core and web)
- All VMs are based on SL6
- When I look at /var/log/messages I see a lot of
this:
May 19 16:14:36 gangliamon gmond[22292]: Error 1 sending the modular data for pkts_out#012
May 19 16:14:36 gangliamon gmond[22292]: Error 1 sending the modular data for heartbeat#012
May 19 16:14:36 gangliamon gmond[22292]: Error 1 sending the modular data for cpu_user#012
May 19 16:14:36 gangliamon gmond[22292]: Error 1 sending the modular data for cpu_system#012
May 19 16:14:36 gangliamon gmond[22292]: Error 1 sending the modular data for cpu_idle#012
May 19 16:14:36 gangliamon gmond[22292]: Error 1 sending the modular data for cpu_nice#012
May 19 16:14:36 gangliamon gmond[22292]: Error 1 sending the modular data for cpu_aidle#012
May 19 16:14:36 gangliamon gmond[22292]: Error 1 sending the modular data for cpu_wio#012
May 19 16:14:36 gangliamon gmond[22292]: Error 1 sending the modular data for cpu_steal#012
May 19 16:14:37 gangliamon gmond[22304]: Error 1 sending the modular data for heartbeat#012
May 19 16:14:38 gangliamon gmond[10560]: Error 1 sending the modular data for cpu_user#012
May 19 16:14:38 gangliamon gmond[10560]: Error 1 sending the modular data for cpu_system#012
May 19 16:14:38 gangliamon gmond[10560]: Error 1 sending the modular data for cpu_idle#012
May 19 16:14:38 gangliamon gmond[10560]: Error 1 sending the modular data for cpu_nice#012
May 19 16:14:38 gangliamon gmond[10560]: Error 1 sending the modular data for cpu_aidle#012
May 19 16:14:38 gangliamon gmond[10560]: Error 1 sending the modular data for cpu_wio#012
May 19 16:14:38 gangliamon gmond[10560]: Error 1 sending the modular data for cpu_steal#012
May 19 16:14:39 gangliamon gmond[22300]: Error 1 sending the modular data for mem_free#012
May 19 16:14:39 gangliamon gmond[22300]: Error 1 sending the modular data for mem_shared#012
May 19 16:14:39 gangliamon gmond[22300]: Error 1 sending the modular data for mem_buffers#012
May 19 16:14:39 gangliamon gmond[22300]: Error 1 sending the modular data for mem_cached#012
May 19 16:14:39 gangliamon gmond[22300]: Error 1 sending the modular data for swap_free#012
May 19 16:14:39 gangliamon gmond[22300]: Error 1 sending the modular data for bytes_out#012
May 19 16:14:39 gangliamon gmond[22300]: Error 1 sending the modular data for bytes_in#012
May 19 16:14:39 gangliamon gmond[22300]: Error 1 sending the modular data for pkts_in#012
May 19 16:14:39 gangliamon gmond[22300]: Error 1 sending the modular data for pkts_out#012
May 19 16:14:40 gangliamon gmond[10560]: Error 1 sending the modular data for heartbeat#012
May 19 16:14:42 gangliamon gmond[22304]: Error 1 sending the modular data for disk_free#012
....
From other discussions, such as
https://www.mail-archive.com/ganglia-general@lists.sourceforge.net/msg06602.html
I understand this is a known, unsolved issue.
Does anyone know how to solve this?
Thanks
Best regards,
Cristóvão José Domingues Cordeiro
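The "Error 1 sending the modular data" lines quoted above come from several gmond PIDs, so a quick tally per process and per metric can show how widespread they are. The following is a minimal sketch, assuming each entry sits on a single line as it would in /var/log/messages; the regular expression and the function name are illustrative and not part of Ganglia.

```python
import re
from collections import Counter

# Matches entries such as:
# May 19 16:14:36 gangliamon gmond[22292]: Error 1 sending the modular data for cpu_user#012
LINE_RE = re.compile(r"gmond\[(\d+)\]: Error (\d+) sending the modular data for (\w+)")


def tally_send_errors(lines):
    """Count 'sending the modular data' errors per gmond PID and per metric."""
    by_pid, by_metric = Counter(), Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a send-error entry
        pid, _error_code, metric = match.groups()
        # The trailing "#012" is syslog's escaped newline; \w+ stops before "#".
        by_pid[pid] += 1
        by_metric[metric] += 1
    return by_pid, by_metric


# A few entries copied from the excerpt in this message.
sample = [
    "May 19 16:14:36 gangliamon gmond[22292]: Error 1 sending the modular data for pkts_out#012",
    "May 19 16:14:36 gangliamon gmond[22292]: Error 1 sending the modular data for heartbeat#012",
    "May 19 16:14:37 gangliamon gmond[22304]: Error 1 sending the modular data for heartbeat#012",
    "May 19 16:14:38 gangliamon gmond[10560]: Error 1 sending the modular data for cpu_user#012",
    "....",
]

by_pid, by_metric = tally_send_errors(sample)
print(dict(by_pid))
print(dict(by_metric))
```

To run it against the real log, pass an open file instead of the sample list, for example `tally_send_errors(open("/var/log/messages"))`.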
_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general