We had ganglia deployed in the past but stopped using it because of too much
network traffic. I finally have the time to look at ganglia and other options
again, and I have been reading the archives. I've seen plenty of messages both
for and against multicast, but I haven't been able to get unicast to work in
my setup.
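For reference, my understanding of a basic unicast setup (using the gmond 3.x
channel directives) is something like the sketch below: one node acts as the
collector and every other node sends to it instead of joining a multicast
group. The hostname and port here are just placeholders, not my real setup:

/* on every node: send metrics to a single collector host
   (replace "monhost.example.com" with the real collector) */
udp_send_channel {
  host = monhost.example.com
  port = 8649
}

/* on the collector node only: receive the unicast packets
   and answer gmetad's TCP polls */
udp_recv_channel {
  port = 8649
}
tcp_accept_channel {
  port = 8649
}

That is roughly what I was attempting without success, so if something obvious
is missing there I'd be glad to hear it.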
So rather than going the unicast route, I started thinking about simply
increasing the time/value thresholds on the metrics gmond collects and sends
out. In my environment we don't mind trading some lag in the data for less
network activity; our nodes are usually either running a job at 100% CPU or
not running a job at all.
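My understanding of how the three knobs interact (please correct me if I have
this wrong): gmond samples a metric every collect_every seconds, but only
sends it when it has changed by more than value_threshold since the last send,
or when time_threshold seconds have passed without a send. So a group like the
following should produce roughly one cpu_user update every 300 seconds unless
user CPU swings by more than 20 points between samples:

collection_group {
  collect_every = 20          /* sample the metric every 20 s */
  time_threshold = 300        /* send it at least once every 300 s */
  metric {
    name = "cpu_user"
    value_threshold = "20.0"  /* also send if the value changes by more than 20 */
  }
}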
I looked for more information on a suggested "limited" configuration but
wasn't able to find anything. Below is my current stab at a gmond.conf; I
would appreciate suggestions and advice on what to change. I increased most of
the time_thresholds and removed some of the value_thresholds (such as those
for memory) altogether.
> collection_group {
>   collect_once = yes
>   time_threshold = 60
>   metric {
>     name = "heartbeat"
>   }
> }
>
> /* This collection group will send general info about this host every
>    3600 secs. This information doesn't change between reboots and is
>    only collected once. */
> collection_group {
>   collect_once = yes
>   time_threshold = 3600
>   metric {
>     name = "cpu_num"
>   }
...snip...
>   metric {
>     name = "location"
>   }
> }
> collection_group {
>   collect_every = 20
>   time_threshold = 300
>   /* CPU status */
>   metric {
>     name = "cpu_user"
>     value_threshold = "20.0"
>   }
>   metric {
>     name = "cpu_system"
>     value_threshold = "20.0"
>   }
>   metric {
>     name = "cpu_idle"
>     value_threshold = "30.0"
>   }
>   metric {
>     name = "cpu_nice"
>     value_threshold = "20.0"
>   }
>   metric {
>     name = "cpu_aidle"
>     value_threshold = "20.0"
>   }
>   metric {
>     name = "cpu_wio"
>     value_threshold = "20.0"
>   }
> }
>
> collection_group {
>   collect_every = 20
>   time_threshold = 300
>   /* Load Averages */
>   metric {
>     name = "load_one"
>     value_threshold = "20.0"
>   }
>   metric {
>     name = "load_five"
>     value_threshold = "20.0"
>   }
>   metric {
>     name = "load_fifteen"
>     value_threshold = "20.0"
>   }
> }
>
> /* This group collects the number of running and total processes */
> collection_group {
>   collect_every = 80
>   time_threshold = 950
>   metric {
>     name = "proc_run"
>   }
>   metric {
>     name = "proc_total"
>   }
> }
>
> /* This collection group grabs the volatile memory metrics every 40 secs
>    and sends them at least every 720 secs. This time_threshold can be
>    increased significantly to reduce unneeded network traffic. */
> collection_group {
>   collect_every = 40
>   time_threshold = 720
>   metric {
>     name = "mem_free"
>   }
>   metric {
>     name = "mem_shared"
>   }
>   metric {
>     name = "mem_buffers"
>   }
>   metric {
>     name = "mem_cached"
>   }
>   metric {
>     name = "swap_free"
>   }
> }
>
> collection_group {
>   collect_every = 40
>   time_threshold = 300
>   metric {
>     name = "bytes_out"
>     value_threshold = 4096
>   }
>   metric {
>     name = "bytes_in"
>     value_threshold = 4096
>   }
>   metric {
>     name = "pkts_in"
>     value_threshold = 256
>   }
>   metric {
>     name = "pkts_out"
>     value_threshold = 256
>   }
> }
>
> /* Different than 2.5.x default since the old config made no sense */
> collection_group {
>   collect_every = 1800
>   time_threshold = 3600
>   metric {
>     name = "disk_total"
>     value_threshold = 20.0
>   }
> }
>
> collection_group {
>   collect_every = 40
>   time_threshold = 300
>   metric {
>     name = "disk_free"
>     value_threshold = 10.0
>   }
>   metric {
>     name = "part_max_used"
>     value_threshold = 10.0
>   }
> }
Thanks,
+R
--
Ryan Dionne
System Analyst
GeoCenter, Inc.
(281) 443-8150
16800 Greenspoint Park Dr., Suite 100-S
Houston, TX 77060
Celebrating our 25th Anniversary: 1980 - 2005