Thank you, Rick and Martin, for the quick responses!
I already tried the configuration that Rick suggested, but it doesn't work. With that
configuration I see only one node per data_source (the last one). One week
ago, Michael Chang helped me solve a problem with this configuration:
data_source "AEGIS01-PHY-SCL1" 147.91.83.201
data_source "AEGIS01-PHY-SCL2" 147.91.83.202
data_source "AEGIS01-PHY-SCL3" 147.91.83.203
.......
If I understand correctly, Martin suggests that I need two machines with gmetad (one
for each data_source). Right now I have gmetad only on the server with the web
frontend (se.phy.bg.ac.yu).
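If that is the case, I guess my gmetad.conf would be reduced to just two data_source
lines, one per gmond cluster, each pointing at a machine that can see its whole
group, something like this (using the hostnames from Rick's example):
data_source "AEGIS01-PHY-SCL Core Services" ce.phy.bg.ac.yu
data_source "AEGIS01-PHY-SCL" wn01.phy.bg.ac.yu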
It is true that the machines in the two groups do not see each other, not even
within the same group. I tried:
[EMAIL PROTECTED] root]# telnet localhost 8649 | grep grid
Connection closed by foreign host.
[EMAIL PROTECTED] root]#
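When it works, I suppose the same kind of check should list every host in the
group, i.e. something like:
telnet localhost 8649 | grep "HOST NAME"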
Both machines, ce and grid, are in the same data_source and have the same gmond.conf
files. As you said, Martin, I have found the problem, but I haven't found a solution
for it. :(
You can find my gmond and gmetad conf files in the attachment.
-----
Branimir Ackovic
E-mail: [EMAIL PROTECTED]
Web: http://scl.phy.bg.ac.yu/
Phone: +381 11 3160260, Ext. 152
Fax: +381 11 3162190
Scientific Computing Laboratory
Institute of Physics, Belgrade
Serbia and Montenegro
-----
>I believe the problem stems from the fact that the cluster names used on
>the data_source lines do not match the names defined in the gmond.conf
>files. You may want to try something like this:
>data_source "AEGIS01-PHY-SCL Core Services" ce.phy.bg.ac.yu \
> se.phy.bg.ac.yu grid.phy.bg.ac.yu rb.phy.bg.ac.yu
>data_source "AEGIS01-PHY-SCL" wn01.phy.bg.ac.yu wn02.phy.bg.ac.yu ...
>-- Rick
> Anyway, could you please post the two different "gmond.conf" files and
>the "gmetad.conf" file?
> I have the impression that the machines in the two groups do not see
>each other. At least one machine in each group should see the metrics
>of its partner machines. In "gmetad.conf" you would use that machine as
>data source. Basically you should only have two data sources in your
>gmetad.conf
> Simple test. Log into one of the servers and do a "telnet localhost
>gmond-port". It should show you the data of all hosts in that group
>(grep for "HOST NAME"). If it only shows its own data you have found
>the problem.
>Cheers
>Martin
# This is an example of a Ganglia Meta Daemon configuration file
# http://ganglia.sourceforge.net/
#
# $Id: gmetad.conf,v 1.17 2005/03/15 18:15:05 massie Exp $
#
#-------------------------------------------------------------------------------
# Setting the debug_level to 1 will keep daemon in the foreground and
# show only error messages. Setting this value higher than 1 will make
# gmetad output debugging information and stay in the foreground.
# default: 0
# debug_level 10
#
#-------------------------------------------------------------------------------
# What to monitor. The most important section of this file.
#
# The data_source tag specifies either a cluster or a grid to
# monitor. If we detect the source is a cluster, we will maintain a complete
# set of RRD databases for it, which can be used to create historical
# graphs of the metrics. If the source is a grid (it comes from another gmetad),
# we will only maintain summary RRDs for it.
#
# Format:
# data_source "my cluster" [polling interval] address1:port addreses2:port ...
#
# The keyword 'data_source' must immediately be followed by a unique
# string which identifies the source, then an optional polling interval in
# seconds. The source will be polled at this interval on average.
# If the polling interval is omitted, 15sec is assumed.
#
# A list of machines which service the data source follows, in the
# format ip:port, or name:port. If a port is not specified then 8649
# (the default gmond port) is assumed.
# default: There is no default value
#
# data_source "my cluster" 10 localhost my.machine.edu:8649 1.2.3.5:8655
# data_source "my grid" 50 1.3.4.7:8655 grid.org:8651 grid-backup.org:8651
# data_source "another source" 1.3.4.7:8655 1.3.4.8
data_source "AEGIS01-PHY-SCL Core Services1" 147.91.83.217
data_source "AEGIS01-PHY-SCL Core Services2" 147.91.83.218
data_source "AEGIS01-PHY-SCL Core Services3" 147.91.83.219
data_source "AEGIS01-PHY-SCL Core Services4" 147.91.83.220
data_source "AEGIS01-PHY-SCL1" 147.91.83.201
data_source "AEGIS01-PHY-SCL2" 147.91.83.202
data_source "AEGIS01-PHY-SCL3" 147.91.83.203
data_source "AEGIS01-PHY-SCL4" 147.91.83.204
data_source "AEGIS01-PHY-SCL5" 147.91.83.205
data_source "AEGIS01-PHY-SCL6" 147.91.83.206
data_source "AEGIS01-PHY-SCL7" 147.91.83.207
data_source "AEGIS01-PHY-SCL8" 147.91.83.208
#
# Round-Robin Archives
# You can specify custom Round-Robin archives here (defaults are listed below)
#
# RRAs "RRA:AVERAGE:0.5:1:240" "RRA:AVERAGE:0.5:24:240"
"RRA:AVERAGE:0.5:168:240" "RRA:AVERAGE:0.5:672:240" \
# "RRA:AVERAGE:0.5:5760:370"
#
#
#-------------------------------------------------------------------------------
# Scalability mode. If on, we summarize over downstream grids, and respect
# authority tags. If off, we take on 2.5.0-era behavior: we do not wrap our output
# in <GRID></GRID> tags, we ignore all <GRID> tags we see, and always assume
# we are the "authority" on data source feeds. This approach does not scale to
# large groups of clusters, but is provided for backwards compatibility.
# default: on
# scalable off
#
#-------------------------------------------------------------------------------
# The name of this Grid. All the data sources above will be wrapped in a GRID
# tag with this name.
# default: Unspecified
gridname "AEGIS01 PHY SCL"
#
#-------------------------------------------------------------------------------
# The authority URL for this grid. Used by other gmetads to locate graphs
# for our data sources. Generally points to a ganglia/
# website on this machine.
# default: "http://hostname/ganglia/",
# where hostname is the name of this machine, as defined by gethostname().
# authority "http://mycluster.org/newprefix/"
#
#-------------------------------------------------------------------------------
# List of machines this gmetad will share XML with. Localhost
# is always trusted.
# default: There is no default value
trusted_hosts 127.0.0.1
#
#-------------------------------------------------------------------------------
# If you want any host which connects to the gmetad XML to receive
# data, then set this value to "on"
# default: off
# all_trusted on
#
#-------------------------------------------------------------------------------
# If you don't want gmetad to setuid then set this to off
# default: on
# setuid off
#
#-------------------------------------------------------------------------------
# User gmetad will setuid to (defaults to "nobody")
# default: "nobody"
# setuid_username "nobody"
#
#-------------------------------------------------------------------------------
# The port gmetad will answer requests for XML
# default: 8651
# xml_port 8651
#
#-------------------------------------------------------------------------------
# The port gmetad will answer queries for XML. This facility allows
# simple subtree and summation views of the XML tree.
# default: 8652
# interactive_port 8652
#
#-------------------------------------------------------------------------------
# The number of threads answering XML requests
# default: 4
# server_threads 10
#
#-------------------------------------------------------------------------------
# Where gmetad stores its round-robin databases
# default: "/var/lib/ganglia/rrds"
# rrd_rootdir "/some/other/place"
/* This configuration is as close to 2.5.x default behavior as possible
The values closely match ./gmond/metric.h definitions in 2.5.x */
globals {
setuid = yes
user = nobody
cleanup_threshold = 300 /*secs */
}
/* If a cluster attribute is specified, then all gmond hosts are wrapped inside
* of a <CLUSTER> tag. If you do not specify a cluster tag, then all <HOSTS> will
* NOT be wrapped inside of a <CLUSTER> tag. */
cluster {
name = "AEGIS01-PHY-SCL"
}
/* Feel free to specify as many udp_send_channels as you like. Gmond
used to only support having a single channel */
udp_send_channel {
mcast_join = 239.2.11.71
port = 8649
}
/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
mcast_join = 239.2.11.71
port = 8649
bind = 239.2.11.71
}
/* You can specify as many tcp_accept_channels as you like to share
an xml description of the state of the cluster */
tcp_accept_channel {
port = 8649
}
/* The old internal 2.5.x metric array has been replaced by the following
collection_group directives. What follows is the default behavior for
collecting and sending metrics that is as close to 2.5.x behavior as
possible. */
/* This collection group will cause a heartbeat (or beacon) to be sent every
20 seconds. In the heartbeat is the GMOND_STARTED data which expresses
the age of the running gmond. */
collection_group {
collect_once = yes
time_threshold = 20
metric {
name = "heartbeat"
}
}
/* This collection group will send general info about this host every 1200 secs.
This information doesn't change between reboots and is only collected once. */
collection_group {
collect_once = yes
time_threshold = 1200
metric {
name = "cpu_num"
}
metric {
name = "cpu_speed"
}
metric {
name = "mem_total"
}
/* Should this be here? Swap can be added/removed between reboots. */
metric {
name = "swap_total"
}
metric {
name = "boottime"
}
metric {
name = "machine_type"
}
metric {
name = "os_name"
}
metric {
name = "os_release"
}
metric {
name = "location"
}
}
/* This collection group will send the status of gexecd for this host every 300 secs */
/* Unlike 2.5.x the default behavior is to report gexecd OFF. */
collection_group {
collect_once = yes
time_threshold = 300
metric {
name = "gexec"
}
}
/* This collection group will collect the CPU status info every 20 secs.
The time threshold is set to 90 seconds. In honesty, this time_threshold could be
set significantly higher to reduce unnecessary network chatter. */
collection_group {
collect_every = 20
time_threshold = 90
/* CPU status */
metric {
name = "cpu_user"
value_threshold = "1.0"
}
metric {
name = "cpu_system"
value_threshold = "1.0"
}
metric {
name = "cpu_idle"
value_threshold = "5.0"
}
metric {
name = "cpu_nice"
value_threshold = "1.0"
}
metric {
name = "cpu_aidle"
value_threshold = "5.0"
}
metric {
name = "cpu_wio"
value_threshold = "1.0"
}
/* The next two metrics are optional if you want more detail...
... since they are accounted for in cpu_system.
metric {
name = "cpu_intr"
value_threshold = "1.0"
}
metric {
name = "cpu_sintr"
value_threshold = "1.0"
}
*/
}
collection_group {
collect_every = 20
time_threshold = 90
/* Load Averages */
metric {
name = "load_one"
value_threshold = "1.0"
}
metric {
name = "load_five"
value_threshold = "1.0"
}
metric {
name = "load_fifteen"
value_threshold = "1.0"
}
}
/* This group collects the number of running and total processes */
collection_group {
collect_every = 80
time_threshold = 950
metric {
name = "proc_run"
value_threshold = "1.0"
}
metric {
name = "proc_total"
value_threshold = "1.0"
}
}
/* This collection group grabs the volatile memory metrics every 40 secs and
sends them at least every 180 secs. This time_threshold can be increased
significantly to reduce unneeded network traffic. */
collection_group {
collect_every = 40
time_threshold = 180
metric {
name = "mem_free"
value_threshold = "1024.0"
}
metric {
name = "mem_shared"
value_threshold = "1024.0"
}
metric {
name = "mem_buffers"
value_threshold = "1024.0"
}
metric {
name = "mem_cached"
value_threshold = "1024.0"
}
metric {
name = "swap_free"
value_threshold = "1024.0"
}
}
collection_group {
collect_every = 40
time_threshold = 300
metric {
name = "bytes_out"
value_threshold = 4096
}
metric {
name = "bytes_in"
value_threshold = 4096
}
metric {
name = "pkts_in"
value_threshold = 256
}
metric {
name = "pkts_out"
value_threshold = 256
}
}
/* Different than 2.5.x default since the old config made no sense */
collection_group {
collect_every = 1800
time_threshold = 3600
metric {
name = "disk_total"
value_threshold = 1.0
}
}
collection_group {
collect_every = 40
time_threshold = 180
metric {
name = "disk_free"
value_threshold = 1.0
}
metric {
name = "part_max_used"
value_threshold = 1.0
}
}
/* This configuration is as close to 2.5.x default behavior as possible
The values closely match ./gmond/metric.h definitions in 2.5.x */
globals {
setuid = yes
user = nobody
cleanup_threshold = 300 /*secs */
}
/* If a cluster attribute is specified, then all gmond hosts are wrapped inside
* of a <CLUSTER> tag. If you do not specify a cluster tag, then all <HOSTS> will
* NOT be wrapped inside of a <CLUSTER> tag. */
cluster {
name = "AEGIS01-PHY-SCL Core Services"
}
/* Feel free to specify as many udp_send_channels as you like. Gmond
used to only support having a single channel */
udp_send_channel {
mcast_join = 239.2.11.71
port = 8649
}
/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
mcast_join = 239.2.11.71
port = 8649
bind = 239.2.11.71
}
/* You can specify as many tcp_accept_channels as you like to share
an xml description of the state of the cluster */
tcp_accept_channel {
port = 8649
}
/* The old internal 2.5.x metric array has been replaced by the following
collection_group directives. What follows is the default behavior for
collecting and sending metrics that is as close to 2.5.x behavior as
possible. */
/* This collection group will cause a heartbeat (or beacon) to be sent every
20 seconds. In the heartbeat is the GMOND_STARTED data which expresses
the age of the running gmond. */
collection_group {
collect_once = yes
time_threshold = 20
metric {
name = "heartbeat"
}
}
/* This collection group will send general info about this host every 1200 secs.
This information doesn't change between reboots and is only collected once. */
collection_group {
collect_once = yes
time_threshold = 1200
metric {
name = "cpu_num"
}
metric {
name = "cpu_speed"
}
metric {
name = "mem_total"
}
/* Should this be here? Swap can be added/removed between reboots. */
metric {
name = "swap_total"
}
metric {
name = "boottime"
}
metric {
name = "machine_type"
}
metric {
name = "os_name"
}
metric {
name = "os_release"
}
metric {
name = "location"
}
}
/* This collection group will send the status of gexecd for this host every 300 secs */
/* Unlike 2.5.x the default behavior is to report gexecd OFF. */
collection_group {
collect_once = yes
time_threshold = 300
metric {
name = "gexec"
}
}
/* This collection group will collect the CPU status info every 20 secs.
The time threshold is set to 90 seconds. In honesty, this time_threshold could be
set significantly higher to reduce unnecessary network chatter. */
collection_group {
collect_every = 20
time_threshold = 90
/* CPU status */
metric {
name = "cpu_user"
value_threshold = "1.0"
}
metric {
name = "cpu_system"
value_threshold = "1.0"
}
metric {
name = "cpu_idle"
value_threshold = "5.0"
}
metric {
name = "cpu_nice"
value_threshold = "1.0"
}
metric {
name = "cpu_aidle"
value_threshold = "5.0"
}
metric {
name = "cpu_wio"
value_threshold = "1.0"
}
/* The next two metrics are optional if you want more detail...
... since they are accounted for in cpu_system.
metric {
name = "cpu_intr"
value_threshold = "1.0"
}
metric {
name = "cpu_sintr"
value_threshold = "1.0"
}
*/
}
collection_group {
collect_every = 20
time_threshold = 90
/* Load Averages */
metric {
name = "load_one"
value_threshold = "1.0"
}
metric {
name = "load_five"
value_threshold = "1.0"
}
metric {
name = "load_fifteen"
value_threshold = "1.0"
}
}
/* This group collects the number of running and total processes */
collection_group {
collect_every = 80
time_threshold = 950
metric {
name = "proc_run"
value_threshold = "1.0"
}
metric {
name = "proc_total"
value_threshold = "1.0"
}
}
/* This collection group grabs the volatile memory metrics every 40 secs and
sends them at least every 180 secs. This time_threshold can be increased
significantly to reduce unneeded network traffic. */
collection_group {
collect_every = 40
time_threshold = 180
metric {
name = "mem_free"
value_threshold = "1024.0"
}
metric {
name = "mem_shared"
value_threshold = "1024.0"
}
metric {
name = "mem_buffers"
value_threshold = "1024.0"
}
metric {
name = "mem_cached"
value_threshold = "1024.0"
}
metric {
name = "swap_free"
value_threshold = "1024.0"
}
}
collection_group {
collect_every = 40
time_threshold = 300
metric {
name = "bytes_out"
value_threshold = 4096
}
metric {
name = "bytes_in"
value_threshold = 4096
}
metric {
name = "pkts_in"
value_threshold = 256
}
metric {
name = "pkts_out"
value_threshold = 256
}
}
/* Different than 2.5.x default since the old config made no sense */
collection_group {
collect_every = 1800
time_threshold = 3600
metric {
name = "disk_total"
value_threshold = 1.0
}
}
collection_group {
collect_every = 40
time_threshold = 180
metric {
name = "disk_free"
value_threshold = 1.0
}
metric {
name = "part_max_used"
value_threshold = 1.0
}
}