Hello,
I am testing a deployment of Ganglia to monitor our servers. I have
several clusters. Most of them are small, but I do have two large ones
with over 150 machines to monitor. The issue is that I do not receive
all of the monitoring data from the machines in the large clusters:
ganglia-web reports the clusters as down, and in Graphite and in the
RRD files I see very few data points for machines in these large
clusters. By my calculations, about 2/3 of the data is lost. I am using
gmond in unicast mode. Here are examples of my configs:
Example of the config on a monitored server:
globals {
daemonize = yes
setuid = yes
user = ganglia
debug_level = 0
max_udp_msg_len = 1472
mute = no
deaf = no
host_dmax = 86400 /*secs */
cleanup_threshold = 300 /*secs */
gexec = no
send_metadata_interval = 60
override_hostname = "<<!! HUMAN READABLE HOSTNAME !!>>"
}
cluster {
name = "Example large cluster"
owner = "unspecified"
latlong = "unspecified"
url = "unspecified"
}
udp_send_channel {
host = ip.addr.of.master
port = 8654
ttl = 1
}
udp_recv_channel {
port = 8649
}
tcp_accept_channel {
port = 8649
}
# Metric conf follows ...
Example of the gmond collector config on the master node:
globals {
daemonize = yes
setuid = yes
user = ganglia
debug_level = 0
max_udp_msg_len = 1472
mute = no
deaf = no
allow_extra_data = yes
host_dmax = 86400 /*secs */
cleanup_threshold = 300 /*secs */
gexec = no
send_metadata_interval = 120
}
cluster {
name = "Example large cluster"
owner = "unspecified"
latlong = "unspecified"
url = "unspecified"
}
udp_send_channel {
host = localhost
port = 8654
ttl = 1
}
udp_recv_channel {
port = 8654
}
tcp_accept_channel {
port = 8654
}
And here is an example of my gmetad.conf:
data_source ...
data_source "Example large cluster" localhost:8654
data_source ...
server_threads 16
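One way to see how many hosts the collector actually knows about is to
read the XML that gmond serves on its tcp_accept_channel (port 8654 in
the collector config above) and count the HOST elements. The following
is a minimal sketch, not a definitive tool. The function names and the
60-second staleness threshold are assumptions made for illustration, and
it assumes the TN attribute on each HOST is the number of seconds since
that host was last heard from.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host="localhost", port=8654, timeout=10.0):
    """Read the XML dump gmond writes to a TCP client before closing."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the connection when the dump ends
                break
            chunks.append(data)
    return b"".join(chunks)


def summarize_hosts(xml_data, stale_after=60):
    """Return (total_hosts, names_of_stale_hosts) from a gmond XML dump.

    A host counts as stale when its TN attribute (assumed here to be
    seconds since the last report) exceeds stale_after, a hypothetical
    threshold chosen for this sketch.
    """
    root = ET.fromstring(xml_data)
    total = 0
    stale = []
    for host in root.iter("HOST"):
        total += 1
        if int(host.get("TN", "0")) > stale_after:
            stale.append(host.get("NAME"))
    return total, stale
```

Calling summarize_hosts(fetch_gmond_xml()) on the master node would then
show whether all 150+ hosts are present in the collector and how many of
them have gone quiet, which helps separate loss between the nodes and
the collector from loss between the collector and gmetad.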
In the logs I see a lot of "Error 1 sending the modular data data_source"
messages. I searched various threads but did not find anything helpful.
I checked the network settings and tuned UDP accordingly. The server
does not drop packets, and I also checked on the switch: there are no
drops or losses. Load rarely goes above 1.5, and this is a 16-core
server with 128 GB of RAM. I ran the collector and gmetad in debug mode
and everything seemed fine.
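For anyone who wants to repeat the UDP drop check on the collector, here
is a minimal sketch, assuming a Linux host where the kernel exposes UDP
counters in /proc/net/snmp. The function names and the choice of the
InErrors and RcvbufErrors counters are assumptions made for
illustration, not necessarily how the check described above was done.

```python
def parse_udp_counters(snmp_text):
    """Parse the 'Udp:' header/value line pair from /proc/net/snmp text."""
    udp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Udp:")]
    if len(udp_lines) < 2:
        raise ValueError("no Udp header/value pair found")
    # First 'Udp:' line holds the counter names, the second the values.
    names, values = udp_lines[0][1:], udp_lines[1][1:]
    return dict(zip(names, (int(v) for v in values)))


def udp_drop_delta(before, after):
    """Return how much the receive-side error counters grew."""
    keys = ("InErrors", "RcvbufErrors")
    return {k: after.get(k, 0) - before.get(k, 0) for k in keys}


def read_udp_counters(path="/proc/net/snmp"):
    """Read and parse the current UDP counters (Linux only)."""
    with open(path) as fh:
        return parse_udp_counters(fh.read())
```

Taking one reading with read_udp_counters(), waiting a minute, taking a
second reading and passing both to udp_drop_delta() shows whether the
kernel discarded any datagrams in that interval. A non-zero
RcvbufErrors delta would point at the socket receive buffer rather than
the network.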
I am really lost, so I'll be grateful for any help.