Hello,

I am testing a Ganglia deployment to monitor our servers. I have several clusters; most of them are small, but I do have two large ones, with over 150 machines to monitor. The problem is that I do not receive all monitoring data from the machines in the large clusters: ganglia-web reports the clusters as down, and in Graphite and in the RRD files I see very few data points for machines in these large clusters. By my calculations, 2/3 of the data is lost. I am using gmond in unicast mode. Here are examples of my configs:


Example of the config on a monitored server:

globals {
 daemonize = yes
 setuid = yes
 user = ganglia
 debug_level = 0
 max_udp_msg_len = 1472
 mute = no
 deaf = no
 host_dmax = 86400 /*secs */
 cleanup_threshold = 300 /*secs */
 gexec = no
 send_metadata_interval = 60
 override_hostname = "<<!! HUMAN READABLE HOSTNAME !!>>"
}
cluster {
 name = "Example large cluster"
 owner = "unspecified"
 latlong = "unspecified"
 url = "unspecified"
}
udp_send_channel {
 host = ip.addr.of.master
 port = 8654
 ttl = 1
}
udp_recv_channel {
 port = 8649
}
tcp_accept_channel {
 port = 8649
}
# Metric conf follows ...

Example of the gmond collector config on the master node:

globals {
 daemonize = yes
 setuid = yes
 user = ganglia
 debug_level = 0
 max_udp_msg_len = 1472
 mute = no
 deaf = no
 allow_extra_data = yes
 host_dmax = 86400 /*secs */
 cleanup_threshold = 300 /*secs */
 gexec = no
 send_metadata_interval = 120
}
cluster {
 name = "Example large cluster"
 owner = "unspecified"
 latlong = "unspecified"
 url = "unspecified"
}
udp_send_channel {
 host = localhost
 port = 8654
 ttl = 1
}
udp_recv_channel {
 port = 8654
}
tcp_accept_channel {
 port = 8654
}


And here is an example of my gmetad.conf:

data_source ...
data_source "Example large cluster" localhost:8654
data_source ...

server_threads 16


In the logs I see a lot of "Error 1 sending the modular data data_source" messages. I searched various threads but did not find anything helpful. I checked the network settings and tuned UDP accordingly: the server does not drop packets, and I also checked the switch, where there are no drops or losses. Load rarely goes above 1.5, and this is a 16-core server with 128 GB of RAM. I ran the collector and gmetad in debug mode and everything seemed fine.
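
To make the "how many hosts are actually reporting" question concrete, one option is to read the XML that the collector gmond serves on its tcp_accept_channel and count the HOST elements by freshness. The following is only a rough sketch under stated assumptions, not part of my setup: the function names and the 60-second freshness threshold are made up for illustration, and the default host and port are taken from the collector config above. It relies on the TN attribute of each HOST element, which gmond reports as the number of seconds since it last heard from that host.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host="localhost", port=8654, timeout=10.0):
    """Read the full XML dump that gmond writes on its tcp_accept_channel.

    host/port defaults mirror the collector config; adjust as needed.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the connection after the dump
                break
            chunks.append(data)
    return b"".join(chunks)


def summarize_hosts(xml_bytes, max_age=60):
    """Split HOST names into (fresh, stale) lists.

    A host counts as fresh when its TN attribute (seconds since gmond
    last heard from it) is at most max_age. The threshold is an
    assumption chosen for illustration.
    """
    root = ET.fromstring(xml_bytes)
    fresh, stale = [], []
    for host in root.iter("HOST"):
        tn = int(host.get("TN", "0"))
        (fresh if tn <= max_age else stale).append(host.get("NAME"))
    return fresh, stale
```

Calling summarize_hosts(fetch_gmond_xml()) on the master node and comparing the length of the fresh list with the expected number of machines would show whether the data is already missing at the collector gmond or is lost later, between gmond and gmetad.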

I am really lost, so I'll be grateful for any help.


_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general