It's not so much that I "fear" multicast; I just see no need for it.
Admittedly, my setup is not necessarily like others', and therefore my
preferences don't necessarily apply to other setups. And I wouldn't claim that
my setup is the "best" setup by any means. But if you are interested, I will
try to explain my personal reasoning for not using multicast. (I have replied
to parts of your previous email below.)
Like Rick said below, the packets are small (60 bytes) and each node sends
only about 20/minute on average. Do the math, even with a large cluster you
are talking about a tiny fraction of the network's capacity.
This is not the whole picture (more on this below). But even if it is a small
amount, it's extra traffic that I do not need. At best, it is unnoticeable and
serves no purpose. (So why keep it?) At worst, the UDP traffic interferes with
my NFS traffic and results in extra RPC retransmits.
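For anyone who wants to do that math themselves, here is a rough Python sketch
of the arithmetic. The 400-node count and the 100 Mbit link speed are
assumptions for illustration, not measurements from our clusters:

# Back-of-envelope load from the quoted numbers: 60-byte metric packets,
# roughly 20 per node per minute, aggregated across the whole cluster.
def multicast_load(nodes, pkt_bytes=60, pkts_per_min=20, link_mbps=100):
    pkts_per_sec = nodes * pkts_per_min / 60.0
    bits_per_sec = pkts_per_sec * pkt_bytes * 8
    pct_of_link = 100.0 * bits_per_sec / (link_mbps * 1e6)
    return pkts_per_sec, bits_per_sec, pct_of_link

pps, bps, pct = multicast_load(nodes=400)          # hypothetical cluster size
print("%.1f pkts/s, %.0f bits/s, %.3f%% of a 100 Mbit link" % (pps, bps, pct))

That confirms the point: the raw bandwidth really is small. My objection is the
interference and the lack of benefit, not the volume.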
For us, redundancy is important because we have many clusters and do not want
to single out one node in each cluster to make it a special monitoring node
that needs to be up 24/7 in order to collect ganglia monitoring data.
I agree, which is why our monitoring node is not part of any cluster. It is a
separate machine that provides other monitoring services, so it is expected to
be up 24x7 anyway. Besides, I can easily get redundancy by sending unicast to
two or three hosts. I don't need 400 copies of my cluster's data. And even if
I did have that many copies, when I modify gmetad.conf, I certainly don't plan
to enter 400 host names as alternative sources for gmond info.
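To be concrete about what I mean by unicast redundancy: each node just sends
the same metric datagram to a short, fixed list of collectors. This Python
sketch is only illustrative; the host names are made up, the payload is plain
text rather than Ganglia's real XDR encoding, and 8649 is simply gmond's
default port:

import socket

COLLECTORS = ["mon1.example.org", "mon2.example.org"]   # hypothetical hosts
PORT = 8649                                             # default gmond port

def send_metric(payload):
    # Send the identical UDP datagram to every collector in the list.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for host in COLLECTORS:
        s.sendto(payload, (host, PORT))
    s.close()

send_metric(b"example_metric 3.2")

Two or three entries in that list give me all the redundancy I care about, and
gmetad.conf only ever needs those same two or three names as data sources.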
With multicast's redundancy, we won't have to worry about the possibility
that the one special node out of our 400+ node cluster has crashed taking
ganglia down with it. If the node gmetad is currently getting data from has a
problem, it will transparently switch to another.
I don't worry about the redundancy issue either, but multicast also fails to
address the availability of the gmetad process. Ganglia isn't very useful if
the gmetad server goes down and stops recording the metrics to disk. Since I pretty
much need to make sure that gmetad is running 24x7 anyway, why not let it
collect gmond info? The server I run gmetad on is better equipped for
availability than any of the compute nodes in the cluster.
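Keeping an eye on gmetad itself is cheap, too; a trivial probe of its XML port
tells you whether it is up and answering. In this Python sketch the monitoring
host name is made up, and 8651 is gmetad's default xml_port:

import socket

def gmetad_alive(host="monitor.example.org", port=8651, timeout=5):
    # gmetad dumps its XML as soon as a client connects, so any data means up.
    try:
        s = socket.create_connection((host, port), timeout)
        data = s.recv(4096)
        s.close()
        return len(data) > 0
    except socket.error:
        return False

print("gmetad up" if gmetad_alive() else "gmetad DOWN")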
Our largest cluster has 427 linux nodes with a basic gmond using all of
the default metrics. I ran a tcpdump for about 10 minutes to capture
all of the multicast data in that cluster and here is what I found (*):
5,256,772 bytes collected in 628.45 seconds from 87,582 packets.
In other words, the average rate was:
139.362 packets/second
8364.668 bytes/second
0.066 Mbits/second
Thanks for supplying this data. It is a good baseline for people to use when
planning their Ganglia deployments. However, I add a lot more metrics than the
default. I estimate that I get about 500 packets/second, which translates to
about 30,000 bytes/sec. Plus, this "background noise" scales linearly with the
number of nodes in the cluster.
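Here is the same arithmetic in a few lines of Python, for anyone who wants to
redo it. The capture numbers are the ones quoted above, while the 500
packets/sec and 60 bytes/packet for my setup are estimates rather than
measurements:

# Measured capture from the 427-node cluster quoted above.
cap_bytes, cap_secs, cap_pkts = 5256772, 628.45, 87582
print("%.3f packets/second" % (cap_pkts / cap_secs))
print("%.3f bytes/second" % (cap_bytes / cap_secs))
print("%.3f Mbits/second" % (cap_bytes * 8 / cap_secs / 1e6))  # decimal Mbit

# My estimate with many extra metrics enabled.
est_pps, est_pkt_bytes = 500, 60
print("estimated %d bytes/second" % (est_pps * est_pkt_bytes))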
I should also mention that a 60-byte metric packet actually consumes more than
60 bytes in the Linux socket receive buffer. On my system, I ran gmond and then
sent it the STOP signal. I then used gmetric to submit a single metric by hand.
By looking at the rx_queue column in /proc/net/udp, I saw that the one metric
actually took up 304 bytes. The buffer fills up pretty quickly, and UDP packets
start dropping when the number of metrics gets large.
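If you want to check the buffer accounting on your own system, the rx_queue
numbers can be pulled straight out of /proc/net/udp with a few lines of Python
(8649 is gmond's default UDP port; the per-packet overhead you see will depend
on your kernel):

GMOND_PORT = 8649   # default gmond UDP port

# In /proc/net/udp, field 1 is local addr:port and field 4 is
# tx_queue:rx_queue, both in hex.
for line in open("/proc/net/udp").readlines()[1:]:
    fields = line.split()
    local_port = int(fields[1].split(":")[1], 16)
    rx_queue = int(fields[4].split(":")[1], 16)
    if local_port == GMOND_PORT:
        print("port %d rx_queue = %d bytes" % (local_port, rx_queue))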
Then you also have to consider the bandwidth used by gmetad when it contacts
gmond. I tested having gmetad contact one of my cluster nodes to get the
cluster info. My test showed that it took about 0.7 seconds to retrieve
about 2.5 MB of data. That's 28.5 Mbps of the node's bandwidth used every 15
seconds. (In a previous post I also mentioned that I saw increased UDP drop
rates whenever gmetad contacted a gmond.)
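The pull test is also easy to reproduce: connect to a gmond's TCP port, time
how long the XML dump takes, and divide. In the Python sketch below the node
name is hypothetical; 8649 is the port gmond answers XML requests on by
default:

import socket, time

def time_pull(host="node001.example.org", port=8649):
    # Read the full cluster XML dump and report the effective transfer rate.
    start = time.time()
    s = socket.create_connection((host, port), 10)
    total = 0
    while True:
        chunk = s.recv(65536)
        if not chunk:
            break
        total += len(chunk)
    s.close()
    elapsed = time.time() - start
    print("%d bytes in %.2f s = %.1f Mbit/s"
          % (total, elapsed, total * 8 / elapsed / 1e6))

time_pull()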
So those are my reasons for choosing unicast. Plenty of potential for
problems, very little benefit. Other people's mileage may vary.
-- Rick
--------------------------
Rick Mohr
Systems Developer
Ohio Supercomputer Center