We have a setup with two unicast channels, and we recently ran into an
issue where we lost a bunch of metrics submitted with gmetric because a
DNS problem made one of the two channels unreachable. I traced this
back to Ganglia_udp_send_channels_create(...) in libgmond.c, where the
code calls exit(1) as soon as it fails to create a socket (lines
323-344). I'm not sure whether this is intended, but it certainly hurts
redundant setups like ours, where we'd much prefer that some of the
channels keep getting data rather than all or nothing.

I'd like to propose changing the behavior so that the error_msg() +
exit() is replaced with a debug_msg() call, and then, outside the loop
and before the return, we check whether any channel was created at all
and fail there if none was. I would have attached a patch, but I'm not
familiar with the APR API and was unsure of the best way to deal with
the send_channels array, especially given that the code seems to
preallocate space for num_udp_send_channels (line 291).
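
To make the proposal concrete, here is a rough sketch of the control
flow I have in mind. It is not a patch against libgmond.c and it does
not use APR: the types channel_conf and channel, the "resolvable" flag
and both functions are made up for illustration, and plain fprintf()
stands in for debug_msg()/error_msg(). The only point is the shape of
the loop: skip a channel that cannot be created, and fail only if none
could be created.

```c
#include <stdio.h>

/* Hypothetical stand-in for one configured udp_send_channel. */
typedef struct {
    const char *host;
    int port;
    int resolvable;   /* simulates whether DNS lookup would succeed */
} channel_conf;

/* Hypothetical stand-in for a created send socket. */
typedef struct {
    const char *host;
    int port;
} channel;

/* Placeholder for the real socket creation; returns 0 on success. */
static int create_channel(const channel_conf *conf, channel *out)
{
    if (!conf->resolvable)
        return -1;
    out->host = conf->host;
    out->port = conf->port;
    return 0;
}

/*
 * Try every configured channel. A failure is logged and skipped
 * instead of aborting. "out" must have room for n entries; the
 * channels that were created are packed at the front of it.
 * Returns the number of channels created, or -1 if there are none.
 */
int create_send_channels(const channel_conf *confs, int n, channel *out)
{
    int created = 0;

    for (int i = 0; i < n; i++) {
        if (create_channel(&confs[i], &out[created]) != 0) {
            /* was: error_msg() + exit(1) */
            fprintf(stderr, "debug: unable to create send channel %s:%d\n",
                    confs[i].host, confs[i].port);
            continue;
        }
        created++;
    }

    if (created == 0) {
        fprintf(stderr, "error: no send channel could be created\n");
        return -1;   /* the caller decides whether to exit */
    }
    return created;
}
```

Since the array is still sized for every configured channel, the
existing preallocation would not need to change; the caller would only
have to use the returned count instead of num_udp_send_channels when
sending.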

Thanks for your input,


"Behind every great man there's a great backpack" - B.

Ganglia-developers mailing list