>>> On 11/29/2008 at 11:54 AM, in message <[EMAIL PROTECTED]>,
Kostas Georgiou <[EMAIL PROTECTED]> wrote:
> On Tue, Nov 04, 2008 at 10:02:32AM -0700, Brad Nicholes wrote:
> 
>> >>> On 11/3/2008 at  5:27 PM, in message <[EMAIL PROTECTED]>,
>> Kostas Georgiou <[EMAIL PROTECTED]> wrote: 
>> > On Mon, Nov 03, 2008 at 11:46:52PM +0000, Kostas Georgiou wrote:
>> > 
>> >> On Mon, Nov 03, 2008 at 01:55:22PM -0700, Brad Nicholes wrote:
>> >> > 
>> >> > If a timeout is set, then is the resulting XML output still good or did
>> >> > we lose something because of the timeout?
>> >> 
>> >> No, it seems to be working fine. I am testing with:
>> > 
>> > Actually I was wrong; there was enough data in the socket buffers to
>> > confuse me. The XML output is truncated in the slow reader :(
>> > 
>> 
>> Attached is a patch against trunk which implements a lingering close.
>> I am not sure if this will solve the problem but Apache does a similar
>> thing to make sure that both sides get a chance to complete the
>> conversation before closing the socket.  Apply this patch, let it run
>> for a while and let's see if this solves the problem.
> 
> I just got a non responsive gmond and after looking at the network
> traces it seems that:
> 
> gmond tries to write to what it thinks is a still-alive connection,
> so it is blocked there.
> On the gmetad side there is no such connection, so the firewall replies
> with "ICMP host foo unreachable - admin prohibited". Unfortunately this
> doesn't cause the connection to be dropped on the gmond side (will
> anything other than RST work at this point?) and gmond keeps trying.
> 
> At this point it's too late to tell why the connection wasn't closed
> properly (was the FIN packet lost somehow?), but using a short keepalive
> setting on the gmond side cannot hurt and will help in cases like this
> one.
> 
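
For reference, here is a rough sketch of what a short keepalive setting
looks like at the socket level.  This is Python for brevity, not gmond code
(gmond is C), and the function name and the 30/10/3 values are only
assumptions for illustration.  The TCP_KEEP* options are platform-specific
(they exist on Linux), so the sketch skips any that are missing.

```python
import socket


def enable_short_keepalive(sock, idle=30, interval=10, probes=3):
    """Turn on TCP keepalive with short timers (hypothetical helper).

    idle:     seconds of silence before the first probe is sent
    interval: seconds between unanswered probes
    probes:   unanswered probes before the connection is declared dead
    """
    # SO_KEEPALIVE itself is portable.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # The per-socket timers are not portable, so only set the ones
    # that this platform's socket module actually exposes.
    for name, value in (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", probes),
    ):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock
```

With settings like these, a peer that has silently gone away should be
noticed after roughly idle + interval * probes seconds instead of the much
longer system default.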

The real question is who initiated the close, and why.  This would be much
easier to debug if we could figure out how to reproduce the problem
reliably.  Since I am not able to reproduce it, I am wondering if it might
have something to do with the OS or the version of the socket library the OS
is using.  We have heard of this happening on CentOS and Solaris.  Is there
anything in common between the socket libraries on these two OSes?  I guess
the band-aid would be to put a timeout on gmond's write function and abort
the write when it expires.
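
To make the timeout-and-abort idea concrete, here is a minimal sketch in
Python.  It is not the gmond implementation (gmond is C); the function name
and the 10-second default are assumptions chosen only for illustration.  The
point is that a write to a peer that has stopped reading gives up after a
bounded time and the connection is dropped, rather than blocking forever.

```python
import socket


def send_all_with_timeout(sock, data, timeout=10.0):
    """Send all of data, or abort and close the socket (hypothetical helper).

    Returns True if everything was sent, False if the write timed out or
    failed for another socket-level reason.
    """
    # With a timeout set, sendall() raises socket.timeout instead of
    # blocking indefinitely.  In Python 3.5+ the timeout bounds the total
    # duration of the sendall() call.
    sock.settimeout(timeout)
    try:
        sock.sendall(data)
        return True
    except OSError:
        # socket.timeout is a subclass of OSError, so this also covers
        # errors such as a reset connection.  Abort: drop the connection
        # so the caller is not stuck on a dead peer.
        sock.close()
        return False
```

A caller would treat a False return as "this client is gone" and move on to
the next connection.  Note that after a failed sendall() there is no way to
know how much data the peer received, so closing is the only safe option.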

Brad

_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general