Re: [Ganglia-developers] [RFC] two step gmond initialization

2009-12-13 Thread Daniel Pocock
Carlo Marcelo Arenas Belon wrote:
 On Fri, Dec 11, 2009 at 01:31:22PM -0600, Brooks Davis wrote:
   
 On Fri, Dec 11, 2009 at 04:56:51PM +, Carlo Marcelo Arenas Belon wrote:

 
 I presume the reason you haven't seen this show up on the APR list is
 that it probably makes more sense to ask on the apache httpd list for
 help understanding how apache is able to work around the leakiness of
 apr_poll. That also requires some reading of apache's code (which I
 am not that familiar with, nor really interested in).
   
 Looking at the prefork mpm, the pollsets are created and used only
 in child_main() and thus are created after the fork.  I suspect that
 changing the ganglia code to open all the sockets, but defer creation of
 the pollset until after fork is the right way to go.
 

 That is the way we did the initialization before r2025 so I guess that could
 explain why we weren't affected just like apache is not.
   
Not quite - pre-r2025, we did this:

a) detach
b) socket init
c) pollset init

Post r2025:

a) socket init
b) pollset init
c) detach

Brooks' solution:

a) socket init
b) detach
c) pollset init

I could accept Brooks' solution, because it means that after detaching 
gmond would only fail for something like out-of-memory, while any 
configuration failure, port in use, etc. would cause it to fail before 
detaching.

Basically, we would have to split the code in 
setup_listen_channels_pollset() into two functions, one that gets called 
before detaching, and one that is called after detaching.
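
The ordering Brooks suggests, and the split described above, could be
sketched roughly as follows. This is only an illustrative Python sketch,
not gmond's C/APR code: the function names open_listen_sockets(), detach(),
build_pollset() and start() are hypothetical, and Python's selectors module
stands in for the APR pollset.

```python
import os
import selectors
import socket


def open_listen_sockets(ports, host="127.0.0.1"):
    """Step a) bind every listening socket; raises OSError if a port is in use."""
    socks = []
    try:
        for port in ports:
            s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            socks.append(s)  # track it first so it is closed if bind() fails
            s.bind((host, port))
            s.setblocking(False)
    except OSError:
        for s in socks:
            s.close()
        raise
    return socks


def detach():
    """Step b) fork; the parent exits with status 0, the child carries on."""
    if os.fork() > 0:
        os._exit(0)
    os.setsid()


def build_pollset(socks):
    """Step c) create the pollset in the process that will actually poll."""
    sel = selectors.DefaultSelector()
    for s in socks:
        sel.register(s, selectors.EVENT_READ)
    return sel


def start(ports, daemonize=True):
    # Configuration errors and busy ports surface here, before detaching,
    # so the invoking shell or init script still sees the failure.
    socks = open_listen_sockets(ports)
    if daemonize:
        detach()
    # Created after the fork, so the pollset is never shared with the parent.
    return socks, build_pollset(socks)
```

With this shape, whatever can still go wrong after detach() should mostly
be limited to resource exhaustion while building the pollset.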

--
Return on Information:
Google Enterprise Search pays you back
Get the facts.
http://p.sf.net/sfu/google-dev2dev
___
Ganglia-developers mailing list
Ganglia-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-developers


Re: [Ganglia-developers] [RFC] two step gmond initialization

2009-12-13 Thread Carlo Marcelo Arenas Belon
On Sun, Dec 13, 2009 at 10:49:00AM +, Daniel Pocock wrote:
 Carlo Marcelo Arenas Belon wrote:
 On Fri, Dec 11, 2009 at 01:31:22PM -0600, Brooks Davis wrote:
   
 On Fri, Dec 11, 2009 at 04:56:51PM +, Carlo Marcelo Arenas Belon wrote:
 
 I presume the reason you haven't seen this show up on the APR list is
 that it probably makes more sense to ask on the apache httpd list for
 help understanding how apache is able to work around the leakiness of
 apr_poll. That also requires some reading of apache's code (which I
 am not that familiar with, nor really interested in).
   
 Looking at the prefork mpm, the pollsets are created and used only
 in child_main() and thus are created after the fork.  I suspect that
 changing the ganglia code to open all the sockets, but defer creation of
 the pollset until after fork is the right way to go.

 That is the way we did the initialization before r2025 so I guess that could
 explain why we weren't affected just like apache is not.
   
 Not quite - pre-r2025, we did this:

 a) detach
 b) socket init
 c) pollset init

 Post r2025:

 a) socket init
 b) pollset init
 c) detach

 Brooks' solution:

 a) socket init
 b) detach
 c) pollset init

 I could accept Brooks' solution, because it means that after detaching
 gmond would only fail for something like out-of-memory, while any
 configuration failure, port in use, etc. would cause it to fail before
 detaching.

If gmond still fails silently in some cases, you have not accomplished the
objective that you were trying to achieve with r2025 anyway.

The solution I proposed addresses the problem of reporting to the OS any
failure during initialization (which was the original bug to fix anyway)
in a straightforward way and is therefore the right way to correct this
IMHO, without introducing any regressions by changing long relied upon
semantics.

 Basically, we would have to split the code in  
 setup_listen_channels_pollset() into two functions, one that gets called  
 before detaching, and one that is called after detaching.

Why make the code more complicated? And do you really expect to do that
within the scope of a backport to 3.1.6, considering how intrusive that
would be?

Also be aware that there are bugfixes to that code that haven't yet been
backported, so you are going to have to either certify all of those fixes as
well, or cherry-pick the changes needed and test all the different combinations.

Carlo



Re: [Ganglia-developers] gmetad and rrdtool scalability

2009-12-13 Thread Vladimir Vuksan
I think you guys are overcomplicating this :-). Can't you simply have 
multiple gmetads in different sites poll a single gmond? That way, if one 
gmetad fails, data is still available and updated on the other gmetads. 
That is what we used to do.
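
To illustrate why this needs no coordination between the gmetads, here is
a rough Python sketch of what each poller does on its own. It assumes gmond
is configured with a TCP accept channel that dumps its XML state to any
client that connects (port 8649 is assumed as the default); poll_gmond()
is a hypothetical helper, not part of gmetad.

```python
import socket


def poll_gmond(host, port=8649, timeout=10.0):
    """Connect to a gmond TCP accept channel and return the XML it sends."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            data = conn.recv(65536)
            if not data:  # the peer closes the connection once it is done
                break
            chunks.append(data)
    return b"".join(chunks)
```

Each site would run the same poll against the same gmond and write its own
local RRDs, so losing one gmetad does not affect the others.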

Vladimir

On Sun, 13 Dec 2009, Spike Spiegel wrote:

 indeed, OS resource usage for caching should be tightly controlled.
 RRD does a pretty good job at that, and for example I know people who
 use collectd (which supports multiple output streams) to send data
 remotely and also keep a local copy with different retention policies
 to solve that problem.

 This would be addressed by the use of a SAN - there would only be one RRD
 file, and the gmetad servers would need to be in some agreement so that they
 don't both try to write the same file at the same time.

 sure, but even with a SAN you'd have to add some intelligence to
 gmetad, which from my pov is more than half of the work needed to
 achieve gmetad reliability and redundancy while keeping its current
 distributed design.
