>>> At least a revert would be needed for 3.1, as this accounts for a regression,
>>> but I haven't done so yet; I am waiting for you to first revert it on trunk and
>>> then decide how to proceed from there, depending on how critical this
>>> feature was for the release.
>>>   
>>>       
>> I agree that it is a regression, but reverting it may cause the real  
>> culprit to remain hidden.  I'd rather hold the release while we look  
>> more closely.
>>     
>
> not sure if I understand what you meant here, since it would be obvious to
> me that 3.1.5 can't be released if a fix (even if it is just reverting the
> change) is committed.
>   
Maybe a more gung-ho release manager would pull out the change and 
release anyway (but I'll resist the urge to name any company in particular).
> are you saying you want to hold off on deciding whether to release 3.1.5, or
> to see what will be in 3.1.6? If the latter, I would suggest also pulling in
> some other fixes, and of course that would also require us to agree
> on a bootstrapping environment for this release at least.
>
>   
I propose we now aim for 3.1.6, which should fix this issue, and may 
also take in some other fixes too.  Given that the last few release 
candidates have been scrapped, we should aim to get 3.1.6 out with 
minimal extra changes though, just essential bug fixes, and maybe slip 
in the PCRE patch.  Agreeing on the bootstrap issue is a definite 
prerequisite for the 3.1.6 release.
>>>> The change has been working on Linux, Solaris and Cygwin.
>>>>         
>>> Other than doing a manual bisect (using git instead of svn here would
>>> have been useful) to find where the problem was introduced, and validating
>>> that reverting it corrects the problem, I haven't done much analysis of it,
>>> but the fact that it broke in such a strange way (I was indeed expecting the
>>> culprit to be somewhere else, especially considering all the recent changes
>>> in the networking and the fact that it originally seemed to be triggered by
>>> a TCP request) probably points to a bigger issue which just happens not to
>>> have been visible on the configurations used to test Linux, Solaris and
>>> Cygwin, especially considering how pervasive it was (it broke every BSD I
>>> had access to for testing, at least).
>>>   
>>>       
>> Can you provide output from strace/truss and also a stack trace from the  
>> point where it is in the infinite loop?
>>     
>
> filed BUG246 with the trace information (collected from OpenBSD 4.5 amd64)
> using ktrace, but you got me there.
>
> from the way the problem presents itself, it isn't really obvious where the
> offending code is, and it is difficult to debug as well, since it disappears
> when in debug mode or when not running as a daemon, which is the reason why
> I haven't been able to capture a backtrace yet either.
>   
Setting up a fresh OpenBSD VM may take me a couple of hours, and I would 
much rather spend that time on other Ganglia coding, so I would 
certainly appreciate it if you could help get to the bottom of this issue.

http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=246

From looking at the bug report:

- Could it be a security issue?  Can you try disabling setuid?  It 
appears that listen channels are only set up after setuid, but maybe 
there is something else.

- Have you tried different versions of APR?  E.g. on RHEL5, I test with 
the native apr-1.2.7, and on Debian I have 1.2.12-5

- Can you easily re-compile APR with a different poll implementation?  I 
think you can change it from configure.

- If you take 3.1.2 or another release and apply this patch only, do you 
see the same bug?

- Could it be closing the wrong socket at some point when daemonizing?  
I had an odd problem with rtpproxy some time ago where it closed 
stdin/stdout/stderr, descriptors 0, 1 and 2 got re-used by other 
sockets, and some stray calls to fprintf(stderr,...) caused mayhem.

- Is everything that is done pre-daemonize meant to be safe to pass to a 
child process?  Normally memory allocations, sockets, etc should all be 
available to the child, so I think we should be fine.

Here is the way apr_proc_detach() is defined:

http://svn.apache.org/repos/asf/apr/apr/trunk/threadproc/unix/procsup.c


>   
>> There is a good reason for moving the daemonize code the way I did - an  
>> alternative would be to daemonize, but make the original process hang  
>> around until the daemon process has entered the main loop.
>>     
>
> OK, and I assume it is probably related to the cases where gmond "suddenly"
> dies at startup without notification, but some clarification of what
> problem you were trying to solve would probably be useful too.
>   
Yes, exactly - pre-3.1.3, if gmond failed to start, the init script 
would still say [OK] because gmond always exited with the return code 0 
when daemonising.

The [OK] from the init script was misleading.  Although you could easily 
check that the process wasn't really running, and you can usually find 
out why by looking at the log, I think it is better to have the init 
script report [FAILED].
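
For reference, the alternative I mentioned (daemonize, but keep the original 
process around until the daemon is ready) could look roughly like the sketch 
below.  This is not the gmond code and it does not use apr_proc_detach(); it 
is plain POSIX C, and the names daemon_fork(), daemon_report() and the ROLE_* 
constants are made up for illustration.  The original process blocks on a 
pipe until the child writes one status byte, so its exit code can reflect 
whether startup really worked.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical names, for illustration only.  The two parent values are
 * chosen so they can be used directly as the process exit code. */
enum { ROLE_PARENT_OK = 0, ROLE_PARENT_FAILED = 1, ROLE_CHILD = 2 };

/* Fork a daemon process.  In the child this returns ROLE_CHILD at once and
 * stores in *notify_fd the descriptor to pass to daemon_report().  In the
 * original process it blocks until the child has reported (or has died
 * without reporting) and returns ROLE_PARENT_OK or ROLE_PARENT_FAILED. */
int daemon_fork(int *notify_fd)
{
    int pipefd[2];
    if (pipe(pipefd) < 0)
        return ROLE_PARENT_FAILED;

    pid_t pid = fork();
    if (pid < 0) {
        close(pipefd[0]);
        close(pipefd[1]);
        return ROLE_PARENT_FAILED;
    }

    if (pid == 0) {
        /* Child: keep only the write end and detach from the session. */
        close(pipefd[0]);
        setsid();
        *notify_fd = pipefd[1];
        return ROLE_CHILD;
    }

    /* Parent: close our copy of the write end, otherwise read() would
     * never see end-of-file if the child dies before reporting. */
    close(pipefd[1]);

    unsigned char status = 1;
    ssize_t n;
    do {
        n = read(pipefd[0], &status, 1);
    } while (n < 0 && errno == EINTR);
    close(pipefd[0]);

    /* n == 0 means the child exited without writing anything. */
    if (n != 1 || status != 0)
        return ROLE_PARENT_FAILED;
    return ROLE_PARENT_OK;
}

/* Called in the child once setup is finished, just before the main loop
 * (ok != 0), or as soon as setup has failed (ok == 0). */
void daemon_report(int notify_fd, int ok)
{
    unsigned char status = ok ? 0 : 1;
    if (write(notify_fd, &status, 1) != 1) {
        /* Nothing useful can be done here; the parent will see
         * end-of-file and treat it as a failure. */
    }
    close(notify_fd);
}
```

In this sketch main() would call daemon_fork(); the original process would 
simply return the result as its exit code, while the child would create its 
sockets, call daemon_report(fd, 1) and enter the main loop, or call 
daemon_report(fd, 0) and exit on error.  The init script would then see a 
non-zero exit code whenever startup fails.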



