Brad Nicholes wrote:
> On 2/2/2010 at 6:23 AM, in message <4b682769.6000...@pocock.com.au>, Daniel
> Pocock <dan...@pocock.com.au> wrote:
>
>   
>> I've just been testing r2258 on CentOS 5.  rpmbuild runs successfully 
>> and the packages install and run.
>>
>> However, I notice that some of the tcpconn metrics are failing.  
>> tcpconn.py doesn't appear to have changed since r1658 (August 2008).  It 
>> is the only python module that is loaded by default.
>>
>> The commit mentions moving the netstat thread start - are you able to 
>> have a look at this Brad?
>>
>> You can get my tarball from http://www.pocock.com.au/ganglia/test if you 
>> need to.  It is bootstrapped on Debian 5.
>>
>>
>>     metric 'tcp_established' being collected now
>>     metric 'tcp_established' has value_threshold 1.000000
>>     metric 'tcp_listen' being collected now
>> [PYTHON] Can't call the metric handler function for [tcp_listen] in the 
>> python module [tcpconn].
>>
>> Traceback (most recent call last):
>>   File "/usr/lib/ganglia/python_modules/tcpconn.py", line 67, in 
>> TCP_Connections
>>     _WorkerThread.start()
>>   File "/usr/lib/python2.4/threading.py", line 410, in start
>>     assert not self.__started, "thread already started"
>> AssertionError: thread already started
>>     metric 'tcp_listen' has value_threshold 1.000000
>>     metric 'tcp_timewait' being collected now
>> [PYTHON] Can't call the metric handler function for [tcp_timewait] in 
>> the python module [tcpconn].
>>
>>     
>
> I can't reproduce the problem, so all I can do is guess at what might be
> happening and leave it to somebody who is seeing the issue to verify it.
> The exception that you are seeing is the result of an attempt to start the
> same thread more than once.  There is an if statement in TCP_Connections()
> that is supposed to prevent this from happening.  This if statement checks
> two thread variables that should indicate what state the thread is in.  The
> running thread variable is set to false during thread initialization and is
> set to true as soon as the thread's run method is called.  The run method of
> the thread is called as a result of calling the start() method on the thread
> object.  Each time that one of the tcpconn metrics is gathered, the metric
> callback hits the thread start if statement.  If the running thread variable
> is set to true, then no other metric invocation should be allowed to start
> the thread again.
>
>   
When you say you can't reproduce the problem, are you trying on a 
CentOS5/RHEL5 box, or something different?

> There is a very small window where, on initial startup, two metric callbacks
> could get past the if statement in TCP_Connections() and try to start the
> thread a second time.  The window would be caused by a delay between the
> time that the start() method is called and when the threading module finally
> calls the thread's run() method.  We could add a try...except block around
> the start() call to catch and ignore the exception if the thread is started a
> second time.  But the part that bothers me is that, judging by the list of
> exceptions, the start was obviously attempted more than just a second time.
>
> So my questions are: is the thread really running when the second or later
> attempts are made?  Is the thread bailing out somewhere before the "running"
> thread variable is set?  If we added the try...except block and ignored the
> exception, would that leave the thread running and in a functional state?
> Without being able to reproduce the problem, I can't really answer these
> questions.
>
>   
I don't know exactly how to check those things.

What I can see is that the errors only appear when the daemon starts 
(maybe the first time it collects each metric).  After that, the values 
are transmitted.  Can you give any examples of how to debug this for 
someone who is not a Python expert?
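
One possible way to look into this, as a rough sketch only: wrap the start() call so that the "already started" exception is caught, and write the thread's state to stderr on every metric callback.  The names start_once(), thread_state() and DemoWorker are hypothetical and not part of tcpconn.py or gmond.  The code targets a current Python; on Python 2.4, is_alive() is spelled isAlive() and the "with" statement would have to become explicit acquire()/release() calls.

```python
import sys
import threading

_start_lock = threading.Lock()   # serializes start attempts across callbacks


def thread_state(worker):
    """Return a one-line summary of a worker thread's state."""
    return "name=%s alive=%s running=%s" % (
        worker.name, worker.is_alive(), getattr(worker, "running", "n/a"))


def start_once(worker, metric_name, log=sys.stderr):
    """Try to start the worker, ignore a repeated start, and log the state."""
    with _start_lock:
        try:
            worker.start()
            outcome = "started"
        except (RuntimeError, AssertionError):
            # RuntimeError on Python 3, AssertionError on Python 2.4.
            outcome = "already started"
    log.write("[%s] %s: %s\n" % (metric_name, outcome, thread_state(worker)))
    return outcome


class DemoWorker(threading.Thread):
    """Hypothetical stand-in for the worker thread in tcpconn.py."""

    def __init__(self):
        threading.Thread.__init__(self)
        self.daemon = True
        self.running = False
        self.stop_evt = threading.Event()

    def run(self):
        self.running = True
        self.stop_evt.wait()     # placeholder for the polling loop


worker = DemoWorker()
outcomes = [start_once(worker, name)
            for name in ("tcp_established", "tcp_listen", "tcp_timewait")]
still_alive = worker.is_alive()  # is the thread functional after the retries?

worker.stop_evt.set()
worker.join(5)
```

If the log lines show alive=True after an "already started" outcome, the ignored exception did not disturb the running thread.  A line showing running=False right after "started" would point at the startup window described in the quoted text.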

Do you think this is a showstopper for 3.1.6?  I don't believe it can be 
a regression specific to this release, because tcpconn.py hasn't changed 
in such a long time.  I'd like to try to tag 3.1.6 today or tomorrow, 
unless anyone has any issues that I've missed.





_______________________________________________
Ganglia-developers mailing list
Ganglia-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-developers
