>>> On 2/2/2010 at 6:23 AM, in message <4b682769.6000...@pocock.com.au>, Daniel
Pocock <dan...@pocock.com.au> wrote:

> 
> I've just been testing r2258 on CentOS 5.  rpmbuild runs successfully 
> and the packages install and run.
> 
> However, I notice that some of the tcpconn metrics are failing.  
> tcpconn.py doesn't appear to have changed since r1658 (August 2008).  It 
> is the only python module that is loaded by default.
> 
> The commit mentions moving the netstat thread start - are you able to 
> have a look at this Brad?
> 
> You can get my tarball from http://www.pocock.com.au/ganglia/test if you 
> need to.  It is bootstrapped on Debian 5.
> 
> 
>     metric 'tcp_established' being collected now
>     metric 'tcp_established' has value_threshold 1.000000
>     metric 'tcp_listen' being collected now
> [PYTHON] Can't call the metric handler function for [tcp_listen] in the 
> python module [tcpconn].
> 
> Traceback (most recent call last):
>   File "/usr/lib/ganglia/python_modules/tcpconn.py", line 67, in 
> TCP_Connections
>     _WorkerThread.start()
>   File "/usr/lib/python2.4/threading.py", line 410, in start
>     assert not self.__started, "thread already started"
> AssertionError: thread already started
>     metric 'tcp_listen' has value_threshold 1.000000
>     metric 'tcp_timewait' being collected now
> [PYTHON] Can't call the metric handler function for [tcp_timewait] in 
> the python module [tcpconn].
> 

I can't reproduce the problem, so all I can do is guess at what might be 
happening and leave it to somebody who is seeing the issue to verify it.  The 
exception that you are seeing is the result of a thread being started more 
than once.  There is an if statement in TCP_Connections() that is supposed to 
prevent this from happening.  This if statement checks two thread variables 
that should indicate what state the thread is in.  The running thread variable 
is set to false during thread initialization and is set to true as soon as the 
thread's run method is called.  The run method of the thread is called as a 
result of calling the start() method on the thread object.  Each time that one 
of the tcpconn metrics is gathered, the metric callback hits the thread start 
if statement.  If the running thread variable is set to true, then no other 
metric invocation should be allowed to start the thread again.  
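
To make the description concrete, here is a minimal sketch of the guard as I 
understand it.  This is not the actual tcpconn.py code: the class name 
NetstatThread, the flag names running and shuttingdown, the entered event and 
the metric_callback function are all assumptions made for illustration, and it 
is written against a current Python rather than 2.4.

```python
import threading
import time

class NetstatThread(threading.Thread):
    """Hypothetical stand-in for the tcpconn worker thread."""

    def __init__(self):
        threading.Thread.__init__(self)
        self.daemon = True
        self.running = False        # false at thread initialization
        self.shuttingdown = False   # assumed second state variable
        self.entered = threading.Event()  # illustration only: signals run() began

    def run(self):
        self.running = True         # only true once run() is actually entered
        self.entered.set()
        while not self.shuttingdown:
            time.sleep(0.05)        # placeholder for the netstat polling work
        self.running = False

worker = NetstatThread()

def metric_callback(name):
    """Hypothetical metric handler: every metric passes through this guard."""
    if not worker.running and not worker.shuttingdown:
        worker.start()
    return 0
```

Note that the guard looks at a flag that is only set inside run(), not at 
whether start() has already been called.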

There is a very small window where, on initial startup, two metric callbacks 
could get past the if statement in TCP_Connections() and try to start the 
thread a second time.  The window would be caused by a delay between the time 
that the start() method is called and when the threading module finally calls 
the thread's run() method.  We could add a try...except block around the 
start() call to catch and ignore the exception if the thread is started a 
second time.  But the part that bothers me is that, judging by the list of 
exceptions, the start was obviously attempted more than just a second time.  
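
The try...except idea could look something like the sketch below.  The helper 
name start_once is hypothetical and not part of tcpconn.py.  On Python 2.4 a 
second start() shows up as the AssertionError in the traceback above; newer 
Python versions raise RuntimeError instead, so the sketch catches both.  Be 
aware that this is a fairly broad catch and could also hide other start-up 
failures.

```python
import threading

def start_once(thread):
    """Start the thread, ignoring an 'already started' error.

    Returns True if this call started the thread, False if start()
    had already been called on it.
    """
    try:
        thread.start()
        return True
    except (AssertionError, RuntimeError):
        # AssertionError on Python 2.4, RuntimeError on later versions
        return False
```

This only silences the error.  It does not tell us whether the thread that was 
started earlier is still alive and doing its job.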

So my questions are: is the thread really running when the second and later 
attempts are made?  Is the thread bailing out somewhere before the "running" 
thread variable is set?  If we added the try...except block and ignored the 
exception, would that leave the thread running and in a functional state?  
Without being able to reproduce the problem, I can't really answer these 
questions.
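
For whoever can reproduce this, the following sketch shows one way the second 
question could produce repeated failures.  It is only a guess at a possible 
failure mode, not a diagnosis: EarlyExitWorker is a hypothetical class whose 
run() returns before the flag is set.  A Thread object can only be started 
once, so after that every guarded start() attempt fails.

```python
import threading

class EarlyExitWorker(threading.Thread):
    """Hypothetical worker whose run() bails out before setting the flag."""

    def __init__(self):
        threading.Thread.__init__(self)
        self.running = False

    def run(self):
        return  # simulates exiting before 'running' is ever set to True

w = EarlyExitWorker()
w.start()   # first metric callback starts the thread
w.join()    # the thread has already finished

failures = 0
for name in ('tcp_listen', 'tcp_timewait'):
    if not w.running:               # guard still passes, flag was never set
        try:
            w.start()               # a Thread cannot be started twice
        except (AssertionError, RuntimeError):
            failures += 1

thread_alive = w.is_alive()         # False here: the worker is not running
```

Checking is_alive() (isAlive() on Python 2.4) on the worker at the point of 
the exception would help tell this case apart from the startup race.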

Brad

