[Ganglia-general] Debugging problems with Ganglia

Douglas Wade Needham Thu, 17 Dec 2009 11:27:07 -0800

Greetings,

We use Ganglia where I work, and I have been trying to diagnose
problems on the node that runs our 'main' gmond along with gmetad and
other processes.  Unfortunately, I did not set this up, and am only
now looking at Ganglia in any detail.


Here are the details:
        Version:        3.1.2
        Host:           VM running Debian Etch (1 VCPU, 512 MB) w/ 2.6.26-2-xen-amd64 kernel
        HW Host:        Dual AMD Opteron 244's running at 1.8GHz
        Nodes:          64
        Metrics/node:   40 defined, yielding an average of 64 packets/min per host

We are using unicast messages to transmit the metrics from the nodes
to the gmond on our monitoring VM, and the gmetad on that machine
(with the default number of threads) pulls the data, and does what it
is supposed to... except...

When a program connects to this gmetad across our 1Gbps network, we
end up with very gappy data, and sometimes the hosts are simply marked
as down and the RRDs stop updating.  I have already started pressuring
those who would approve moving our RRDs to a memory fs, but in the
meantime... :(

I have been running strace against gmetad, and the traces show threads
occasionally blocking in futex() for 10+ seconds, and sometimes for as
long as 500+ seconds.  To limit the impact of the strace (which itself
can cause the same problem), I even had to restrict it to:

        strace -f -tt -T -s 160 -e trace=process,futex,signal -o gmetad.strace.out2 -p 12618
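
As an illustration of how the long futex() waits could be picked out of
such a trace, here is a minimal sketch in C.  It assumes the usual
strace layout, where -T appends the time spent in the call as
"<seconds>" at the end of each line.  The function names and the sample
line format are assumptions made for this example, not part of strace
or Ganglia.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return the duration in seconds that strace -T appends as "<secs>" at
 * the end of a line, or -1.0 when the line carries no duration (for
 * example an "<unfinished ...>" line written under -f). */
double strace_duration(const char *line)
{
    assert(line != NULL);

    /* The duration, when present, is the last "<...>" group. */
    const char *open = strrchr(line, '<');
    if (open == NULL)
        return -1.0;

    char *end = NULL;
    double secs = strtod(open + 1, &end);
    if (end == open + 1 || *end != '>')
        return -1.0;    /* not a number, or not closed by '>' */
    return secs;
}

/* Nonzero when the line mentions futex and took at least min_secs. */
int is_slow_futex(const char *line, double min_secs)
{
    return strstr(line, "futex") != NULL
        && strace_duration(line) >= min_secs;
}
```

Feeding each line of gmetad.strace.out2 (read with fgets(), say) to
is_slow_futex() with a threshold of 10.0 would list only the calls that
blocked for ten seconds or more, including the "resumed" half of calls
that strace split across two lines.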

But in doing this, I have come up with the following questions:

1) Is there any difference between '-d 1' and '-d 10'?  Or between
   'debug 1' and 'debug 10' in the config file?

   In looking through the code, it does not seem to be the case.  I
   would just like confirmation.

2) Am I seeing correctly that we have the following pthread_mutex
   definitions?

   - server_socket_mutex
   - server_interactive_mutex
   - Allocated mutex for root summary.
   - Allocated mutex for each grid partial-summary (1 per data source)
   - Allocated mutex for each cluster partial summary (1+ per data source)

3) Would there be any interest in patches against 3.1.2 to watch
   calls to pthread_mutex_lock() and pthread_mutex_unlock() to display
   when a call took more than a certain amount of time to return, or
   if a lock was held for longer than a certain time?

This last one comes up because, given my suspicions of thread
starvation, I am going to have to instrument gmetad a bit more to look
at the mutexes and how long we spend in critical sections.
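
The kind of instrumentation meant in question 3 might look roughly like
the sketch below.  It is only an illustration: timed_mutex_t and the
timed_mutex_*() functions are hypothetical names, not anything in the
Ganglia source, and the warning thresholds are left to the caller.  It
wraps pthread_mutex_lock() and pthread_mutex_unlock() and uses
clock_gettime() to measure how long a thread waited for the lock and
how long it then held it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical wrapper: a mutex, a label, and when it was last taken. */
typedef struct {
    pthread_mutex_t mutex;
    struct timespec acquired;
    const char *name;
} timed_mutex_t;

static double elapsed_secs(const struct timespec *from,
                           const struct timespec *to)
{
    return (double)(to->tv_sec - from->tv_sec)
         + (double)(to->tv_nsec - from->tv_nsec) / 1e9;
}

int timed_mutex_init(timed_mutex_t *tm, const char *name)
{
    assert(tm != NULL);
    tm->name = name;
    return pthread_mutex_init(&tm->mutex, NULL);
}

/* Lock; return seconds spent waiting, or -1.0 if locking failed. */
double timed_mutex_lock(timed_mutex_t *tm, double warn_secs)
{
    struct timespec start;
    clock_gettime(CLOCK_MONOTONIC, &start);
    if (pthread_mutex_lock(&tm->mutex) != 0)
        return -1.0;
    /* Written while holding the lock, so only the owner touches it. */
    clock_gettime(CLOCK_MONOTONIC, &tm->acquired);

    double waited = elapsed_secs(&start, &tm->acquired);
    if (waited > warn_secs)
        fprintf(stderr, "%s: waited %.3f s for lock\n", tm->name, waited);
    return waited;
}

/* Unlock; return seconds the lock was held, or -1.0 on failure. */
double timed_mutex_unlock(timed_mutex_t *tm, double warn_secs)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    double held = elapsed_secs(&tm->acquired, &now);
    if (pthread_mutex_unlock(&tm->mutex) != 0)
        return -1.0;
    if (held > warn_secs)
        fprintf(stderr, "%s: lock held for %.3f s\n", tm->name, held);
    return held;
}
```

The hold time is computed before the unlock so that the timestamp is
never read after another thread may have overwritten it.  Older glibc
versions may need -lrt as well as -lpthread at link time.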

Thanks!

- Doug

-- 
Douglas Wade Needham - KA8ZRT        UN*X Consultant & UW/BSD kernel programmer
Email:  cinnion @ ka8zrt . com       http://www.ka8zrt.com
Disclaimer: My opinions are my own.  Since I don't want them, why
            should my employer, or anybody else for that matter! 
