Hi Leo:

On 2/26/08, Albee, Leo <[EMAIL PROTECTED]> wrote:

>  This is my current environment:
>  3 clusters each running on AIX 5.3 ML 6.
>  1 cluster made up of  5 nodes all running gmond (ver. 3.0.5) and 1 node 
> designated the master running gmetad (ver. 3.0.5)
>  1 cluster made up of  4 nodes all running gmond (ver. 3.0.5) and 1 node 
> designated the master running gmetad (ver. 3.0.5)
>  1 cluster  made up of  2 nodes all running gmond (ver. 3.0.5)
>  A Solaris 10 web server running the gmetad daemon (ver  3.0.5).
>
>  The PROBLEM:
>  The gmetad daemon on the web server periodically hangs, which prevents any 
> new updates to the RRD database. The only workaround is to stop apache and 
> kill (yes, kill) the gmetad process, then restart it. It runs fine for a while, 
> then the hang occurs again.
>
>  The RESEARCH:
>  I have examined the apache access and error logs and they are clean.   I 
> then reviewed the nohup startup file for gmetad  with logging verbosity 
> turned to 10. There are no errors appearing in this logfile. I then did a 
> telnet to the gmond port of each client and successfully received the xml 
> data.   I then decided to perform a truss on the gmetad pid and received the 
> following info:
>
>  [EMAIL PROTECTED] # truss -p 5944
>  /10:        Stopped by signal #24, SIGTSTP, in nanosleep()
>  /6:         Stopped by signal #24, SIGTSTP, in lwp_park()
>  /2:         Stopped by signal #24, SIGTSTP, in accept()
>  /7:         Stopped by signal #24, SIGTSTP, in lwp_park()
>  /4:         Stopped by signal #24, SIGTSTP, in accept()
>  /11:        Stopped by signal #24, SIGTSTP, in nanosleep()
>  /8:         Stopped by signal #24, SIGTSTP, in nanosleep()
>  /1:         Stopped by signal #24, SIGTSTP, in nanosleep()
>  /3:         Stopped by signal #24, SIGTSTP, in lwp_park()
>  /9:         Stopped by signal #24, SIGTSTP, in nanosleep()
>  /5:         Stopped by signal #24, SIGTSTP, in lwp_park()
>
>  It just seems to go into a sleep state with no warning or info. I have 
> successfully troubleshot problems in my ganglia configuration before 
> (IP changes, dir/file permissions, etc.), but this one has me scratching 
> my head. Is there a known issue with gmetad hanging during the polling 
> process? I can't afford to have production performance 
> data lost like this.  Is there anybody who can help?

Perhaps you can post the gmetad.conf on your Solaris 10 server.  BTW,
would DTrace be able to give you more information about the hanging
gmetad process?
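
In case it helps narrow things down, the telnet check you describe could be scripted so that every gmond is polled with a timeout and the result is recorded the next time the hang occurs. The following is only a rough sketch, not anything shipped with ganglia: the function names are made up for illustration, port 8649 is assumed to be the gmond TCP port (adjust it to whatever your gmond.conf uses), and the check for a closing </GANGLIA_XML> tag assumes the dump ends with that root element.

```python
import socket
import time


def fetch_gmond_xml(host, port=8649, timeout=10.0):
    """Read the whole XML dump from one gmond; return (xml_bytes, seconds).

    Port 8649 is an assumed default; use the port from your gmond.conf.
    """
    start = time.monotonic()
    chunks = []
    # The timeout applies to the connect and to every recv() that follows.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # the peer closed the connection: dump is finished
                break
            chunks.append(data)
    return b"".join(chunks), time.monotonic() - start


def check_sources(targets, timeout=10.0):
    """Poll each (host, port) pair and return (host, port, status) tuples."""
    results = []
    for host, port in targets:
        try:
            xml, elapsed = fetch_gmond_xml(host, port, timeout)
            # Assumption: a complete dump ends with the closing root tag.
            complete = xml.rstrip().endswith(b"</GANGLIA_XML>")
            status = "ok" if complete else "truncated"
            status += f" ({len(xml)} bytes in {elapsed:.2f}s)"
        except OSError as exc:  # covers timeouts and refused connections
            status = f"error: {exc}"
        results.append((host, port, status))
    return results
```

Calling check_sources() with a list such as [("aix-master1.example.com", 8649)] (a placeholder hostname) while gmetad is hung would show whether any data source is slow or returns a truncated dump. If every source answers promptly, the problem is more likely on the gmetad side.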

Cheers,

Bernard
