Hello All, 
 
This is my current environment:
3 clusters, each running AIX 5.3 ML 6:
1 cluster made up of 5 nodes, all running gmond (ver. 3.0.5), and 1 node
designated the master running gmetad (ver. 3.0.5)
1 cluster made up of 4 nodes, all running gmond (ver. 3.0.5), and 1 node
designated the master running gmetad (ver. 3.0.5)
1 cluster made up of 2 nodes, all running gmond (ver. 3.0.5)
A Solaris 10 web server running the gmetad daemon (ver. 3.0.5).
 
The PROBLEM: 
The gmetad daemon on the web server periodically hangs, which prevents any
new updates to the RRD database. The only workaround is to stop Apache and kill
(yes, kill) the gmetad process, then restart. It runs fine for a while, then
the hang occurs again.
 
The RESEARCH:
I have examined the Apache access and error logs and they are clean. I then
reviewed the nohup output file for gmetad with logging verbosity turned up to
10. No errors appear in this log file. I then telnetted to the gmond port of
each client and successfully received the XML data. Finally, I ran truss
against the gmetad pid and received the following output:
 
# truss -p 5944
/10:        Stopped by signal #24, SIGTSTP, in nanosleep()
/6:         Stopped by signal #24, SIGTSTP, in lwp_park()
/2:         Stopped by signal #24, SIGTSTP, in accept()
/7:         Stopped by signal #24, SIGTSTP, in lwp_park()
/4:         Stopped by signal #24, SIGTSTP, in accept()
/11:        Stopped by signal #24, SIGTSTP, in nanosleep()
/8:         Stopped by signal #24, SIGTSTP, in nanosleep()
/1:         Stopped by signal #24, SIGTSTP, in nanosleep()
/3:         Stopped by signal #24, SIGTSTP, in lwp_park()
/9:         Stopped by signal #24, SIGTSTP, in nanosleep()
/5:         Stopped by signal #24, SIGTSTP, in lwp_park()
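
To make the pattern in the truss output easier to see, here is a minimal Python sketch that tallies the signal and the system call reported for each LWP. The function name summarize_truss and the regular expression are my own, written only against the line format shown above; other truss line formats are not handled.

```python
import re
from collections import Counter

# Matches lines like "/10:   Stopped by signal #24, SIGTSTP, in nanosleep()".
# This pattern is an assumption based only on the output pasted above.
LINE_RE = re.compile(r"^/(\d+):\s+Stopped by signal #(\d+), (\w+), in (\w+)\(\)")

# The truss output from above, copied verbatim.
SAMPLE = """\
/10:        Stopped by signal #24, SIGTSTP, in nanosleep()
/6:         Stopped by signal #24, SIGTSTP, in lwp_park()
/2:         Stopped by signal #24, SIGTSTP, in accept()
/7:         Stopped by signal #24, SIGTSTP, in lwp_park()
/4:         Stopped by signal #24, SIGTSTP, in accept()
/11:        Stopped by signal #24, SIGTSTP, in nanosleep()
/8:         Stopped by signal #24, SIGTSTP, in nanosleep()
/1:         Stopped by signal #24, SIGTSTP, in nanosleep()
/3:         Stopped by signal #24, SIGTSTP, in lwp_park()
/9:         Stopped by signal #24, SIGTSTP, in nanosleep()
/5:         Stopped by signal #24, SIGTSTP, in lwp_park()
"""

def summarize_truss(text):
    """Count stopped LWPs by signal name and by the call they were stopped in."""
    signals, syscalls = Counter(), Counter()
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            signals[match.group(3)] += 1
            syscalls[match.group(4)] += 1
    return signals, syscalls

if __name__ == "__main__":
    signals, syscalls = summarize_truss(SAMPLE)
    print("by signal:", dict(signals))
    print("by call:  ", dict(syscalls))
```

Run against the output above, this shows that all 11 LWPs report the same signal, SIGTSTP, and that they were stopped in nanosleep(), lwp_park(), or accept().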
 
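For anyone who wants to repeat the per-node telnet check without doing it by hand, it could be scripted roughly as follows. This is only a sketch: it assumes gmond is listening on its default TCP port 8649 and that it dumps its XML and then closes the connection. The helper names fetch_gmond_xml and looks_complete are mine, and the host name in the usage comment is a placeholder, not one of my real nodes.

```python
import socket

def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Connect to a gmond TCP port and return everything it sends before closing.

    Port 8649 is the gmond default; change it if your gmond.conf differs.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # peer closed the connection, the dump is finished
                break
            chunks.append(data)
    return b"".join(chunks)

def looks_complete(xml_bytes):
    """Rough sanity check: the dump should end with the closing root tag."""
    return xml_bytes.rstrip().endswith(b"</GANGLIA_XML>")

# Example usage (host name is a placeholder):
#
#   xml = fetch_gmond_xml("node1.example.com")
#   print(len(xml), "bytes, complete:", looks_complete(xml))
```

The timeout means a node that accepts the connection but never answers raises socket.timeout instead of blocking forever, which is easier to spot than a telnet session that just sits there.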
It just seems to go into a sleep state with no warning or information. I have
successfully troubleshot problems in my ganglia configuration before
(IP changes, directory/file permissions, etc.), but this one has me scratching
my head. Is there a known issue with gmetad hanging during the polling
process? I can't afford to lose production performance data like this.
Can anybody help?
Thanks,
-Leo


_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general
