Hi all,

I'm running ganglia on a Debian node and a Solaris 8 x86 node to test it, and I'm running into problems on the Solaris node. The Debian node is currently the one running gmetad.

When I bring gmond up on the Solaris node, it works fine for a few minutes, but shortly afterwards the web interface reports it as down, even though it is clearly not down and the process is still running.

When I run the daemon with any level of debugging, this problem never occurs, but I get the following output with -d 1:

cpustuff: Permission denied
cpustuff: Permission denied
cpustuff: Permission denied
cpustuff: Permission denied
cpustuff: Permission denied
cpustuff: Permission denied
cpustuff: Not enough space
cpustuff: Not enough space


I just keep getting 'Not enough space' forever after that, but the daemon continues to function just fine.

If I run with -d 2 or higher, I get the following type of output (these are multiple excerpts, pulled somewhat randomly):

set_metric_value() exec'd cpu_user_func (10)
cpustuff: Permission denied
offset = -20931796, cpu_now[1] = 346648
cpustuff: Permission denied
offset = -523107352, cpu_now[1] = 586301
Raw:  bread / bwrite / lread / lwrite / phread / phwrite
110481,110481,0 / 372254,372241,13 / 5049919,5049897,22 / 938684,938649,35 / 34,34,0 / 0,0,0
Aftermath: 0.000000 0.812500 1.375000 2.187500 0.000000 0.000000

XDR data successfully sent
mcast_value() mcasting os_release value
encoded 12 XDR bytes
XDR data successfully sent
set_metric_value() exec'd cpu_user_func (10)
* * * * Setting alpha to 0.016667 and beta to 0.983333 because timediff = 0
pre_process_node() received a new node
cpustuff: Permission denied
offset = -20931796, cpu_now[1] = 346731
cpustuff: Permission denied
offset = -523107352, cpu_now[1] = 586391
Raw:  bread / bwrite / lread / lwrite / phread / phwrite
110481,0,110481 / 372291,0,372291 / 5049968,0,5049968 / 938746,0,938746 / 34,0,34 / 0,0,0 Aftermath: 0.000100 0.000338 0.004586 0.000852 0.000000 0.000000 delta = 1101251722
** ** ** ** ** Are percentages electric?  Try -23%, -3% , 11% , 4% , 0% 0%
mcast_value() mcasting cpu_user value

set_metric_value() exec'd cpu_user_func (10)
cpustuff: Permission denied
offset = -20931796, cpu_now[1] = 346752
cpustuff: Permission denied
offset = -523107352, cpu_now[1] = 586423
Raw:  bread / bwrite / lread / lwrite / phread / phwrite
110481,110481,0 / 372317,372316,1 / 5050063,5050044,19 / 938796,938791,5 / 34,34,0 / 0,0,0 Aftermath: 0.000000 0.062500 1.187500 0.312500 0.000000 0.000000 delta = 16
** ** ** ** ** Are percentages electric?  Try -13%, 6% , 10% , 4% , 0% 0%
set_metric_value() exec'd bwrite_sec_func (31)


cpustuff: Not enough space
offset = -20931796, cpu_now[1] = 347152
cpustuff: Not enough space
offset = -523107352, cpu_now[1] = 587363
Raw:  bread / bwrite / lread / lwrite / phread / phwrite
110500,110488,12 / 372398,372389,9 / 5050309,5050246,63 / 938981,938962,19 / 34,34,0 / 0,0,0 Aftermath: 0.750000 0.562500 3.937500 1.187500 0.000000 0.000000 delta = 16
** ** ** ** ** Are percentages electric?  Try -13%, 6% , 10% , 4% , 0% 0%
set_metric_value() exec'd cpu_idle_func (13)
set_metric_value() exec'd lwrite_sec_func (33)


set_metric_value() exec'd cpu_user_func (10)
cpustuff: Not enough space
offset = -20931796, cpu_now[1] = 347389
cpustuff: Not enough space
offset = -523107352, cpu_now[1] = 588141
Raw:  bread / bwrite / lread / lwrite / phread / phwrite
110509,110509,0 / 372715,372700,15 / 5051982,5051912,70 / 939971,939912,59 / 34,34,0 / 0,0,0 Aftermath: 0.000000 0.937500 4.375000 3.687500 0.000000 0.000000 delta = 16
** ** ** ** ** Are percentages electric?  Try -13%, 6% , 10% , 4% , 0% 0%


Notice the 'Permission denied' and 'Not enough space' errors.
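
Those lines have the "prefix: message" shape that perror() prints, which is my guess at where they come from; I have not confirmed which call in the daemon is failing. As a rough sketch of that guess, the snippet below rebuilds the same text from an errno value. The helper names format_sys_error and show_suspects are made up for illustration and are not part of ganglia. On Solaris, EACCES maps to "Permission denied" and ENOMEM maps to "Not enough space"; other systems word ENOMEM differently.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of ganglia): builds the same
 * "prefix: message" text that perror(prefix) would write to stderr
 * for the given errno value. Returns what snprintf() returns. */
static int format_sys_error(char *buf, size_t len, const char *prefix, int err)
{
    return snprintf(buf, len, "%s: %s", prefix, strerror(err));
}

/* Prints the two errno values I suspect are behind the debug output.
 * The exact wording of the ENOMEM message depends on the platform. */
static void show_suspects(void)
{
    char line[128];

    format_sys_error(line, sizeof line, "cpustuff", EACCES);
    puts(line);

    format_sys_error(line, sizeof line, "cpustuff", ENOMEM);
    puts(line);
}
```

If that guess is right, the thing to look for in the source is a perror("cpustuff") call (or similar) rather than the message text itself, and then which system call precedes it.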

I can't seem to find either of those errors in the ganglia source tree, so they are apparently system errors. If I run the process under truss (by attaching a truss to the process after it's started successfully), the host is marked down within about 2 minutes, and I get the following in the output:

recvfrom(1, "\0\0\01D D 8 8B0", 1472, 0, 0xDF605958, 0xDF604DD4) = 8
lwp_sema_post(0xDF504E6C)                       = 0
lwp_sema_wait(0xDF504E6C)                       = 0
gettimeofday(0xDF605378)                        = 0
pause()                         (sleeping...)
signotifywait()                 (sleeping...)
lwp_sema_wait(0xDF504E6C)       (sleeping...)
accept(2, 0xDF302C4C, 0xDF302C44, 1) (sleeping...)
pread64(3, 0xDF100A08, 712, 0xFEC09B2C) (sleeping...)
lwp_cond_wait(0xDF8E95E8, 0xDF8E95D0, 0xDEC07D78) (sleeping...)
recvfrom(1, 0xDF605968, 1472, 0, 0xDF605958, 0xDF604DD4) (sleeping...)
lwp_cond_wait(0xDF8EFE10, 0xDF8EFE20, 0xDF8E9640) (sleeping...)
recvfrom(1, "\0\0\0\f >86 002", 1472, 0, 0xDF605958, 0xDF604DD4) = 8
lwp_sema_post(0xDF504E6C)                       = 0
lwp_sema_wait(0xDF504E6C)                       = 0
gettimeofday(0xDF605378)                        = 0
pause()                         (sleeping...)
signotifywait()                 (sleeping...)
lwp_sema_wait(0xDF504E6C)       (sleeping...)
accept(2, 0xDF302C4C, 0xDF302C44, 1) (sleeping...)
pread64(3, 0xDF100A08, 712, 0xFEC09B2C) (sleeping...)
lwp_cond_wait(0xDF8E95E8, 0xDF8E95D0, 0xDEC07D78) (sleeping...)
recvfrom(1, 0xDF605968, 1472, 0, 0xDF605958, 0xDF604DD4) (sleeping...)
lwp_cond_wait(0xDF8EFE10, 0xDF8EFE20, 0xDF8E9640) (sleeping...)
recvfrom(1, "\0\0\01A AA3B5 !", 1472, 0, 0xDF605958, 0xDF604DD4) = 8
lwp_sema_post(0xDF504E6C)                       = 0
lwp_sema_wait(0xDF504E6C)                       = 0
gettimeofday(0xDF604D10)                        = 0
pause()                         (sleeping...)
signotifywait()                 (sleeping...)
lwp_sema_wait(0xDF504E6C)       (sleeping...)
accept(2, 0xDF302C4C, 0xDF302C44, 1) (sleeping...)
pread64(3, 0xDF100A08, 712, 0xFEC09B2C) (sleeping...)
lwp_cond_wait(0xDF8E95E8, 0xDF8E95D0, 0xDEC07D78) (sleeping...)
recvfrom(1, 0xDF605968, 1472, 0, 0xDF605958, 0xDF604DD4) (sleeping...)
lwp_cond_wait(0xDF8EFE10, 0xDF8EFE20, 0xDF8E9640) (sleeping...)


It doesn't seem particularly useful, but hey, there is a '!' character in there... :)

This isn't exactly life-threatening, since I am only testing at this point and it still works fine in minimal debugging mode, but this seems pretty weird.

The process is running as the 'nobody' user, and everything was compiled with '--prefix=/usr/local --with-metad' using gcc 3.3.2.

Anyone have any ideas?

--
In our civilization, and under our republican form of government,
intelligence is so highly honored that it is rewarded by exemption from the
cares of office.
                    --Ambrose Bierce
---------------------------------------------------------------------
Luke Kanies | http://abstractive.org | http://www.bladelogic.com
