Hi all,
I'm running ganglia on a Debian node and a Solaris 8 x86 node to test
it, and I'm running into problems on the Solaris node. The Debian node
is currently the one running gmetad.
When I bring gmond up on the Solaris node, it works fine for a few
minutes, but soon afterwards the web interface reports it as down,
even though the process is clearly still running.
When I run the daemon with any level of debugging, this problem never
occurs, but I get the following output with -d 1:
cpustuff: Permission denied
cpustuff: Permission denied
cpustuff: Permission denied
cpustuff: Permission denied
cpustuff: Permission denied
cpustuff: Permission denied
cpustuff: Not enough space
cpustuff: Not enough space
I just keep getting 'Not enough space' forever after that, but the
daemon continues to function just fine.
If I run with -d 2 or higher, I get the following type of output
(these are several separate excerpts, pulled somewhat randomly):
set_metric_value() exec'd cpu_user_func (10)
cpustuff: Permission denied
offset = -20931796, cpu_now[1] = 346648
cpustuff: Permission denied
offset = -523107352, cpu_now[1] = 586301
Raw: bread / bwrite / lread / lwrite / phread / phwrite
110481,110481,0 / 372254,372241,13 / 5049919,5049897,22 /
938684,938649,35 / 34,34,0 / 0,0,0
Aftermath: 0.000000 0.812500 1.375000 2.187500 0.000000 0.000000
XDR data successfully sent
mcast_value() mcasting os_release value
encoded 12 XDR bytes
XDR data successfully sent
set_metric_value() exec'd cpu_user_func (10)
* * * * Setting alpha to 0.016667 and beta to 0.983333 because timediff = 0
pre_process_node() received a new node
cpustuff: Permission denied
offset = -20931796, cpu_now[1] = 346731
cpustuff: Permission denied
offset = -523107352, cpu_now[1] = 586391
Raw: bread / bwrite / lread / lwrite / phread / phwrite
110481,0,110481 / 372291,0,372291 / 5049968,0,5049968 / 938746,0,938746
/ 34,0,34 / 0,0,0
Aftermath: 0.000100 0.000338 0.004586 0.000852 0.000000 0.000000 delta = 1101251722
** ** ** ** ** Are percentages electric? Try -23%, -3% , 11% , 4% , 0% 0%
mcast_value() mcasting cpu_user value
set_metric_value() exec'd cpu_user_func (10)
cpustuff: Permission denied
offset = -20931796, cpu_now[1] = 346752
cpustuff: Permission denied
offset = -523107352, cpu_now[1] = 586423
Raw: bread / bwrite / lread / lwrite / phread / phwrite
110481,110481,0 / 372317,372316,1 / 5050063,5050044,19 / 938796,938791,5
/ 34,34,0 / 0,0,0
Aftermath: 0.000000 0.062500 1.187500 0.312500 0.000000 0.000000 delta = 16
** ** ** ** ** Are percentages electric? Try -13%, 6% , 10% , 4% , 0% 0%
set_metric_value() exec'd bwrite_sec_func (31)
cpustuff: Not enough space
offset = -20931796, cpu_now[1] = 347152
cpustuff: Not enough space
offset = -523107352, cpu_now[1] = 587363
Raw: bread / bwrite / lread / lwrite / phread / phwrite
110500,110488,12 / 372398,372389,9 / 5050309,5050246,63 /
938981,938962,19 / 34,34,0 / 0,0,0
Aftermath: 0.750000 0.562500 3.937500 1.187500 0.000000 0.000000 delta = 16
** ** ** ** ** Are percentages electric? Try -13%, 6% , 10% , 4% , 0% 0%
set_metric_value() exec'd cpu_idle_func (13)
set_metric_value() exec'd lwrite_sec_func (33)
set_metric_value() exec'd cpu_user_func (10)
cpustuff: Not enough space
offset = -20931796, cpu_now[1] = 347389
cpustuff: Not enough space
offset = -523107352, cpu_now[1] = 588141
Raw: bread / bwrite / lread / lwrite / phread / phwrite
110509,110509,0 / 372715,372700,15 / 5051982,5051912,70 /
939971,939912,59 / 34,34,0 / 0,0,0
Aftermath: 0.000000 0.937500 4.375000 3.687500 0.000000 0.000000 delta = 16
** ** ** ** ** Are percentages electric? Try -13%, 6% , 10% , 4% , 0% 0%
Notice the 'Permission denied' and 'Not enough space' errors.
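For what it's worth, those lines have the shape that perror() produces: a caller-supplied label, a colon, then the system's text for the current errno. On Solaris, 'Permission denied' is the text for EACCES and 'Not enough space' is the text for ENOMEM (other systems word ENOMEM differently). Here is a minimal sketch of that formatting, assuming (this is only a guess) that 'cpustuff' is a label handed to perror() or something similar. format_errno is a hypothetical helper written for illustration and is not part of ganglia:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build "label: message" for a given errno value,
 * which is the same layout perror(label) writes to stderr.
 * The message text comes from the C library via strerror(), so it is
 * system-specific and will not appear in an application's own sources.
 */
static void format_errno(char *buf, size_t len, const char *label, int err)
{
    assert(buf != NULL && len > 0);
    snprintf(buf, len, "%s: %s", label, strerror(err));
}
```

Calling format_errno(buf, sizeof buf, "cpustuff", EACCES) and then printing buf gives a line of the same form as the ones in the debug output, which would explain why a grep through the source tree for the message text finds nothing.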
I can't seem to find either of those errors in the ganglia source tree,
so they are apparently system errors. If I run the process under truss
(by attaching a truss to the process after it's started successfully),
the host is marked down within about 2 minutes, and I get the following
in the output:
recvfrom(1, "\0\0\01D D 8 8B0", 1472, 0, 0xDF605958, 0xDF604DD4) = 8
lwp_sema_post(0xDF504E6C) = 0
lwp_sema_wait(0xDF504E6C) = 0
gettimeofday(0xDF605378) = 0
pause() (sleeping...)
signotifywait() (sleeping...)
lwp_sema_wait(0xDF504E6C) (sleeping...)
accept(2, 0xDF302C4C, 0xDF302C44, 1) (sleeping...)
pread64(3, 0xDF100A08, 712, 0xFEC09B2C) (sleeping...)
lwp_cond_wait(0xDF8E95E8, 0xDF8E95D0, 0xDEC07D78) (sleeping...)
recvfrom(1, 0xDF605968, 1472, 0, 0xDF605958, 0xDF604DD4) (sleeping...)
lwp_cond_wait(0xDF8EFE10, 0xDF8EFE20, 0xDF8E9640) (sleeping...)
recvfrom(1, "\0\0\0\f >86 002", 1472, 0, 0xDF605958, 0xDF604DD4) = 8
lwp_sema_post(0xDF504E6C) = 0
lwp_sema_wait(0xDF504E6C) = 0
gettimeofday(0xDF605378) = 0
pause() (sleeping...)
signotifywait() (sleeping...)
lwp_sema_wait(0xDF504E6C) (sleeping...)
accept(2, 0xDF302C4C, 0xDF302C44, 1) (sleeping...)
pread64(3, 0xDF100A08, 712, 0xFEC09B2C) (sleeping...)
lwp_cond_wait(0xDF8E95E8, 0xDF8E95D0, 0xDEC07D78) (sleeping...)
recvfrom(1, 0xDF605968, 1472, 0, 0xDF605958, 0xDF604DD4) (sleeping...)
lwp_cond_wait(0xDF8EFE10, 0xDF8EFE20, 0xDF8E9640) (sleeping...)
recvfrom(1, "\0\0\01A AA3B5 !", 1472, 0, 0xDF605958, 0xDF604DD4) = 8
lwp_sema_post(0xDF504E6C) = 0
lwp_sema_wait(0xDF504E6C) = 0
gettimeofday(0xDF604D10) = 0
pause() (sleeping...)
signotifywait() (sleeping...)
lwp_sema_wait(0xDF504E6C) (sleeping...)
accept(2, 0xDF302C4C, 0xDF302C44, 1) (sleeping...)
pread64(3, 0xDF100A08, 712, 0xFEC09B2C) (sleeping...)
lwp_cond_wait(0xDF8E95E8, 0xDF8E95D0, 0xDEC07D78) (sleeping...)
recvfrom(1, 0xDF605968, 1472, 0, 0xDF605958, 0xDF604DD4) (sleeping...)
lwp_cond_wait(0xDF8EFE10, 0xDF8EFE20, 0xDF8E9640) (sleeping...)
It doesn't seem particularly useful, but hey, there is a '!' character
in there... :)
This isn't exactly life-threatening, since I am only testing at this
point and it still works fine in minimal debugging mode, but this seems
pretty weird.
The process is running as the 'nobody' user, and everything was compiled
with '--prefix=/usr/local --with-metad' using gcc 3.3.2.
Anyone have any ideas?