Hi all,
I'm running ganglia on a Debian node and a Solaris 8 x86 node to test
it, and I'm running into problems on the Solaris node. The Debian
node is currently the one running gmetad.
When I bring gmond up on the Solaris node, it works fine for a few
minutes, but very shortly it is reported as being down by the web
interface, even though it is very clearly not down and the process is
still running.
When I run the daemon with any level of debugging, this problem never
occurs, but I get the following output with -d 1:
cpustuff: Permission denied
cpustuff: Permission denied
cpustuff: Permission denied
cpustuff: Permission denied
cpustuff: Permission denied
cpustuff: Permission denied
cpustuff: Not enough space
cpustuff: Not enough space
I just keep getting 'Not enough space' forever after that, but the
daemon continues to function just fine.
If I run with -d 2 or higher, I get the following type of output
(this is multiple instances, pulled somewhat randomly):
set_metric_value() exec'd cpu_user_func (10)
cpustuff: Permission denied
offset = -20931796, cpu_now[1] = 346648
cpustuff: Permission denied
offset = -523107352, cpu_now[1] = 586301
Raw: bread / bwrite / lread / lwrite / phread / phwrite
110481,110481,0 / 372254,372241,13 / 5049919,5049897,22 /
938684,938649,35 / 34,34,0 / 0,0,0
Aftermath: 0.000000 0.812500 1.375000 2.187500 0.000000 0.000000
XDR data successfully sent
mcast_value() mcasting os_release value
encoded 12 XDR bytes
XDR data successfully sent
set_metric_value() exec'd cpu_user_func (10)
* * * * Setting alpha to 0.016667 and beta to 0.983333 because
timediff = 0
pre_process_node() received a new node
cpustuff: Permission denied
offset = -20931796, cpu_now[1] = 346731
cpustuff: Permission denied
offset = -523107352, cpu_now[1] = 586391
Raw: bread / bwrite / lread / lwrite / phread / phwrite
110481,0,110481 / 372291,0,372291 / 5049968,0,5049968 /
938746,0,938746 / 34,0,34 / 0,0,0
Aftermath: 0.000100 0.000338 0.004586 0.000852 0.000000 0.000000
delta = 1101251722
** ** ** ** ** Are percentages electric? Try -23%, -3% , 11% , 4% ,
0% 0%
mcast_value() mcasting cpu_user value
set_metric_value() exec'd cpu_user_func (10)
cpustuff: Permission denied
offset = -20931796, cpu_now[1] = 346752
cpustuff: Permission denied
offset = -523107352, cpu_now[1] = 586423
Raw: bread / bwrite / lread / lwrite / phread / phwrite
110481,110481,0 / 372317,372316,1 / 5050063,5050044,19 /
938796,938791,5 / 34,34,0 / 0,0,0
Aftermath: 0.000000 0.062500 1.187500 0.312500 0.000000 0.000000
delta = 16
** ** ** ** ** Are percentages electric? Try -13%, 6% , 10% , 4% ,
0% 0%
set_metric_value() exec'd bwrite_sec_func (31)
cpustuff: Not enough space
offset = -20931796, cpu_now[1] = 347152
cpustuff: Not enough space
offset = -523107352, cpu_now[1] = 587363
Raw: bread / bwrite / lread / lwrite / phread / phwrite
110500,110488,12 / 372398,372389,9 / 5050309,5050246,63 /
938981,938962,19 / 34,34,0 / 0,0,0
Aftermath: 0.750000 0.562500 3.937500 1.187500 0.000000 0.000000
delta = 16
** ** ** ** ** Are percentages electric? Try -13%, 6% , 10% , 4% ,
0% 0%
set_metric_value() exec'd cpu_idle_func (13)
set_metric_value() exec'd lwrite_sec_func (33)
set_metric_value() exec'd cpu_user_func (10)
cpustuff: Not enough space
offset = -20931796, cpu_now[1] = 347389
cpustuff: Not enough space
offset = -523107352, cpu_now[1] = 588141
Raw: bread / bwrite / lread / lwrite / phread / phwrite
110509,110509,0 / 372715,372700,15 / 5051982,5051912,70 /
939971,939912,59 / 34,34,0 / 0,0,0
Aftermath: 0.000000 0.937500 4.375000 3.687500 0.000000 0.000000
delta = 16
** ** ** ** ** Are percentages electric? Try -13%, 6% , 10% , 4% ,
0% 0%
Notice the 'Permission denied' and 'Not enough space' errors.
I can't seem to find either of those errors in the ganglia source
tree, so they are apparently system errors. If I run the process
under truss (by attaching a truss to the process after it's started
successfully), the host is marked down within about 2 minutes, and I
get the following in the output:
recvfrom(1, "\0\0\01D D 8 8B0", 1472, 0, 0xDF605958, 0xDF604DD4) = 8
lwp_sema_post(0xDF504E6C) = 0
lwp_sema_wait(0xDF504E6C) = 0
gettimeofday(0xDF605378) = 0
pause() (sleeping...)
signotifywait() (sleeping...)
lwp_sema_wait(0xDF504E6C) (sleeping...)
accept(2, 0xDF302C4C, 0xDF302C44, 1) (sleeping...)
pread64(3, 0xDF100A08, 712, 0xFEC09B2C) (sleeping...)
lwp_cond_wait(0xDF8E95E8, 0xDF8E95D0, 0xDEC07D78) (sleeping...)
recvfrom(1, 0xDF605968, 1472, 0, 0xDF605958, 0xDF604DD4) (sleeping...)
lwp_cond_wait(0xDF8EFE10, 0xDF8EFE20, 0xDF8E9640) (sleeping...)
recvfrom(1, "\0\0\0\f >86 002", 1472, 0, 0xDF605958, 0xDF604DD4) = 8
lwp_sema_post(0xDF504E6C) = 0
lwp_sema_wait(0xDF504E6C) = 0
gettimeofday(0xDF605378) = 0
pause() (sleeping...)
signotifywait() (sleeping...)
lwp_sema_wait(0xDF504E6C) (sleeping...)
accept(2, 0xDF302C4C, 0xDF302C44, 1) (sleeping...)
pread64(3, 0xDF100A08, 712, 0xFEC09B2C) (sleeping...)
lwp_cond_wait(0xDF8E95E8, 0xDF8E95D0, 0xDEC07D78) (sleeping...)
recvfrom(1, 0xDF605968, 1472, 0, 0xDF605958, 0xDF604DD4) (sleeping...)
lwp_cond_wait(0xDF8EFE10, 0xDF8EFE20, 0xDF8E9640) (sleeping...)
recvfrom(1, "\0\0\01A AA3B5 !", 1472, 0, 0xDF605958, 0xDF604DD4) = 8
lwp_sema_post(0xDF504E6C) = 0
lwp_sema_wait(0xDF504E6C) = 0
gettimeofday(0xDF604D10) = 0
pause() (sleeping...)
signotifywait() (sleeping...)
lwp_sema_wait(0xDF504E6C) (sleeping...)
accept(2, 0xDF302C4C, 0xDF302C44, 1) (sleeping...)
pread64(3, 0xDF100A08, 712, 0xFEC09B2C) (sleeping...)
lwp_cond_wait(0xDF8E95E8, 0xDF8E95D0, 0xDEC07D78) (sleeping...)
recvfrom(1, 0xDF605968, 1472, 0, 0xDF605958, 0xDF604DD4) (sleeping...)
lwp_cond_wait(0xDF8EFE10, 0xDF8EFE20, 0xDF8E9640) (sleeping...)
It doesn't seem particularly useful, but hey, there is a '!'
character in there... :)
This isn't exactly life-threatening, since I am only testing at this
point and it still works fine in minimal debugging mode, but this
seems pretty weird.
The process is running as the 'nobody' user, and everything was
compiled with '--prefix=/usr/local --with-metad' using gcc 3.3.2.
Anyone have any ideas?
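
For what it's worth, the "cpustuff: ..." lines look like what C's
perror("cpustuff") would print, i.e. a tag followed by the system's text
for the current errno. That is an assumption on my part, since I haven't
located the call in the source. Here is a minimal sketch in Python of how
such a line is built; perror_line is a name I made up for illustration.
As far as I know, Solaris words ENOMEM as "Not enough space", while Linux
prints "Cannot allocate memory", so the second line of output depends on
the platform it is run on.

```python
import errno
import os


def perror_line(tag, err):
    """Build the line C's perror(tag) would print for the errno value err."""
    return f"{tag}: {os.strerror(err)}"


if __name__ == "__main__":
    # "cpustuff" is the tag seen in the debug output; that gmond prints it
    # through perror() is an assumption, not confirmed from the source.
    for err in (errno.EACCES, errno.ENOMEM):
        print(errno.errorcode[err], "->", perror_line("cpustuff", err))
```

If that reading is right, the two messages would correspond to EACCES and
ENOMEM coming back from some system or library call made while the CPU
metrics are collected.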