Hi There.

Just to follow up on Steven's comments: the Solaris kvm dependency was removed some time ago in favor of the kstat interface, which does not require root privileges. It should be in the final version of 2.5.8, and I'm eagerly awaiting its arrival.

------
Yemi

On Nov 23, 2004, at 4:15 PM, steven wagner wrote:

Luke A. Kanies wrote:
Hi all,
I'm running ganglia on a Debian node and a Solaris 8 x86 node to test it, and I'm running into problems on the Solaris node. The Debian node is currently the one running gmetad. When I bring gconfd up on the Solaris node, it works fine for a few minutes, but very shortly it is reported as being down by the web interface, even though it is very clearly not down and the process is still running. When I run the daemon with any kind of debugging, this problem never occurs, but I get the following output in -d 1:
cpustuff: Permission denied
cpustuff: Permission denied
cpustuff: Permission denied
cpustuff: Permission denied
cpustuff: Permission denied
cpustuff: Permission denied
cpustuff: Not enough space
cpustuff: Not enough space
I just keep getting 'Not enough space' forever after that, but the daemon continues to function just fine. If I run it with -d 2 or higher, I get the following type of output (this is multiple instances, pulled somewhat randomly):
set_metric_value() exec'd cpu_user_func (10)
cpustuff: Permission denied
offset = -20931796, cpu_now[1] = 346648
cpustuff: Permission denied
offset = -523107352, cpu_now[1] = 586301
Raw:  bread / bwrite / lread / lwrite / phread / phwrite
110481,110481,0 / 372254,372241,13 / 5049919,5049897,22 / 938684,938649,35 / 34,34,0 / 0,0,0
Aftermath: 0.000000 0.812500 1.375000 2.187500 0.000000 0.000000
XDR data successfully sent
mcast_value() mcasting os_release value
encoded 12 XDR bytes
XDR data successfully sent
set_metric_value() exec'd cpu_user_func (10)
* * * * Setting alpha to 0.016667 and beta to 0.983333 because timediff = 0
pre_process_node() received a new node
cpustuff: Permission denied
offset = -20931796, cpu_now[1] = 346731
cpustuff: Permission denied
offset = -523107352, cpu_now[1] = 586391
Raw:  bread / bwrite / lread / lwrite / phread / phwrite
110481,0,110481 / 372291,0,372291 / 5049968,0,5049968 / 938746,0,938746 / 34,0,34 / 0,0,0
Aftermath: 0.000100 0.000338 0.004586 0.000852 0.000000 0.000000
delta = 1101251722
** ** ** ** ** Are percentages electric? Try -23%, -3% , 11% , 4% , 0% 0%
mcast_value() mcasting cpu_user value
set_metric_value() exec'd cpu_user_func (10)
cpustuff: Permission denied
offset = -20931796, cpu_now[1] = 346752
cpustuff: Permission denied
offset = -523107352, cpu_now[1] = 586423
Raw:  bread / bwrite / lread / lwrite / phread / phwrite
110481,110481,0 / 372317,372316,1 / 5050063,5050044,19 / 938796,938791,5 / 34,34,0 / 0,0,0
Aftermath: 0.000000 0.062500 1.187500 0.312500 0.000000 0.000000
delta = 16
** ** ** ** ** Are percentages electric? Try -13%, 6% , 10% , 4% , 0% 0%
set_metric_value() exec'd bwrite_sec_func (31)
cpustuff: Not enough space
offset = -20931796, cpu_now[1] = 347152
cpustuff: Not enough space
offset = -523107352, cpu_now[1] = 587363
Raw:  bread / bwrite / lread / lwrite / phread / phwrite
110500,110488,12 / 372398,372389,9 / 5050309,5050246,63 / 938981,938962,19 / 34,34,0 / 0,0,0
Aftermath: 0.750000 0.562500 3.937500 1.187500 0.000000 0.000000
delta = 16
** ** ** ** ** Are percentages electric? Try -13%, 6% , 10% , 4% , 0% 0%
set_metric_value() exec'd cpu_idle_func (13)
set_metric_value() exec'd lwrite_sec_func (33)
set_metric_value() exec'd cpu_user_func (10)
cpustuff: Not enough space
offset = -20931796, cpu_now[1] = 347389
cpustuff: Not enough space
offset = -523107352, cpu_now[1] = 588141
Raw:  bread / bwrite / lread / lwrite / phread / phwrite
110509,110509,0 / 372715,372700,15 / 5051982,5051912,70 / 939971,939912,59 / 34,34,0 / 0,0,0
Aftermath: 0.000000 0.937500 4.375000 3.687500 0.000000 0.000000
delta = 16
** ** ** ** ** Are percentages electric? Try -13%, 6% , 10% , 4% , 0% 0%
Notice the 'Permission denied' and 'Not enough space' errors.
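A side note on those two strings: the "cpustuff: <text>" shape looks like what C's perror() prints, that is, a caller-supplied prefix followed by the system's text for the current errno. That is an assumption about how the message is produced, not something confirmed from the ganglia source. Under that assumption, a small sketch like the one below can map a message back to its errno name on whatever platform it runs on. The helper names (perror_style, errno_names_for) are made up for illustration. As far as I know, Solaris uses "Not enough space" as its text for ENOMEM, while Linux words that error differently, so the second lookup may come back empty anywhere but Solaris.

```python
import errno
import os


def perror_style(prefix, err):
    """Format a message the way C's perror() does: '<prefix>: <strerror(err)>'."""
    return f"{prefix}: {os.strerror(err)}"


def errno_names_for(message):
    """Return the errno names whose message text on THIS platform equals `message`.

    The text for a given errno differs between operating systems, so an
    empty list only means "no match here", not "not a system error".
    """
    return sorted(
        name
        for code, name in errno.errorcode.items()
        if os.strerror(code) == message
    )


if __name__ == "__main__":
    # The two strings seen in the debug output above.
    for text in ("Permission denied", "Not enough space"):
        print(f"{text!r} -> {errno_names_for(text)}")

    # Rebuild a perror()-style line using the prefix from the output above.
    print(perror_style("cpustuff", errno.EACCES))
```

Run on the affected Solaris host, this should show which errno values are behind the two messages, which narrows down which system call is failing before and after the switch from 'Permission denied' to 'Not enough space'.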
I can't seem to find either of those errors in the ganglia source tree, so they are apparently system errors. If I run the process under truss (by attaching a truss to the process after it's started successfully), the host is marked down within about 2 minutes, and I get the following in the output:
recvfrom(1, "\0\0\01D D 8 8B0", 1472, 0, 0xDF605958, 0xDF604DD4) = 8
lwp_sema_post(0xDF504E6C)                       = 0
lwp_sema_wait(0xDF504E6C)                       = 0
gettimeofday(0xDF605378)                        = 0
pause()                         (sleeping...)
signotifywait()                 (sleeping...)
lwp_sema_wait(0xDF504E6C)       (sleeping...)
accept(2, 0xDF302C4C, 0xDF302C44, 1) (sleeping...)
pread64(3, 0xDF100A08, 712, 0xFEC09B2C) (sleeping...)
lwp_cond_wait(0xDF8E95E8, 0xDF8E95D0, 0xDEC07D78) (sleeping...)
recvfrom(1, 0xDF605968, 1472, 0, 0xDF605958, 0xDF604DD4) (sleeping...)
lwp_cond_wait(0xDF8EFE10, 0xDF8EFE20, 0xDF8E9640) (sleeping...)
recvfrom(1, "\0\0\0\f >86 002", 1472, 0, 0xDF605958, 0xDF604DD4) = 8
lwp_sema_post(0xDF504E6C)                       = 0
lwp_sema_wait(0xDF504E6C)                       = 0
gettimeofday(0xDF605378)                        = 0
pause()                         (sleeping...)
signotifywait()                 (sleeping...)
lwp_sema_wait(0xDF504E6C)       (sleeping...)
accept(2, 0xDF302C4C, 0xDF302C44, 1) (sleeping...)
pread64(3, 0xDF100A08, 712, 0xFEC09B2C) (sleeping...)
lwp_cond_wait(0xDF8E95E8, 0xDF8E95D0, 0xDEC07D78) (sleeping...)
recvfrom(1, 0xDF605968, 1472, 0, 0xDF605958, 0xDF604DD4) (sleeping...)
lwp_cond_wait(0xDF8EFE10, 0xDF8EFE20, 0xDF8E9640) (sleeping...)
recvfrom(1, "\0\0\01A AA3B5 !", 1472, 0, 0xDF605958, 0xDF604DD4) = 8
lwp_sema_post(0xDF504E6C)                       = 0
lwp_sema_wait(0xDF504E6C)                       = 0
gettimeofday(0xDF604D10)                        = 0
pause()                         (sleeping...)
signotifywait()                 (sleeping...)
lwp_sema_wait(0xDF504E6C)       (sleeping...)
accept(2, 0xDF302C4C, 0xDF302C44, 1) (sleeping...)
pread64(3, 0xDF100A08, 712, 0xFEC09B2C) (sleeping...)
lwp_cond_wait(0xDF8E95E8, 0xDF8E95D0, 0xDEC07D78) (sleeping...)
recvfrom(1, 0xDF605968, 1472, 0, 0xDF605958, 0xDF604DD4) (sleeping...)
lwp_cond_wait(0xDF8EFE10, 0xDF8EFE20, 0xDF8E9640) (sleeping...)
It doesn't seem particularly useful, but hey, there is a '!' character in there... :) This isn't exactly life-threatening, since I am only testing at this point and it still works fine in minimal debugging mode, but this seems pretty weird. The process is running as the 'nobody' user, and everything was compiled with '--prefix=/usr/local --with-metad' using gcc 3.3.2.
Anyone have any ideas?

[delurk]

*waves cane around in what he hopes is a menacing gesture, but looks really pathetic*

Back when I wrote that terrible chunk of code for Solaris SPARC, it needed to be run as root. I'm pretty sure other people - people more accustomed to developing in Solaris than I - have gone over the code since then and achieved at least a partial move away from the kvm_read calls, but I think the root dependency may still be there.

I'm pretty sure I put in some warnings early on in high-debuglevel for the monitoring core - if it can't open the kernel symbol table, it starts screeching about how it's probably going to crash in a few seconds.

Once again, I hope somebody fixed that, too.

I'd care more, but I don't have to worry about Alphas or SPARC boxes anymore. Yay! :)

[back to lurking...]



_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general

