I don't use gstat. It was broken when I first dared to attempt my Solaris port and I never bothered to fix it once I realized that it was basically a command-line app that connects to localhost:8649 and parses the output into an ASCII table.
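
For anyone who wants the same information without gstat, here is a minimal Python sketch of the idea described above: connect to the gmond port, read the XML, and print a table. It is not gstat's actual code. The function names `fetch_xml` and `host_table` are hypothetical, and the `HOST` element with `NAME`, `IP` and `REPORTED` attributes is assumed from gmond's usual XML output rather than taken from this thread.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_xml(host="localhost", port=8649, timeout=5.0):
    """Read everything gmond writes until it closes the connection."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(4096)
            if not data:  # empty read: gmond closed its end
                break
            chunks.append(data)
    return b"".join(chunks)


def host_table(xml_bytes):
    """Return a header line plus one formatted line per HOST element.

    Assumes HOST elements carry NAME, IP and REPORTED attributes;
    missing attributes are shown as '?'.
    """
    root = ET.fromstring(xml_bytes)
    row = "%-30s %-16s %s"
    lines = [row % ("HOST", "IP", "REPORTED")]
    for host in root.iter("HOST"):
        lines.append(row % (host.get("NAME", "?"),
                            host.get("IP", "?"),
                            host.get("REPORTED", "?")))
    return lines
```

Something like `print("\n".join(host_table(fetch_xml("solaris_host"))))` would then print the table, with `solaris_host` standing in for whatever machine runs gmond.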

At first I thought the problem was that *gmetric* wasn't working ... whew!

Does the debug value mean anything, Matt? In all the debug_msg's that I've written, none of them seem to support verbosity weights, service masks, or what-have-you. Just wondering if I'm gonna have to go back and change stuff if this is on the to-do list. :)

I would *guess* that gstat is attempting to use the Solaris socket library in a Linux manner and the result may not be kosher (I halfheartedly tossed a binary together and unleashed it on a fileserver):

> sol.gstat
tcp_connect() setsockopt() TCP_NODELAY error: Invalid argument
Unable to get hostlist from 127.0.0.1 8649!

Looks like gmond's still up.
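
If the setsockopt() failure is the only thing stopping gstat, one option is to treat TCP_NODELAY as optional and carry on when it is refused. The sketch below shows that pattern in Python as a stand-in for the C code; it borrows the `tcp_connect` name from the error message above but is not the Ganglia function, and I have not checked whether the real one aborts at this point.

```python
import socket


def tcp_connect(host, port, timeout=5.0):
    """Connect to host:port, treating TCP_NODELAY as a best-effort option."""
    sock = socket.create_connection((host, port), timeout=timeout)
    try:
        # Disable Nagle's algorithm if the platform allows it.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    except OSError as err:
        # The option is only a latency tweak, so warn and keep the socket.
        print("warning: could not set TCP_NODELAY:", err)
    return sock
```

The connection is still usable without the option; the only cost is that small writes may be delayed slightly.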

As a workaround you could hack a Linux gstat into connecting to solaris_host:8649 - it would be interesting if that worked.

That's all I got, though. :)

matt massie wrote:
james-

you are running on solaris right? i haven't heard of this problem before on solaris. steve wagner is our resident ganglia solaris guru so he might know something that i don't. unless we hear from steve, i'll add it to the bug list and try to determine exactly what's going on here.

again... thanks so much for the feedback!
-matt

Today, James Braid wrote forth saying...


that is the same problem. heartbeat messages are sent every 15 seconds, so if a machine doesn't get a heartbeat message from a host in 60 seconds (4 missed heartbeats) it assumes that host is down. if you use the latest CVS source you should see that the problem is no longer there. let me know otherwise.
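
The timeout rule above can be sketched in a few lines of Python. The constant and function names are hypothetical, and whether gmond treats exactly 60 seconds as "down" or only anything longer is an assumption here (the sketch uses "longer than").

```python
HEARTBEAT_INTERVAL = 15   # seconds between heartbeat messages
MISSED_LIMIT = 4          # missed heartbeats before a host is presumed down


def is_down(last_heartbeat, now):
    """Return True if no heartbeat has arrived within the timeout window.

    Both arguments are Unix timestamps in seconds.
    """
    timeout = HEARTBEAT_INTERVAL * MISSED_LIMIT  # 60 seconds
    return (now - last_heartbeat) > timeout
```

With a last heartbeat at 1028682333, a check 30 seconds later reports the host as up and a check 61 seconds later reports it as down.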

Just checked out the latest CVS (at 13:06 NZST). It looks like that
problem is fixed. The REPORTED value looks like it's incrementing as it
should... sweet!

BUT

Gstat doesn't like talking to gmond. This happens whether I try to
connect using gstat on the local machine or remotely.

The following is from gmond running with debug = 6

<snip>
4 pre_process_node() remote_ip=10.0.1.130
pre_process_node() HOSTNAME =tycho.peace.co.nz
pre_process_node() TIMESTAMP=1028682333
pre_process_node() HASHP    =100128bc0
pre_process_node() USER_HASHP=1001290e0
pre_process_node() returning the ganglia internal hash pointer 100128bc0
mcast_listen_thread() got internal hash 100128bc0
mcast_listen_thread() built metricdata struct
mcast_listen_thread() attempting to hash_insert_data
mcast_listen_thread() inserted data into 100128bc0
server_thread() 6 clientfd = 11

sent data to host 127.0.0.1
Broken Pipe
</snip>

However, telnetting to the port works fine (and I can do this as much as
I like and gmond stays alive, just running gstat kills it straight
away):

lilo:~# telnet tycho 8649
Trying 10.0.1.130...
Connected to tycho.peace.co.nz.
Escape character is '^]'.
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<!DOCTYPE GANGLIA_XML [
  <!ELEMENT GANGLIA_XML (CLUSTER)+>
  <!ATTLIST GANGLIA_XML VERSION CDATA #REQUIRED
                        SOURCE  CDATA #REQUIRED>

Etc, etc, etc (all the stats are there)...
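
One possible reading of the trace above, and it is only a guess, is that gstat drops the connection early and gmond then dies while writing to a socket whose peer has gone away. The Python sketch below shows the general defensive pattern of treating a vanished client as a per-connection error rather than a fatal one. It is not gmond's code, and `send_report` is a hypothetical name. Python ignores SIGPIPE by default, so the failure surfaces as an exception; a C server would need to ignore SIGPIPE itself to get an EPIPE error return instead of being killed.

```python
import socket


def send_report(client, payload):
    """Send payload to one client; return False if the peer has gone away.

    The client socket is always closed afterwards, so a failed send
    affects only this connection and the server can keep running.
    """
    try:
        client.sendall(payload)
        return True
    except (BrokenPipeError, ConnectionResetError):
        # The client closed its end before we finished writing.
        return False
    finally:
        client.close()
```

A server loop would call this once per accepted connection and simply move on to the next client when it returns False.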

Any ideas about this one?

Thanks for the prompt and helpful answers

James



-------------------------------------------------------
This sf.net email is sponsored by:ThinkGeek
Welcome to geek heaven.
http://thinkgeek.com/sf
_______________________________________________
Ganglia-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-developers




