Re: [Ganglia-general] problem with SummaryInfo

2005-11-07 Thread Martin Knoblauch
Hi Branimir,

 apparently Rick pushed you into the right direction already :-) Just a
few comments

Martin

--- Branimir Ackovic [EMAIL PROTECTED] wrote:

 
 Thank you, Rick and Martin, for the quick response!
 
 I already tried the configuration that Rick suggested, but it doesn't
 work. In that configuration I see only one node per data_source
 (the last one). One week ago, Michael Chang helped me to solve the
 problem with this configuration:
 
 data_source AEGIS01-PHY-SCL1 147.91.83.201
 data_source AEGIS01-PHY-SCL2 147.91.83.202
 data_source AEGIS01-PHY-SCL3 147.91.83.203
 ...
 
 If I understand correctly, Martin suggests that I need two machines
 with gmetad (one for each data_source). Now I have gmetad only on the
 server with the web frontend (se.phy.bg.ac.yu).


 That is totally fine. You only need one gmetad running. Your problem
was that the nodes *within* your two *clusters* did not communicate
correctly. Michael Chang's setup allowed you to query each node
individually, but you lost the cluster concept that way.

 It is true that the machines in the two groups do not see each
 other, even within the same group. I tried:
 
 [EMAIL PROTECTED] root]# telnet localhost 8649 | grep grid
 Connection closed by foreign host.
 [EMAIL PROTECTED] root]#  
 
 Both machines, ce and grid, are in the same data_source with the same
 gmond.conf files. As you said, Martin, I found the problem, but
 I have not found a solution for it. :(


 That was the most important step :-). Your gmond.conf files look
like a multicast setup, but apparently something went wrong. Possible
causes:

- no route for the multicast IP
- your switch does not like IGMP
- also, both of your clusters were talking on the same port. This can
be a problem with multicast.

 So, going unicast is the right way to go in my opinion. Advantages
are:

- your networking infrastructure will not screw you up
- less network traffic. In a working multicast network you will have
N*N messages going around. In a large cluster that can be a lot of
traffic just for Ganglia.
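
As a rough illustration of the N*N figure, here is a minimal sketch that counts packet deliveries per reporting round. It assumes every node emits one report per round, that multicast delivers each report to all N nodes, and that unicast delivers it only to a fixed number of head nodes; the function name and the example sizes are made up for illustration.

```python
def deliveries_per_round(nodes, receivers=None):
    """Count packet deliveries in one round where each node reports once.

    receivers=None models multicast (every node receives every report);
    otherwise only `receivers` head nodes receive each report.
    """
    if receivers is None:
        receivers = nodes
    return nodes * receivers


# Compare multicast against unicast with two head nodes.
for n in (12, 100, 1000):
    print(n, deliveries_per_round(n), deliveries_per_round(n, receivers=2))
```

Real traffic also depends on how many metrics each gmond sends and how often, so this only shows how the two modes scale.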

Cheers
Martin


--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de



[Ganglia-general] problem with SummaryInfo

2005-11-04 Thread Branimir Ackovic
Hi,

I configured Ganglia 3.0.1 to monitor a Grid site with 4 servers and 8 nodes. I 
put them in two groups: AEGIS01-PHY-SCL Core Services and AEGIS01-PHY-SCL.
There is a problem with the summary report: I see only one node in each of these 
sources. I also have a problem with the grid summary, because it uses the source summaries.

You can see it at:
http://se.phy.bg.ac.yu/site/ganglia

How can I configure Ganglia to show everything properly?

All servers have in /etc/gmond.conf:

cluster {
  name = AEGIS01-PHY-SCL Core Services
}

and all nodes have in /etc/gmond.conf:
cluster {
  name = AEGIS01-PHY-SCL
}

There is gmetad and web frontend on one of servers 
(se.phy.bg.ac.yu/site/ganglia). In /etc/gmetad.conf I put:

data_source AEGIS01-PHY-SCL Core Services1 ce.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL Core Services2 se.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL Core Services3 grid.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL Core Services4 rb.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL1 wn01.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL2 wn02.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL3 wn03.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL4 wn04.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL5 wn05.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL6 wn06.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL7 wn07.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL8 wn08.phy.bg.ac.yu

and 

gridname AEGIS01 PHY SCL

-
Branimir Ackovic
E-mail: [EMAIL PROTECTED]
Web: http://scl.phy.bg.ac.yu/

Phone: +381 11 3160260, Ext. 152
Fax: +381 11 3162190

Scientific Computing Laboratory
Institute of Physics, Belgrade
Serbia and Montenegro
-



Re: [Ganglia-general] problem with SummaryInfo

2005-11-04 Thread Martin Knoblauch
Hi Branimir,

 those servers look great. What are they? :-)

 Anyway, could you please post the two different gmond.conf files and
the gmetad.conf file?

 I have the impression that the machines in the two groups do not see
each other. At least one machine in each group should see the metrics
of its partner machines. In gmetad.conf you would use that machine as
data source. Basically you should only have two data sources in your
gmetad.conf

 Simple test: log into one of the servers and run "telnet localhost
<gmond-port>" (8649 by default). It should show you the data of all
hosts in that group (grep for HOST NAME). If it only shows its own
data, you have found the problem.
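
The same check can be scripted. The sketch below is one possible way to do it, not part of Ganglia: it assumes gmond writes its whole XML dump to the TCP port and then closes the connection (which is what the telnet test relies on), and that hosts appear as HOST elements with a NAME attribute. The function names are hypothetical.

```python
import socket
import xml.etree.ElementTree as ET


def read_gmond_xml(host="localhost", port=8649, timeout=5.0):
    """Read everything gmond sends on its TCP port until it closes the socket."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def host_names(xml_data):
    """Return the sorted NAME attributes of all HOST elements in the dump."""
    root = ET.fromstring(xml_data)
    return sorted(host.get("NAME") for host in root.iter("HOST"))
```

Calling host_names(read_gmond_xml("localhost", 8649)) on a machine in the group should list every host of that group. A list with only the local machine points to the communication problem described above.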

Cheers
Martin

 


--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de



Re: [Ganglia-general] problem with SummaryInfo

2005-11-04 Thread Rick Mohr

On Fri, 4 Nov 2005, Branimir Ackovic wrote:


All servers have in /etc/gmond.conf:

cluster {
 name = AEGIS01-PHY-SCL Core Services
}

and all nodes have in /etc/gmond.conf:
cluster {
 name = AEGIS01-PHY-SCL
}

There is gmetad and web frontend on one of servers
(se.phy.bg.ac.yu/site/ganglia). In /etc/gmetad.conf I put:

data_source AEGIS01-PHY-SCL Core Services1 ce.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL Core Services2 se.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL Core Services3 grid.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL Core Services4 rb.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL1 wn01.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL2 wn02.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL3 wn03.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL4 wn04.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL5 wn05.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL6 wn06.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL7 wn07.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL8 wn08.phy.bg.ac.yu

and

gridname AEGIS01 PHY SCL


I believe the problem stems from the fact that the cluster names used on 
the data_source lines do not match the names defined in the gmond.conf 
files.  You may want to try something like this:


data_source "AEGIS01-PHY-SCL Core Services" ce.phy.bg.ac.yu \
se.phy.bg.ac.yu grid.phy.bg.ac.yu rb.phy.bg.ac.yu
data_source "AEGIS01-PHY-SCL" wn01.phy.bg.ac.yu wn02.phy.bg.ac.yu ...
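
To see how such a line breaks down into one source name and a list of hosts, here is a minimal sketch. It assumes a name containing spaces is written in double quotes and that the optional polling interval is a bare integer; parse_data_source is a hypothetical helper, not something Ganglia provides, and it handles a single line without backslash continuation.

```python
import shlex


def parse_data_source(line):
    """Split one data_source line into (name, interval, hosts)."""
    tokens = shlex.split(line, comments=True)
    if len(tokens) < 3 or tokens[0] != "data_source":
        raise ValueError("not a data_source line: %r" % line)
    name, rest = tokens[1], tokens[2:]
    interval = None
    # An all-digit token right after the name is taken as the polling interval.
    if rest[0].isdigit():
        interval, rest = int(rest[0]), rest[1:]
    return name, interval, rest


print(parse_data_source(
    'data_source "AEGIS01-PHY-SCL Core Services" ce.phy.bg.ac.yu se.phy.bg.ac.yu'
))
```

Without the quotes, the same split would treat "Core" and "Services" as host names, which is one way a multi-word source name can go wrong.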

-- Rick

--
Rick Mohr
Systems Developer
Ohio Supercomputer Center





Re: [Ganglia-general] problem with SummaryInfo

2005-11-04 Thread Branimir Ackovic

Thank you, Rick and Martin, for the quick response!

I already tried the configuration that Rick suggested, but it doesn't work. In that 
configuration I see only one node per data_source (the last one). One week 
ago, Michael Chang helped me to solve the problem with this configuration:

data_source AEGIS01-PHY-SCL1 147.91.83.201
data_source AEGIS01-PHY-SCL2 147.91.83.202
data_source AEGIS01-PHY-SCL3 147.91.83.203
...

If I understand correctly, Martin suggests that I need two machines with gmetad (one for 
each data_source). Now I have gmetad only on the server with the web frontend 
(se.phy.bg.ac.yu). 

It is true that the machines in the two groups do not see each other, even within 
the same group. I tried:

[EMAIL PROTECTED] root]# telnet localhost 8649 | grep grid
Connection closed by foreign host.
[EMAIL PROTECTED] root]#  

Both machines, ce and grid, are in the same data_source with the same gmond.conf 
files. As you said, Martin, I found the problem, but I have not found a solution 
for it. :(

You can find my gmond and gmetad conf files in the attachment.

-
Branimir Ackovic
E-mail: [EMAIL PROTECTED]
Web: http://scl.phy.bg.ac.yu/

Phone: +381 11 3160260, Ext. 152
Fax: +381 11 3162190

Scientific Computing Laboratory
Institute of Physics, Belgrade
Serbia and Montenegro
-



# This is an example of a Ganglia Meta Daemon configuration file
#http://ganglia.sourceforge.net/
#
# $Id: gmetad.conf,v 1.17 2005/03/15 18:15:05 massie Exp $
#
#---
# Setting the debug_level to 1 will keep daemon in the foreground and
# show only error messages. Setting this value higher than 1 will make 
# gmetad output debugging information and stay in the foreground.
# default: 0
# debug_level 10
#
#---
# What to monitor. The most important section of this file. 
#
# The data_source tag specifies either a cluster or a grid to
# monitor. If we detect the source is a cluster, we will maintain a complete
# set of RRD databases for it, which can be used to create historical 
# graphs of the metrics. If the source is a grid (it comes from another gmetad),
# we will only maintain summary RRDs for it.
#
# Format: 
# data_source "my cluster" [polling interval] address1:port address2:port ...
# 
# The keyword 'data_source' must immediately be followed by a unique
# string which identifies the source, then an optional polling interval in 
# seconds. The source will be polled at this interval on average. 
# If the polling interval is omitted, 15sec is assumed. 
#
# A list of machines which service the data source follows, in the 
# format ip:port, or name:port. If a port is not specified then 8649
# (the default gmond port) is assumed.
# default: There is no default value
#
# data_source "my cluster" 10 localhost  my.machine.edu:8649  1.2.3.5:8655
# data_source "my grid" 50 1.3.4.7:8655 grid.org:8651 grid-backup.org:8651
# data_source "another source" 1.3.4.7:8655  1.3.4.8

data_source "AEGIS01-PHY-SCL Core Services1" 147.91.83.217
data_source "AEGIS01-PHY-SCL Core Services2" 147.91.83.218
data_source "AEGIS01-PHY-SCL Core Services3" 147.91.83.219
data_source "AEGIS01-PHY-SCL Core Services4" 147.91.83.220
data_source AEGIS01-PHY-SCL1 147.91.83.201
data_source AEGIS01-PHY-SCL2 147.91.83.202
data_source AEGIS01-PHY-SCL3 147.91.83.203
data_source AEGIS01-PHY-SCL4 147.91.83.204
data_source AEGIS01-PHY-SCL5 147.91.83.205
data_source AEGIS01-PHY-SCL6 147.91.83.206
data_source AEGIS01-PHY-SCL7 147.91.83.207
data_source AEGIS01-PHY-SCL8 147.91.83.208


#
# Round-Robin Archives
# You can specify custom Round-Robin archives here (defaults are listed below)
#
# RRAs "RRA:AVERAGE:0.5:1:240" "RRA:AVERAGE:0.5:24:240" "RRA:AVERAGE:0.5:168:240" "RRA:AVERAGE:0.5:672:240" \
#      "RRA:AVERAGE:0.5:5760:370"
#

#
#---
# Scalability mode. If on, we summarize over 

Re: [Ganglia-general] problem with SummaryInfo

2005-11-04 Thread Ian Cunningham

Branimir,

It seems as though multicast is not working. I would recommend going to 
unicast.


For your clusters this would mean picking 2 head nodes per cluster:

in gmond.conf_AEGIS01-PHY-SCL:

cluster {
 name = AEGIS01-PHY-SCL
}
udp_send_channel {
 host = wn01.phy.bg.ac.yu
 port = 8649
}
udp_send_channel {
 host = wn02.phy.bg.ac.yu
 port = 8649
}
udp_recv_channel {
 port = 8649
}

in gmond.conf_AEGIS01-PHY-SCL_Core_Services:

cluster {
 name = "AEGIS01-PHY-SCL Core Services"
}
udp_send_channel {
 host = se.phy.bg.ac.yu
 port = 8649
}
udp_send_channel {
 host = rb.phy.bg.ac.yu
 port = 8649
}
udp_recv_channel {
 port = 8649
}

in gmetad.conf running on se.phy.bg.ac.yu:

data_source "AEGIS01-PHY-SCL Core Services" se.phy.bg.ac.yu   rb.phy.bg.ac.yu
data_source "AEGIS01-PHY-SCL" wn01.phy.bg.ac.yu   wn02.phy.bg.ac.yu
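
Since the two gmond.conf variants differ only in the cluster name and the head nodes, the sections shown above could also be generated. This is a minimal sketch under that assumption: unicast_gmond_conf is a hypothetical helper, it quotes the cluster name so that names with spaces stay one string, and it emits only the cluster, udp_send_channel and udp_recv_channel sections, leaving the rest of gmond.conf to the existing file.

```python
def unicast_gmond_conf(cluster_name, head_nodes, port=8649):
    """Render the cluster and UDP channel sections for a unicast setup."""
    lines = ["cluster {", '  name = "%s"' % cluster_name, "}"]
    # One send channel per head node, so every gmond reports to each of them.
    for host in head_nodes:
        lines += [
            "udp_send_channel {",
            "  host = %s" % host,
            "  port = %d" % port,
            "}",
        ]
    # A receive channel on the same port, as in the configuration above.
    lines += ["udp_recv_channel {", "  port = %d" % port, "}"]
    return "\n".join(lines) + "\n"


print(unicast_gmond_conf("AEGIS01-PHY-SCL",
                         ["wn01.phy.bg.ac.yu", "wn02.phy.bg.ac.yu"]))
```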


If you can get multicast working, good luck, but it can be hard. Unicast 
is easy. To test the setup I gave you, telnet to a head node (wn01, wn02, 
se, rb) on either cluster:
`telnet wn01.phy.bg.ac.yu 8649 | grep 'HOST NAME='`.


Do not forget to restart all the gmonds and the gmetad processes after 
making changes to your configuration files.


Good Luck,
Ian

 






Re: [Ganglia-general] problem with SummaryInfo

2005-11-04 Thread Branimir Ackovic
That's it! Thanks.

pozdrav
Acko

-
Branimir Ackovic
E-mail: [EMAIL PROTECTED]
Web: http://scl.phy.bg.ac.yu/

Phone: +381 11 3160260, Ext. 152
Fax: +381 11 3162190

Scientific Computing Laboratory
Institute of Physics, Belgrade
Serbia and Montenegro
-


On Friday 04 November 2005 20:03, you wrote:
 Branimir,

 It seems as though multicast is not working. I would recommend going to
 unicast.

 For your clusters this would mean picking 2 head nodes per cluster:

 in gmond.conf_AEGIS01-PHY-SCL:

 cluster {
   name = AEGIS01-PHY-SCL
 }
 udp_send_channel {
   host = wn01.phy.bg.ac.yu
   port = 8649
 }
 udp_send_channel {
   host = wn02.phy.bg.ac.yu
   port = 8649
 }
 udp_recv_channel {
   port = 8649
 }

 in gmond.conf_AEGIS01-PHY-SCL_Core_Services:

 cluster {
   name = "AEGIS01-PHY-SCL Core Services"
 }
 udp_send_channel {
   host = se.phy.bg.ac.yu
   port = 8649
 }
 udp_send_channel {
   host = rb.phy.bg.ac.yu
   port = 8649
 }
 udp_recv_channel {
   port = 8649
 }

 in gmetad.conf running on se.phy.bg.ac.yu:

 data_source "AEGIS01-PHY-SCL Core Services" se.phy.bg.ac.yu   rb.phy.bg.ac.yu
 data_source "AEGIS01-PHY-SCL" wn01.phy.bg.ac.yu   wn02.phy.bg.ac.yu


 If you can get multicast working, good luck, but it can be hard. Unicast
 is easy. To test the setup I gave you, telnet to a head node (wn01, wn02,
 se, rb) on either cluster:
 `telnet wn01.phy.bg.ac.yu 8649 | grep 'HOST NAME='`.

 Do not forget to restart all the gmonds and the gmetad processes after
 making changes to your configuration files.

 Good Luck,
 Ian




Re: [Ganglia-general] problem with SummaryInfo

2005-11-04 Thread michael chang
On 11/4/05, Ian Cunningham [EMAIL PROTECTED] wrote:
 If you can get multicast working, good luck, but it can be hard. Unicast
 is easy. To test the setup I gave you, telnet to a head node (wn01, wn02,
 se, rb) on either cluster:
 `telnet wn01.phy.bg.ac.yu 8649 | grep 'HOST NAME='`.

One thing I noticed: although it incurs a performance hit, I have
run Ganglia monitoring over an OpenVPN tunnel, with multicast, after
configuring OpenVPN to allow peers to see each other.  (Since I lack an
actual physical network to connect all the computers I monitor, which
are at different physical locations, this is the only way for me to do
it.)  It's a bit peculiar, and if unicast works, then that's also
great.  Just wanted to mention that this way also works for me.

And BTW, sorry for misleading him about individually listing the
individual machines - I forgot about the unicast support that was
recently added.  (Should've checked.)

--
~Mike
 - Just my two cents
 - No man is an island, and no man is unable.