Re: [Ganglia-general] problem with SummaryInfo
Hi Branimir,

apparently Rick pointed you in the right direction already :-) Just a few comments.

Martin

--- Branimir Ackovic [EMAIL PROTECTED] wrote:

Thank you Rick and Martin for the quick response! I already tried the configuration that Rick suggested, but it doesn't work. In that configuration I see only one node per data_source (the last one). One week ago, Michael Chang helped me to solve the problem with this configuration:

data_source AEGIS01-PHY-SCL1 147.91.83.201
data_source AEGIS01-PHY-SCL2 147.91.83.202
data_source AEGIS01-PHY-SCL3 147.91.83.203
...

If I understand correctly, Martin suggests that I need two machines with gmetad (one for each data_source). Now I have gmetad only on the server with the web frontend (se.phy.bg.ac.yu).

That is totally fine. You only need one gmetad running. Your problem was that the nodes *within* your two *clusters* did not communicate correctly. MC's setup allowed you to query each node individually, but you lost the cluster concept that way.

It is true that the machines in the two groups do not see each other. Even in the same group. I tried:

[EMAIL PROTECTED] root]# telnet localhost 8649 | grep grid
Connection closed by foreign host.
[EMAIL PROTECTED] root]#

Both machines ce and grid are in the same data_source with the same gmond.conf files. As you said, Martin, I found the problem, but I haven't found a solution for it. :(

That was the most important step :-). Your gmond.conf files look like a multicast setup, but apparently something went wrong. Possible causes:
- no route for the multicast IP
- your switch does not like IGMP
- also, both of your clusters were talking on the same port. This can be a problem with multicast.

So, going unicast is the right way to go in my opinion. Advantages are:
- your networking infrastructure will not screw you up
- less network traffic. In a working multicast network you will have N*N messages going around. In a large cluster that can be a lot of traffic just for Ganglia.
Cheers
Martin

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
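As a rough back-of-the-envelope check of Martin's N*N point, the sketch below compares multicast fan-out with a unicast setup that sends to two head nodes. This is illustrative only: real traffic also depends on the number of metrics and the send intervals configured in gmond.conf.

```python
def multicast_messages(n_nodes: int) -> int:
    # With multicast, every node announces its metrics and every node
    # listens, so roughly N senders x N receivers per interval.
    return n_nodes * n_nodes

def unicast_messages(n_nodes: int, n_head_nodes: int = 2) -> int:
    # With unicast, each node sends only to the designated head node(s).
    return n_nodes * n_head_nodes

for n in (8, 64, 512):
    print(f"{n:4d} nodes: multicast ~{multicast_messages(n)}, "
          f"unicast ~{unicast_messages(n)} messages per interval")
```

For the 8-node worker cluster in this thread the difference is modest (64 vs. 16), but it grows quadratically with cluster size.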
[Ganglia-general] problem with SummaryInfo
Hi,

I configured Ganglia 3.0.1 to monitor a Grid site with 4 servers and 8 nodes. I put them in two groups: AEGIS01-PHY-SCL Core Services and AEGIS01-PHY-SCL.

There is a problem with the summary report: I see only one node in each of these sources. I also have a problem with the grid summary, because it uses the source summaries. You can see it at: http://se.phy.bg.ac.yu/site/ganglia

How can I configure Ganglia to see everything properly?

All servers have in /etc/gmond.conf:

cluster { name = AEGIS01-PHY-SCL Core Services }

and all nodes have in /etc/gmond.conf:

cluster { name = AEGIS01-PHY-SCL }

gmetad and the web frontend run on one of the servers (se.phy.bg.ac.yu/site/ganglia). In /etc/gmetad.conf I put:

data_source AEGIS01-PHY-SCL Core Services1 ce.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL Core Services2 se.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL Core Services3 grid.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL Core Services4 rb.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL1 wn01.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL2 wn02.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL3 wn03.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL4 wn04.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL5 wn05.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL6 wn06.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL7 wn07.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL8 wn08.phy.bg.ac.yu

and

gridname AEGIS01 PHY SCL

-
Branimir Ackovic
E-mail: [EMAIL PROTECTED]
Web: http://scl.phy.bg.ac.yu/
Phone: +381 11 3160260, Ext. 152
Fax: +381 11 3162190
Scientific Computing Laboratory
Institute of Physics, Belgrade
Serbia and Montenegro
-
Re: [Ganglia-general] problem with SummaryInfo
Hi Branimir,

those servers look great. What are they? :-)

Anyway, could you please post the two different gmond.conf files and the gmetad.conf file? I have the impression that the machines in the two groups do not see each other. At least one machine in each group should see the metrics of its partner machines. In gmetad.conf you would use that machine as the data source. Basically you should only have two data sources in your gmetad.conf.

Simple test: log into one of the servers and do a `telnet localhost <gmond-port>`. It should show you the data of all hosts in that group (grep for HOST NAME). If it only shows its own data, you have found the problem.

Cheers
Martin

--- Branimir Ackovic [EMAIL PROTECTED] wrote:

Hi,

I configured Ganglia 3.0.1 to monitor a Grid site with 4 servers and 8 nodes. I put them in two groups: AEGIS01-PHY-SCL Core Services and AEGIS01-PHY-SCL.

There is a problem with the summary report: I see only one node in each of these sources. I also have a problem with the grid summary, because it uses the source summaries. You can see it at: http://se.phy.bg.ac.yu/site/ganglia

How can I configure Ganglia to see everything properly?

All servers have in /etc/gmond.conf:

cluster { name = AEGIS01-PHY-SCL Core Services }

and all nodes have in /etc/gmond.conf:

cluster { name = AEGIS01-PHY-SCL }

gmetad and the web frontend run on one of the servers (se.phy.bg.ac.yu/site/ganglia).
In /etc/gmetad.conf I put:

data_source AEGIS01-PHY-SCL Core Services1 ce.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL Core Services2 se.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL Core Services3 grid.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL Core Services4 rb.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL1 wn01.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL2 wn02.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL3 wn03.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL4 wn04.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL5 wn05.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL6 wn06.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL7 wn07.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL8 wn08.phy.bg.ac.yu

and

gridname AEGIS01 PHY SCL

-
Branimir Ackovic
E-mail: [EMAIL PROTECTED]
Web: http://scl.phy.bg.ac.yu/
Phone: +381 11 3160260, Ext. 152
Fax: +381 11 3162190
Scientific Computing Laboratory
Institute of Physics, Belgrade
Serbia and Montenegro
-

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
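Martin's "grep for HOST NAME" test can also be done programmatically. The sketch below reads the XML that gmond serves on its TCP port and lists the hosts it reports. The sample XML is a hand-trimmed illustration, not real gmond output, and `fetch_gmond_xml` assumes a reachable gmond on the default port:

```python
import socket
import xml.etree.ElementTree as ET

def fetch_gmond_xml(host: str = "localhost", port: int = 8649) -> str:
    # gmond dumps its full XML state on connect, then closes the socket.
    with socket.create_connection((host, port), timeout=5) as sock:
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode()

def count_hosts(xml_text: str) -> list:
    # A healthy cluster member should list every node in its group,
    # not just itself.
    root = ET.fromstring(xml_text)
    return [h.get("NAME") for h in root.iter("HOST")]

# Hand-trimmed sample of the kind of XML gmond emits (illustrative only).
sample = """<GANGLIA_XML VERSION="3.0.1" SOURCE="gmond">
 <CLUSTER NAME="AEGIS01-PHY-SCL Core Services" LOCALTIME="0" OWNER="" LATLONG="" URL="">
  <HOST NAME="ce.phy.bg.ac.yu" IP="147.91.83.217" REPORTED="0" TN="0" TMAX="20" DMAX="0"/>
  <HOST NAME="grid.phy.bg.ac.yu" IP="147.91.83.219" REPORTED="0" TN="0" TMAX="20" DMAX="0"/>
 </CLUSTER>
</GANGLIA_XML>"""

print(count_hosts(sample))  # → ['ce.phy.bg.ac.yu', 'grid.phy.bg.ac.yu']
```

If `count_hosts(fetch_gmond_xml())` returns only the local host's name, the nodes in that cluster are not seeing each other, which is exactly the symptom in this thread.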
Re: [Ganglia-general] problem with SummaryInfo
On Fri, 4 Nov 2005, Branimir Ackovic wrote:

All servers have in /etc/gmond.conf:

cluster { name = AEGIS01-PHY-SCL Core Services }

and all nodes have in /etc/gmond.conf:

cluster { name = AEGIS01-PHY-SCL }

gmetad and the web frontend run on one of the servers (se.phy.bg.ac.yu/site/ganglia). In /etc/gmetad.conf I put:

data_source AEGIS01-PHY-SCL Core Services1 ce.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL Core Services2 se.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL Core Services3 grid.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL Core Services4 rb.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL1 wn01.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL2 wn02.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL3 wn03.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL4 wn04.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL5 wn05.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL6 wn06.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL7 wn07.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL8 wn08.phy.bg.ac.yu

and

gridname AEGIS01 PHY SCL

I believe the problem stems from the fact that the cluster names used on the data_source lines do not match the names defined in the gmond.conf files. You may want to try something like this:

data_source AEGIS01-PHY-SCL Core Services ce.phy.bg.ac.yu \
    se.phy.bg.ac.yu grid.phy.bg.ac.yu rb.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL wn01.phy.bg.ac.yu wn02.phy.bg.ac.yu ...

--
Rick

--
Rick Mohr
Systems Developer
Ohio Supercomputer Center
Re: [Ganglia-general] problem with SummaryInfo
Thank you Rick and Martin for the quick response! I already tried the configuration that Rick suggested, but it doesn't work. In that configuration I see only one node per data_source (the last one). One week ago, Michael Chang helped me to solve the problem with this configuration:

data_source AEGIS01-PHY-SCL1 147.91.83.201
data_source AEGIS01-PHY-SCL2 147.91.83.202
data_source AEGIS01-PHY-SCL3 147.91.83.203
...

If I understand correctly, Martin suggests that I need two machines with gmetad (one for each data_source). Now I have gmetad only on the server with the web frontend (se.phy.bg.ac.yu). It is true that the machines in the two groups do not see each other. Even in the same group. I tried:

[EMAIL PROTECTED] root]# telnet localhost 8649 | grep grid
Connection closed by foreign host.
[EMAIL PROTECTED] root]#

Both machines ce and grid are in the same data_source with the same gmond.conf files. As you said, Martin, I found the problem, but I haven't found a solution for it. :( You can find my gmond and gmetad conf files in the attachment.

-
Branimir Ackovic
E-mail: [EMAIL PROTECTED]
Web: http://scl.phy.bg.ac.yu/
Phone: +381 11 3160260, Ext. 152
Fax: +381 11 3162190
Scientific Computing Laboratory
Institute of Physics, Belgrade
Serbia and Montenegro
-

I believe the problem stems from the fact that the cluster names used on the data_source lines do not match the names defined in the gmond.conf files. You may want to try something like this:

data_source AEGIS01-PHY-SCL Core Services ce.phy.bg.ac.yu \
    se.phy.bg.ac.yu grid.phy.bg.ac.yu rb.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL wn01.phy.bg.ac.yu wn02.phy.bg.ac.yu ...

--
Rick

Anyway, could you please post the two different gmond.conf files and the gmetad.conf file? I have the impression that the machines in the two groups do not see each other. At least one machine in each group should see the metrics of its partner machines. In gmetad.conf you would use that machine as the data source.
Basically you should only have two data sources in your gmetad.conf.

Simple test: log into one of the servers and do a `telnet localhost <gmond-port>`. It should show you the data of all hosts in that group (grep for HOST NAME). If it only shows its own data, you have found the problem.

Cheers
Martin

# This is an example of a Ganglia Meta Daemon configuration file
# http://ganglia.sourceforge.net/
#
# $Id: gmetad.conf,v 1.17 2005/03/15 18:15:05 massie Exp $
#
#---
# Setting the debug_level to 1 will keep daemon in the foreground and
# show only error messages. Setting this value higher than 1 will make
# gmetad output debugging information and stay in the foreground.
# default: 0
# debug_level 10
#
#---
# What to monitor. The most important section of this file.
#
# The data_source tag specifies either a cluster or a grid to
# monitor. If we detect the source is a cluster, we will maintain a complete
# set of RRD databases for it, which can be used to create historical
# graphs of the metrics. If the source is a grid (it comes from another
# gmetad), we will only maintain summary RRDs for it.
#
# Format:
# data_source "my cluster" [polling interval] address1:port address2:port ...
#
# The keyword 'data_source' must immediately be followed by a unique
# string which identifies the source, then an optional polling interval in
# seconds. The source will be polled at this interval on average.
# If the polling interval is omitted, 15sec is assumed.
#
# A list of machines which service the data source follows, in the
# format ip:port, or name:port. If a port is not specified then 8649
# (the default gmond port) is assumed.
# default: There is no default value
#
# data_source "my cluster" 10 localhost my.machine.edu:8649 1.2.3.5:8655
# data_source "my grid" 50 1.3.4.7:8655 grid.org:8651 grid-backup.org:8651
# data_source "another source" 1.3.4.7:8655 1.3.4.8

data_source AEGIS01-PHY-SCL Core Services1 147.91.83.217
data_source AEGIS01-PHY-SCL Core Services2 147.91.83.218
data_source AEGIS01-PHY-SCL Core Services3 147.91.83.219
data_source AEGIS01-PHY-SCL Core Services4 147.91.83.220
data_source AEGIS01-PHY-SCL1 147.91.83.201
data_source AEGIS01-PHY-SCL2 147.91.83.202
data_source AEGIS01-PHY-SCL3 147.91.83.203
data_source AEGIS01-PHY-SCL4 147.91.83.204
data_source AEGIS01-PHY-SCL5 147.91.83.205
data_source AEGIS01-PHY-SCL6 147.91.83.206
data_source AEGIS01-PHY-SCL7 147.91.83.207
data_source AEGIS01-PHY-SCL8 147.91.83.208

#
# Round-Robin Archives
# You can specify custom Round-Robin archives here (defaults are listed below)
#
# RRAs RRA:AVERAGE:0.5:1:240 RRA:AVERAGE:0.5:24:240 RRA:AVERAGE:0.5:168:240 \
# RRA:AVERAGE:0.5:672:240 RRA:AVERAGE:0.5:5760:370
#
#---
# Scalability mode. If on, we summarize over
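Putting Rick's and Martin's advice together, a two-source gmetad.conf would look something like the sketch below. One detail here is my assumption rather than something stated in the thread: since both cluster names contain spaces, they are quoted the same way the stock file's `data_source "my cluster"` example is, so the parser does not split the name from the host list. The host lists themselves are taken from the thread.

```
# One data_source per cluster, matching the cluster { name = ... }
# values in the respective gmond.conf files.
data_source "AEGIS01-PHY-SCL Core Services" ce.phy.bg.ac.yu se.phy.bg.ac.yu grid.phy.bg.ac.yu rb.phy.bg.ac.yu
data_source "AEGIS01-PHY-SCL" wn01.phy.bg.ac.yu wn02.phy.bg.ac.yu
```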
Re: [Ganglia-general] problem with SummaryInfo
Branimir,

It seems as though multicast is not working. I would recommend going to unicast. For your clusters this would mean picking 2 head nodes per cluster:

in gmond.conf_AEGIS01-PHY-SCL:

cluster { name = AEGIS01-PHY-SCL }
udp_send_channel { host = wn01.phy.bg.ac.yu port = 8649 }
udp_send_channel { host = wn02.phy.bg.ac.yu port = 8649 }
udp_recv_channel { port = 8649 }

in gmond.conf_AEGIS01-PHY-SCL_Core_Services:

cluster { name = AEGIS01-PHY-SCL Core Services }
udp_send_channel { host = se.phy.bg.ac.yu port = 8649 }
udp_send_channel { host = rb.phy.bg.ac.yu port = 8649 }
udp_recv_channel { port = 8649 }

in gmetad.conf running on se.phy.bg.ac.yu:

data_source AEGIS01-PHY-SCL Core Services se.phy.bg.ac.yu rb.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL wn01.phy.bg.ac.yu wn02.phy.bg.ac.yu

If you can get multicast working, good luck, but it can be hard. Unicast is easy. To test the setup I gave you, telnet to a head node (wn01, wn02, se, rb) on either cluster: `telnet wn01.phy.bg.ac.yu 8649 | grep 'HOST NAME='`. Do not forget to restart all the gmond and gmetad processes after making changes to your configuration files.

Good Luck,
Ian

Branimir Ackovic wrote:

Thank you Rick and Martin for the quick response! I already tried the configuration that Rick suggested, but it doesn't work. In that configuration I see only one node per data_source (the last one). One week ago, Michael Chang helped me to solve the problem with this configuration:

data_source AEGIS01-PHY-SCL1 147.91.83.201
data_source AEGIS01-PHY-SCL2 147.91.83.202
data_source AEGIS01-PHY-SCL3 147.91.83.203
...

If I understand correctly, Martin suggests that I need two machines with gmetad (one for each data_source). Now I have gmetad only on the server with the web frontend (se.phy.bg.ac.yu). It is true that the machines in the two groups do not see each other. Even in the same group. I tried:

[EMAIL PROTECTED] root]# telnet localhost 8649 | grep grid
Connection closed by foreign host.
[EMAIL PROTECTED] root]#

Both machines ce and grid are in the same data_source with the same gmond.conf files. As you said, Martin, I found the problem, but I haven't found a solution for it. :( You can find my gmond and gmetad conf files in the attachment.

-
Branimir Ackovic
E-mail: [EMAIL PROTECTED]
Web: http://scl.phy.bg.ac.yu/
Phone: +381 11 3160260, Ext. 152
Fax: +381 11 3162190
Scientific Computing Laboratory
Institute of Physics, Belgrade
Serbia and Montenegro
-
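Ian's telnet-and-grep check boils down to counting HOST elements in the XML a head node serves: one per cluster member means the group is communicating. A minimal sketch, using a canned XML string instead of a live gmond so it runs anywhere (in practice you would pipe the telnet output in):

```shell
# Canned stand-in for the XML a gmond head node would serve on port 8649.
xml='<CLUSTER NAME="AEGIS01-PHY-SCL"><HOST NAME="wn01.phy.bg.ac.yu"/><HOST NAME="wn02.phy.bg.ac.yu"/></CLUSTER>'

# Count HOST entries; a healthy head node reports one per cluster member.
printf '%s\n' "$xml" | grep -o 'HOST NAME=' | wc -l

# Against a live setup the equivalent would be:
#   telnet wn01.phy.bg.ac.yu 8649 | grep -c 'HOST NAME='
```

A count of 1 on a multi-node cluster reproduces exactly the "only one node per data_source" symptom from earlier in the thread.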
Re: [Ganglia-general] problem with SummaryInfo
That's it! Thanks.

pozdrav
Acko

-
Branimir Ackovic
E-mail: [EMAIL PROTECTED]
Web: http://scl.phy.bg.ac.yu/
Phone: +381 11 3160260, Ext. 152
Fax: +381 11 3162190
Scientific Computing Laboratory
Institute of Physics, Belgrade
Serbia and Montenegro
-

On Friday 04 November 2005 20:03, you wrote:

Branimir,

It seems as though multicast is not working. I would recommend going to unicast. For your clusters this would mean picking 2 head nodes per cluster:

in gmond.conf_AEGIS01-PHY-SCL:

cluster { name = AEGIS01-PHY-SCL }
udp_send_channel { host = wn01.phy.bg.ac.yu port = 8649 }
udp_send_channel { host = wn02.phy.bg.ac.yu port = 8649 }
udp_recv_channel { port = 8649 }

in gmond.conf_AEGIS01-PHY-SCL_Core_Services:

cluster { name = AEGIS01-PHY-SCL Core Services }
udp_send_channel { host = se.phy.bg.ac.yu port = 8649 }
udp_send_channel { host = rb.phy.bg.ac.yu port = 8649 }
udp_recv_channel { port = 8649 }

in gmetad.conf running on se.phy.bg.ac.yu:

data_source AEGIS01-PHY-SCL Core Services se.phy.bg.ac.yu rb.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL wn01.phy.bg.ac.yu wn02.phy.bg.ac.yu

If you can get multicast working, good luck, but it can be hard. Unicast is easy. To test the setup I gave you, telnet to a head node (wn01, wn02, se, rb) on either cluster: `telnet wn01.phy.bg.ac.yu 8649 | grep 'HOST NAME='`. Do not forget to restart all the gmond and gmetad processes after making changes to your configuration files.

Good Luck,
Ian

Branimir Ackovic wrote:

Thank you Rick and Martin for the quick response! I already tried the configuration that Rick suggested, but it doesn't work. In that configuration I see only one node per data_source (the last one). One week ago, Michael Chang helped me to solve the problem with this configuration:

data_source AEGIS01-PHY-SCL1 147.91.83.201
data_source AEGIS01-PHY-SCL2 147.91.83.202
data_source AEGIS01-PHY-SCL3 147.91.83.203
...

If I understand correctly, Martin suggests that I need two machines with gmetad (one for each data_source).
Now I have gmetad only on the server with the web frontend (se.phy.bg.ac.yu). It is true that the machines in the two groups do not see each other. Even in the same group. I tried:

[EMAIL PROTECTED] root]# telnet localhost 8649 | grep grid
Connection closed by foreign host.
[EMAIL PROTECTED] root]#

Both machines ce and grid are in the same data_source with the same gmond.conf files. As you said, Martin, I found the problem, but I haven't found a solution for it. :( You can find my gmond and gmetad conf files in the attachment.

-
Branimir Ackovic
E-mail: [EMAIL PROTECTED]
Web: http://scl.phy.bg.ac.yu/
Phone: +381 11 3160260, Ext. 152
Fax: +381 11 3162190
Scientific Computing Laboratory
Institute of Physics, Belgrade
Serbia and Montenegro
-
Re: [Ganglia-general] problem with SummaryInfo
On 11/4/05, Ian Cunningham [EMAIL PROTECTED] wrote:

If you can get multicast working, good luck, but it can be hard. Unicast is easy. To test the setup I gave you, telnet to a head node (wn01, wn02, se, rb) on either cluster: `telnet wn01.phy.bg.ac.yu 8649 | grep 'HOST NAME='`.

One thing I noticed: while it incurs a performance hit, I have run Ganglia monitoring over an OpenVPN tunnel, with multicast, when I configure OpenVPN to allow peers to see each other. (Since I lack an actual physical network connecting all the computers I monitor, which are at different physical locations, this is the only way for me to do it.) It's a bit peculiar, and if unicast works, then that's also great. Just wanted to mention that this way also works for me.

And BTW, sorry for misleading him by listing the machines individually - I forgot about the unicast support that was recently added. (Should've checked.)

--
~Mike
- Just my two cents
- No man is an island, and no man is unable.