Hello Bernard,

Thank you for your quick reply.

Bernard Li wrote:
Hi Andres:

On Fri, Apr 1, 2011 at 7:22 AM, Andres Lindau <andres.lin...@embl.de> wrote:

We are running Ganglia in a mixed Linux environment with 336 HPC hosts
and around 20 standalone servers.
Most of the hosts run CentOS 5, some SuSE or RedHat.
The server runs Ganglia 3.1.7 and the clients run mixed versions
starting from 3.0.3.

While gmetad 3.1.x can communicate with gmond 3.0.x fine, you cannot
mix different versions of gmond at the same level (i.e. gmond 3.1.x
cannot talk to gmond 3.0.x).  Not sure if you are doing that or not;
perhaps you could clarify.


The server runs gmetad 3.1.7, and the clients (talking to the server) run mixed versions.
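(To rule out a 3.1.x/3.0.x mix within one cluster, the daemon versions could be collected with something like the loop below. This is only a sketch, not taken from our setup, and it assumes passwordless ssh to the nodes:

    for h in clnode{1..336}; do
        echo -n "$h: "; ssh "$h" gmond --version
    done

gmond --version prints something like "gmond 3.1.7", so any mismatched hosts would show up immediately.)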

Most of the nodes send data and graphs are displayed as well (all 336
cluster nodes and most of the other hosts).
Some hosts don't even show up on the Ganglia page, although a telnet to
the configured port works and returns an XML document.
If we move such a "non-working" node, e.g., to a new data source with a
different name, it suddenly appears after restarting gmetad - though some still don't.
Moving it back to the old data source makes it disappear again.
I triple-checked all configuration files on the server and the clients,
and I debugged the gmetad daemon (all data sources were found and there
were no errors) as well as the gmond daemon(s) on the particular host(s)
(all modules were loaded and no errors occurred).
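For illustration, this is the kind of check I mean (host and port are just an example from the BC02 range; output trimmed):

    $ telnet clnode15 8652
    Trying ...
    Connected to clnode15.
    <?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
    <!DOCTYPE GANGLIA_XML [ ... ]>
    <GANGLIA_XML VERSION="3.1.7" SOURCE="gmond">
    ...

The affected hosts answer this check exactly like the working ones.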

Please post the data_source(s) lines from your gmetad.conf and diffs
of your gmond.conf from the default config.  If they are too big,
please compress them and/or paste them at pastebin.com and reference
them here.  Are you using unicast or multicast?

We are using multicast (a sketch of our channel configuration follows the data_source list below) - here are the data_source lines from gmetad.conf:

[root@monitor2 ~]# cat /etc/ganglia/gmetad.conf | grep data_source
# The data_source tag specifies either a cluster or a grid to
# data_source "my cluster" [polling interval] address1:port addreses2:port ...
# The keyword 'data_source' must immediately be followed by a unique
# data_source "my cluster" 10 localhost  my.machine.edu:8649  1.2.3.5:8655
# data_source "my grid" 50 1.3.4.7:8655 grid.org:8651 grid-backup.org:8651
# data_source "another source" 1.3.4.7:8655  1.3.4.8
#data_source "Monitor" localhost:8652
data_source "ClusterNG-BC01" clnode1:8651 clnode2:8651 clnode3:8651 clnode4:8651 clnode5:8651 clnode6:8651 clnode7:8651 clnode8:8651 clnode9:8651 clnode10:8651 clnode11:8651 clnode12:8651 clnode13:8651 clnode14:8651 data_source "ClusterNG-BC02" clnode15:8652 clnode16:8652 clnode17:8652 clnode18:8652 clnode19:8652 clnode20:8652 clnode21:8652 clnode22:8652 clnode23:8652 clnode24:8652 clnode25:8652 clnode26:8652 clnode27:8652 clnode28:8652 data_source "ClusterNG-BC03" clnode29:8653 clnode30:8653 clnode31:8653 clnode32:8653 clnode33:8653 clnode34:8653 clnode35:8653 clnode36:8653 clnode37:8653 clnode38:8653 clnode39:8653 clnode40:8653 clnode41:8653 clnode42:8653 data_source "ClusterNG-BC04" clnode43:8654 clnode44:8654 clnode45:8654 clnode46:8654 clnode47:8654 clnode48:8654 clnode49:8654 clnode50:8654 clnode51:8654 clnode52:8654 clnode53:8654 clnode54:8654 clnode55:8654 clnode56:8654 data_source "ClusterNG-BC05" clnode57:8655 clnode58:8655 clnode59:8655 clnode60:8655 clnode61:8655 clnode62:8655 clnode63:8655 clnode64:8655 clnode65:8655 clnode66:8655 clnode67:8655 clnode68:8655 clnode69:8655 clnode70:8655 data_source "ClusterNG-BC06" clnode71:8656 clnode72:8656 clnode73:8656 clnode74:8656 clnode75:8656 clnode76:8656 clnode77:8656 clnode78:8656 clnode79:8656 clnode80:8656 clnode81:8656 clnode82:8656 clnode83:8656 clnode84:8656 data_source "ClusterNG-BC07" clnode85:8657 clnode86:8657 clnode87:8657 clnode88:8657 clnode89:8657 clnode90:8657 clnode91:8657 clnode92:8657 clnode93:8657 clnode94:8657 clnode95:8657 clnode96:8657 clnode97:8657 clnode98:8657 data_source "ClusterNG-BC08" clnode99:8658 clnode100:8658 clnode101:8658 clnode102:8658 clnode103:8658 clnode104:8658 clnode105:8658 clnode106:8658 clnode107:8658 clnode108:8658 clnode109:8658 clnode110:8658 clnode111:8658 clnode112:8658 data_source "ClusterNG-BC09" clnode113:8659 clnode114:8659 clnode115:8659 clnode116:8659 clnode117:8659 clnode118:8659 clnode119:8659 clnode120:8659 clnode121:8659 clnode122:8659 clnode123:8659 clnode124:8659 clnode125:8659 clnode126:8659 data_source "ClusterNG-BC10" clnode127:8660 clnode128:8660 clnode129:8660 clnode130:8660 clnode131:8660 clnode132:8660 clnode133:8660 clnode134:8660 clnode135:8660 clnode136:8660 clnode137:8660 clnode138:8660 clnode139:8660 clnode140:8660 data_source "ClusterNG-BC11" clnode141:8661 clnode142:8661 clnode143:8661 clnode144:8661 clnode145:8661 clnode146:8661 clnode147:8661 clnode148:8661 clnode149:8661 clnode150:8661 clnode151:8661 clnode152:8661 clnode153:8661 clnode154:8661 data_source "ClusterNG-BC12" clnode155:8662 clnode156:8662 clnode157:8662 clnode158:8662 clnode159:8662 clnode160:8662 clnode161:8662 clnode162:8662 clnode163:8662 clnode164:8662 clnode165:8662 clnode166:8662 clnode167:8662 clnode168:8662 data_source "ClusterNG-BC13" clnode169:8663 clnode170:8663 clnode171:8663 clnode172:8663 clnode173:8663 clnode174:8663 clnode175:8663 clnode176:8663 clnode177:8663 clnode178:8663 clnode179:8663 clnode180:8663 clnode181:8663 clnode182:8663 data_source "ClusterNG-BC14" clnode183:8664 clnode184:8664 clnode185:8664 clnode186:8664 clnode187:8664 clnode188:8664 clnode189:8664 clnode190:8664 clnode191:8664 clnode192:8664 clnode193:8664 clnode194:8664 clnode195:8664 clnode196:8664 data_source "ClusterNG-BC15" clnode197:8665 clnode198:8665 clnode199:8665 clnode200:8665 clnode201:8665 clnode202:8665 clnode203:8665 clnode204:8665 clnode205:8665 clnode206:8665 clnode207:8665 clnode208:8665 clnode209:8665 clnode210:8665 data_source "ClusterNG-BC16" clnode215:8666 clnode216:8666 clnode217:8666 
clnode218:8666 clnode219:8666 clnode220:8666 clnode221:8666 clnode222:8666 clnode223:8666 clnode224:8666 #data_source "ClusterNG-BC16" clnode211:8666 clnode212:8666 clnode213:8666 clnode214:8666 clnode215:8666 clnode216:8666 clnode217:8666 clnode218:8666 clnode219:8666 clnode220:8666 clnode221:8666 clnode222:8666 clnode223:8666 clnode224:8666 data_source "ClusterNG-BC17" clnode225:8667 clnode226:8667 clnode227:8667 clnode228:8667 clnode229:8667 clnode230:8667 clnode231:8667 clnode232:8667 clnode233:8667 clnode234:8667 clnode235:8667 clnode236:8667 clnode237:8667 clnode238:8667 data_source "ClusterNG-BC18" clnode239:8668 clnode240:8668 clnode241:8668 clnode242:8668 clnode243:8668 clnode244:8668 clnode245:8668 clnode246:8668 clnode247:8668 clnode248:8668 clnode249:8668 clnode250:8668 clnode251:8668 clnode252:8668 data_source "ClusterNG-BC19" clnode253:8669 clnode254:8669 clnode255:8669 clnode256:8669 clnode257:8669 clnode258:8669 clnode259:8669 clnode260:8669 clnode261:8669 clnode262:8669 clnode263:8669 clnode264:8669 clnode265:8669 clnode266:8669 data_source "ClusterNG-BC20" clnode267:8670 clnode268:8670 clnode269:8670 clnode270:8670 clnode271:8670 clnode272:8670 clnode273:8670 clnode274:8670 clnode275:8670 clnode276:8670 clnode277:8670 clnode278:8670 clnode279:8670 clnode280:8670 data_source "ClusterNG-BC21" clnode281:8671 clnode282:8671 clnode283:8671 clnode284:8671 clnode285:8671 clnode286:8671 clnode287:8671 clnode288:8671 clnode289:8671 clnode290:8671 clnode291:8671 clnode292:8671 clnode293:8671 clnode294:8671 data_source "ClusterNG-BC22" clnode295:8672 clnode296:8672 clnode297:8672 clnode298:8672 clnode299:8672 clnode300:8672 clnode301:8672 clnode302:8672 clnode303:8672 clnode304:8672 clnode305:8672 clnode306:8672 clnode307:8672 clnode308:8672 data_source "ClusterNG-BC23" clnode309:8673 clnode310:8673 clnode311:8673 clnode312:8673 clnode313:8673 clnode314:8673 clnode315:8673 clnode316:8673 clnode317:8673 clnode318:8673 clnode319:8673 clnode320:8673 clnode321:8673 clnode322:8673 data_source "ClusterNG-BC24" clnode323:8674 clnode324:8674 clnode325:8674 clnode326:8674 clnode327:8674 clnode328:8674 clnode329:8674 clnode330:8674 clnode331:8674 clnode332:8674 clnode333:8674 clnode334:8674 clnode335:8674 clnode336:8674 data_source "Mail-Servers" lxmail01-vm.embl.de:8649 lxmail03.embl.de:8649 mail.embl.it:8649
data_source "Mail-ServerIT" mail.embl.it:8649
data_source "PBS-Servers" clmaster.embl.de:8650 clmaster-vm.embl.de:8650 pbs-master2.embl.de:8650 shadow-master2.embl.de:8650 data_source "Solexa" clnode197:8665 clnode198:8665 clnode199:8665 clnode200:8665 clnode201:8665 clnode202:8665 clnode203:8665 clnode204:8665 clnode205:8665 clnode206:8665 clnode207:8665 clnode208:8665 clnode209:8665 clnode210:8665 clnode211:8665 clnode212:8665 clnode213:8665 clnode214:8665
data_source "Various" ocs.embl.org:8648 localhost:8648
data_source "Web-Servers" r12s35.EMBL-Heidelberg.DE:8675 db11g:8675 r2s12:8675 searchmd.embl.org:8675 web1:8675 www-db:8675 nps:8675 nps2:8675

The diff between the standard gmond.conf and our config is attached to this e-mail (inlined below).

Cheers,

Bernard


Thank you for your support.

Best regards,
Andres Lindau
[Attachment: diff of our gmond.conf against the distribution default]

11d10
<   allow_extra_data = yes
15d13
<   send_metadata_interval = 0 /*secs */
18,21c16,18
< /*
<  * The cluster attributes specified will be used as part of the <CLUSTER>
<  * tag that will wrap all hosts collected by this instance.
<  */
---
> /* If a cluster attribute is specified, then all gmond hosts are wrapped inside
>  * of a <CLUSTER> tag.  If you do not specify a cluster tag, then all <HOSTS> will
>  * NOT be wrapped inside of a <CLUSTER> tag. */
23,24c20,21
<   name = "unspecified"
<   owner = "unspecified"
---
>   name = "Web-Servers"
>   owner = "EMBL Heidelberg"
31c28
<   location = "unspecified"
---
>   location = "Building 13, Datacenter"
37,42d33
<   #bind_hostname = yes # Highly recommended, soon to be default.
<                        # This option tells gmond to use a source address
<                        # that resolves to the machine's hostname.  Without
<                        # this, the metrics may appear to come from any
<                        # interface and the DNS names associated with
<                        # those IPs will be used to create the RRDs.
44c35
<   port = 8649
---
>   port = 8675
51c42
<   port = 8649
---
>   port = 8675
58c49
<   port = 8649
---
>   port = 8675
61,99d51
< /* Each metrics module that is referenced by gmond must be specified and
<    loaded. If the module has been statically linked with gmond, it does
<    not require a load path. However all dynamically loadable modules must
<    include a load path. */
< modules {
<   module {
<     name = "core_metrics"
<   }
<   module {
<     name = "cpu_module"
<     path = "modcpu.so"
<   }
<   module {
<     name = "disk_module"
<     path = "moddisk.so"
<   }
<   module {
<     name = "load_module"
<     path = "modload.so"
<   }
<   module {
<     name = "mem_module"
<     path = "modmem.so"
<   }
<   module {
<     name = "net_module"
<     path = "modnet.so"
<   }
<   module {
<     name = "proc_module"
<     path = "modproc.so"
<   }
<   module {
<     name = "sys_module"
<     path = "modsys.so"
<   }
< }
< 
< include ('/etc/ganglia/conf.d/*.conf')
114,115c66,67
<   }
< }
---
>   } 
> } 
117,120c69,70
< /* This collection group will send general info about this host every
<    1200 secs.
<    This information doesn't change between reboots and is only collected
<    once. */
---
> /* This collection group will send general info about this host every 1200 secs. 
>    This information doesn't change between reboots and is only collected once. */
126d75
<     title = "CPU Count"
130d78
<     title = "CPU Speed"
134d81
<     title = "Memory Total"
139d85
<     title = "Swap Space Total"
143d88
<     title = "Last Boot Time"
147d91
<     title = "Machine Type"
151d94
<     title = "Operating System"
155d97
<     title = "Operating System Release"
159d100
<     title = "Location"
163,165c104,105
< /* This collection group will send the status of gexecd for this host
<    every 300 secs.*/
< /* Unlike 2.5.x the default behavior is to report gexecd OFF. */
---
> /* This collection group will send the status of gexecd for this host every 300 secs */
> /* Unlike 2.5.x the default behavior is to report gexecd OFF.  */
171d110
<     title = "Gexec Status"
176,178c115,116
<    The time threshold is set to 90 seconds.  In honesty, this
<    time_threshold could be set significantly higher to reduce
<    unneccessary  network chatter. */
---
>    The time threshold is set to 90 seconds.  In honesty, this time_threshold could be
>    set significantly higher to reduce unneccessary network chatter. */
186d123
<     title = "CPU User"
191d127
<     title = "CPU System"
195,196c131
<     value_threshold = "5.0"
<     title = "CPU Idle"
---
>     value_threshold = "5.0" 
201d135
<     title = "CPU Nice"
206d139
<     title = "CPU aidle"
211d143
<     title = "CPU wio"
218d149
<     title = "CPU intr"
223d153
<     title = "CPU sintr"
235d164
<     title = "One Minute Load Average"
240d168
<     title = "Five Minute Load Average"
245d172
<     title = "Fifteen Minute Load Average"
256d182
<     title = "Total Running Processes"
261d186
<     title = "Total Processes"
274d198
<     title = "Free Memory"
279d202
<     title = "Shared Memory"
284d206
<     title = "Memory Buffers"
289d210
<     title = "Cached Memory"
294d214
<     title = "Free Swap Space"
304d223
<     title = "Bytes Sent"
309d227
<     title = "Bytes Received"
314d231
<     title = "Packets Received"
319d235
<     title = "Packets Sent"
330d245
<     title = "Total Disk Space"
340d254
<     title = "Disk Space Available"
345d258
<     title = "Maximum Disk Space Used"
348d260
< 