Doug, OK, I gave the paper a careful read. Good work, I must say, and now I see where you are coming from. Questions: what polling interval would you use? Do you plan to multicast? Anyway -
Caveats:
- The authors are the ones who really know the internals.
- Our clusters are/were Monte Carlo, so no IPC inside the algorithms.
- Our clusters had fast polling intervals (5-10 seconds).

Gmond:
- Single-threaded for metric collection. Metrics are grouped into collection groups.
- Collection groups are polled. The sleep interval is the shortest interval from now to the next group that needs waking.
- Send is decoupled from collect. A send happens after a longer timeout or when a metric value exceeds a threshold (as you know).

So in theory, as "now" is not explicitly aligned, gmond sleeps and UDP sends should be stochastic. Are they? Well, no, not really.
- Using whole seconds and calling sleep() makes for really grainy gmond sleep intervals. gmonds with small sleep intervals will align on second boundaries and burst their UDP sends together.
- When a cluster starts computing, all nodes see a CPU load spike exceeding the configured threshold at much the same time.
- On a cluster with gmond multicasting, it is obviously true that gmond computation spikes will align. They clump together. Multicast - just say no.

It should be pretty easy to see the above behaviour by snoop/tcpdumping at a head node.
- I have observed unexplained delays every now and again in getting XML data via TCP from the head node. Head nodes being subnet-local but not part of the compute cluster may be worth considering, BTW.

gmetad:
- One thread per data_source, plus a few other threads for this and that.
- For 10-second polling I am fairly sure the threads end up aligning themselves. Resonance, as the paper said. I don't know why.

Summary? Resonance and gmond load-spike synchronicity may cause you some compute jitter on nodes. But if you want to observe load during calculations, well, you want to observe load, and gmond is pretty lightweight for that task; I am unaware of anything "lighter". gmetad-level scaling problems are real but can be managed through SAN or tmpfs use and by grouping nodes into clusters, then clusters into grids.

Aside: fast ganglia polling requires hacks to remove statically defined and long sleeps in several places (to add jitter to the threads). I have tacked a few rough sketches below the sig to illustrate the sleep-alignment, threshold-send, and thread-jitter points.

- richard
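
On the whole-second sleep() point: a minimal sketch (illustration only, not gmond source; assumes POSIX nanosleep()) of why second-granularity sleeps make co-started daemons wake in lockstep, and how sub-second random jitter breaks the alignment.

    /* Illustration only, not gmond code.  Assumes POSIX nanosleep(). */
    #include <stdlib.h>
    #include <time.h>

    /* sleep(interval_sec) rounds to whole seconds, so daemons started around
     * the same moment keep waking on the same second boundaries and burst
     * their UDP sends together.  Adding up to ~250 ms of random jitter lets
     * them drift apart over successive polls. */
    static void sleep_with_jitter(unsigned interval_sec)
    {
        struct timespec ts;
        ts.tv_sec  = interval_sec;
        ts.tv_nsec = (long)(rand() % 250) * 1000000L;   /* 0-249 ms jitter */
        nanosleep(&ts, NULL);
    }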
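
On the send-versus-collect decoupling and the job-start spike: a sketch of the kind of threshold check involved (my wording, not the actual gmond logic; the value_threshold/time_threshold field names are illustrative). When every node's load jumps past the value threshold at job start, every node sends at once.

    /* Illustration of threshold-driven sends, not the actual gmond logic. */
    #include <math.h>
    #include <time.h>

    struct metric_state {
        double last_sent_value;
        time_t last_sent_time;
        double value_threshold;   /* resend if the value moves this much   */
        time_t time_threshold;    /* resend after this many seconds anyway */
    };

    static int should_send(const struct metric_state *m, double value, time_t now)
    {
        if (now - m->last_sent_time >= m->time_threshold)
            return 1;             /* periodic refresh                      */
        if (fabs(value - m->last_sent_value) >= m->value_threshold)
            return 1;             /* big jump, e.g. cluster-wide job start */
        return 0;
    }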
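
And for the gmetad resonance, the same idea at the thread level: a sketch (assuming one polling thread per data_source as described above; fetch_xml_over_tcp() is a hypothetical placeholder) where each thread gets a random startup offset so identical 10-second polls do not settle into lockstep. Startup stagger alone is not enough if the per-iteration sleeps are still whole seconds, which is why the in-loop jitter above matters too.

    /* Sketch of staggering per-data_source pollers; not gmetad source. */
    #include <pthread.h>
    #include <stdlib.h>
    #include <unistd.h>

    struct data_source {
        const char *name;
        unsigned    poll_sec;
    };

    static void *poll_source(void *arg)
    {
        struct data_source *ds = arg;
        usleep(rand() % 1000000);         /* 0-1 s random startup stagger */
        for (;;) {
            /* fetch_xml_over_tcp(ds->name);   hypothetical fetch step */
            sleep(ds->poll_sec);          /* e.g. 10 s per data_source  */
        }
        return NULL;
    }

    static void start_pollers(struct data_source *sources, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            pthread_t tid;
            pthread_create(&tid, NULL, poll_source, &sources[i]);
        }
    }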

