> Also, we still have a mysterious leak in our gmond (which seems to be related
> to spoofed metrics, not the bundled-metrics)
interesting... i started another thread about gmond leaks, and i too am doing
a significant amount of spoofed metrics.  i had wondered if there was some
relationship to a recent increase of non-zero dmax settings on a significant
number of custom metrics in my system, but i haven't scrutinized that yet...
figured i'd check for known leaks with the community first... so far it does
not sound like it is a particularly well-known issue.

> Somewhere in the 120 - 150K range, we start having gaps at the cluster
> metric & grid metric level; gaps usually form at the grid metric before
> the cluster metric.

i'm organized as a single cluster/grid... i'm grouping hosts in multiple ways
with an external drraw tagging scheme.  i wonder if the grid-level metric
appears more affected for people like you with multiple clusters because the
grid can be affected by any single cluster... i think that could be consistent
with udp or i/o as a root cause.  if/when i get to 300k metrics i'll report
back here on how it's going.

-scott

On Wed, 17 Feb 2010, Rick Cobb wrote:

> We did a fork of 3.0.4 that just uses a simple "while xdr_decode()..." loop
> to handle multiple metrics. It works pretty well. One additional trick we
> applied was allowing the client to send timestamped messages. While that adds
> a dependency (your machines have to be reliably time-synchronized), it allows
> our client code (based on a pretty simple Ruby class) to fill a UDP packet
> before transmitting. For clients that don't send time-stamped messages
> (e.g., gmond itself), we send a metric group at a time. We took our
> multicast packet volume down by a factor of about 20 doing this, and it
> allowed us to more than triple the number of metrics per node we run. The
> Ruby class sends the packets directly using a socket interface.
>
> Together, we find this architecture far more compelling than the "plug-in"
> model, since failures or delays in a specific monitor (say, an application
> monitoring daemon) don't cause failures of the entire monitoring service for
> a node. It's also far simpler to deal with process-parallel monitors than
> thread-parallel ones.
>
> However, it's not forward-compatible; unmodified gmonds ignore all metrics
> after the first one in the packet (and even that one if it's timestamped).
> Also, we still have a mysterious leak in our gmond (which seems to be related
> to spoofed metrics, not the bundled-metrics), so I've never tried submitting
> the patches. And it wouldn't work for the new format in 3.1.x anyway. I've
> meant to get around to re-doing the patches for 3.1, but never have been able
> to allocate the time.
>
> I'm particularly interested in your 150K number. We don't have any single
> cluster that goes that high, but I have found that that's about as far as I
> can take a gmetad (grid), regardless of the number of underlying clusters
> (I've used ranges from four to 24). Somewhere in the 120 - 150K range, we
> start having gaps at the cluster metric & grid metric level; gaps usually
> form at the grid metric before the cluster metric. Since gmetad has a thread
> per cluster, I've always thought the gap was being caused by the mutex that
> controls grid-level metric updates; thanks for the clue about UDP issues.
>
> -- ReC
>
>
> On Feb 17, 2010, at 2:58 PM, Scott Dworkis wrote:
>
>>> packets together, if nothing else), and that would tend to reduce
>>> the packet rate slightly. Gmond could probably be a little more
>>
>> sure, and would save udp header overhead too.
>>
>> -scott
>>
>> On Wed, 17 Feb 2010, Jesse Becker wrote:
>>
>>> On Wed, Feb 17, 2010 at 16:35, Scott Dworkis <[email protected]> wrote:
>>>> i think i saw the Ganglia-Gmetric and it appeared fork-based, so i had
>>>> scale concerns.
>>>>
>>>> embeddedgmetric looks good... maybe it's spent 3 years in alpha though?
>>>> but if it works, who cares.
>>>
>>> Well, there isn't a whole lot to the protocol, so it's possible that
>>> there really isn't much more work that needs to be done. :)
>>>
>>>> one nice thing about the csv approach is it should scale for any language
>>>> that can do standard io... even a shell or sed script.
>>>
>>> Agreed, but neither sed nor awk[1] can send gmetric packets natively;
>>> Perl at least has modules to help with it.
>>>
>>>> i wonder if/how-much a packet is cheaper than a fork? but yeah ideally
>>>
>>> I would be shocked if sending a UDP packet wasn't substantially cheaper
>>> than a full-blown fork (which then needs to *also* send a UDP packet).
>>>
>>>> you'd consolidate packets. as far as i can tell i haven't bumped into
>>>> packet-rate issues as much as packet-byte-rate issues, which would remain
>>>> even in a consolidated packet scenario.
>>>
>>> Interesting point. However, consolidating packets does imply a bit
>>> more pre-processing of the data on the sender's side (to glom the
>>> packets together, if nothing else), and that would tend to reduce
>>> the packet rate slightly. Gmond could probably be a little more
>>> efficient about memory allocation as well if it could allocate a
>>> larger lump of space at once for multiple metrics instead of dealing
>>> with each one piecemeal.
>>>
>>>
>>> [1] gawk has some very odd built-in "device files" that you can use.
>>> Check out the manpage for "Special File Names", and then look at the
>>> /inet/udp/lport/rhost/rport "files".
>>>
>>>
>>> --
>>> Jesse Becker
>>> Every cloud has a silver lining, except for the mushroom-shaped ones,
>>> which come lined with strontium-90.
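
A rough sketch of the "while xdr_decode()..." receive loop described above,
using the standard Sun RPC/XDR routines (libtirpc on newer systems): keep
decoding records out of one UDP datagram until the buffer is exhausted.  The
record layout assumed here (a message key, then type/name/value/units strings,
then slope/tmax/dmax) is a guess at the 3.0-era gmetric wire format, the
timestamp field Rick's fork adds is omitted, and the helper names are made up;
this is not code from his patch.

#include <rpc/rpc.h>
#include <stdio.h>

#define GM_MAX_STR 256                 /* arbitrary cap for this sketch */

/* Decode a single gmetric-style record.  Field order is an assumption. */
static bool_t decode_one(XDR *x)
{
    u_int key, slope, tmax, dmax;
    char *type = NULL, *name = NULL, *value = NULL, *units = NULL;

    bool_t ok = xdr_u_int(x, &key) &&          /* assumed message-format id */
                xdr_string(x, &type,  GM_MAX_STR) &&
                xdr_string(x, &name,  GM_MAX_STR) &&
                xdr_string(x, &value, GM_MAX_STR) &&
                xdr_string(x, &units, GM_MAX_STR) &&
                xdr_u_int(x, &slope) &&
                xdr_u_int(x, &tmax) &&
                xdr_u_int(x, &dmax);

    if (ok)
        printf("metric %s = %s %s (type %s)\n", name, value, units, type);

    /* xdr_string() allocates when handed a NULL pointer; xdr_free() is safe
     * even for fields that never got filled in. */
    xdr_free((xdrproc_t)xdr_string, (char *)&type);
    xdr_free((xdrproc_t)xdr_string, (char *)&name);
    xdr_free((xdrproc_t)xdr_string, (char *)&value);
    xdr_free((xdrproc_t)xdr_string, (char *)&units);
    return ok;
}

/* Called once per received datagram: instead of stopping after the first
 * record, keep going until the buffer runs dry or a decode fails. */
void handle_datagram(char *buf, u_int len)
{
    XDR x;
    xdrmem_create(&x, buf, len, XDR_DECODE);
    while (xdr_getpos(&x) < len && decode_one(&x))
        ;
    xdr_destroy(&x);
}

A truncated or trailing partial record simply ends the loop, and a packet
carrying a single record decodes exactly once.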
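
And a similarly hedged sketch of the client side, i.e. filling one UDP packet
with several XDR-encoded metrics before transmitting, which is the job Rick's
Ruby class does over a plain socket.  The field order, the message-key value,
and the send_bundle()/gm_metric names are all illustrative assumptions; a real
client would have to match whatever layout the patched gmond actually expects.

#include <arpa/inet.h>
#include <rpc/rpc.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define GM_MAX_STR 256

struct gm_metric {
    const char *type, *name, *value, *units;
    u_int slope, tmax, dmax;
};

/* Encode one record with the same assumed field order as the decode sketch. */
static bool_t encode_one(XDR *x, const struct gm_metric *m)
{
    u_int key = 0;                     /* assumed "user-defined metric" id */
    char *type  = (char *)m->type,  *name  = (char *)m->name;
    char *value = (char *)m->value, *units = (char *)m->units;
    u_int slope = m->slope, tmax = m->tmax, dmax = m->dmax;

    return xdr_u_int(x, &key) &&
           xdr_string(x, &type,  GM_MAX_STR) &&
           xdr_string(x, &name,  GM_MAX_STR) &&
           xdr_string(x, &value, GM_MAX_STR) &&
           xdr_string(x, &units, GM_MAX_STR) &&
           xdr_u_int(x, &slope) &&
           xdr_u_int(x, &tmax) &&
           xdr_u_int(x, &dmax);
}

/* Pack as many metrics as fit into one datagram-sized buffer, then send them
 * in a single UDP packet.  Returns how many metrics were actually sent. */
int send_bundle(const char *host, int port, const struct gm_metric *m, int n)
{
    char buf[1400];                    /* stay under a typical ethernet MTU */
    XDR x;
    int i, packed = 0;

    xdrmem_create(&x, buf, sizeof buf, XDR_ENCODE);
    for (i = 0; i < n; i++) {
        u_int mark = xdr_getpos(&x);
        if (!encode_one(&x, &m[i])) {  /* out of room: roll back, stop here */
            xdr_setpos(&x, mark);
            break;
        }
        packed++;
    }
    if (packed == 0) {                 /* nothing fit; send nothing */
        xdr_destroy(&x);
        return 0;
    }

    struct sockaddr_in to;
    memset(&to, 0, sizeof to);
    to.sin_family = AF_INET;
    to.sin_port = htons(port);
    if (inet_pton(AF_INET, host, &to.sin_addr) != 1)
        return -1;

    int s = socket(AF_INET, SOCK_DGRAM, 0);
    if (s < 0)
        return -1;
    if (sendto(s, buf, xdr_getpos(&x), 0,
               (struct sockaddr *)&to, sizeof to) < 0)
        packed = -1;
    close(s);
    xdr_destroy(&x);
    return packed;
}

Pointing it at the stock multicast channel, e.g.
send_bundle("239.2.11.71", 8649, metrics, n), also illustrates Rick's
compatibility caveat: an unmodified gmond would only see the first metric in
each packet.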

