> Also, we still have a mysterious leak in our gmond (which seems to be related
> to spoofed metrics, not the bundled-metrics)
interesting... i started another thread about gmond leaks, and i too am doing
a significant amount of spoofed metrics.  i had wondered if there was some
relationship to a recent increase of non-zero dmax settings on a significant
number of custom metrics in my system, but i haven't scrutinized that yet...
figured i'd check for known leaks with the community first... so far it does
not sound like it is a particularly well-known issue.

> Somewhere in the 120 - 150K range, we start having gaps at the cluster
> metric & grid metric level; gaps usually form at the grid metric before
> the cluster metric.

i'm organized as a single cluster/grid... i'm grouping hosts in multiple ways
with an external drraw tagging scheme.  i wonder if the grid-level metric
appears more affected for people like you with multiple clusters because the
grid can be affected by any single cluster... i think that could be consistent
with udp or i/o as a root cause.  if/when i get to 300k metrics i'll report
back here on how it's going.

-scott

On Wed, 17 Feb 2010, Rick Cobb wrote:

> We did a fork of 3.0.4 that just uses a simple "while xdr_decode()..." loop
> to handle multiple metrics. It works pretty well. One additional trick we
> applied was allowing the client to send timestamped messages. While that adds
> a dependency (your machines have to be reliably time-synchronized), it allows
> our client code (based on a pretty simple Ruby class) to fill a UDP packet
> before transmitting. For clients that don't send time-stamped messages
> (e.g., gmond itself), we send a metric group at a time. We took our
> multicast packet volume down by a factor of about 20 doing this, and it
> allowed us to more than triple the number of metrics per node we run. The
> Ruby class sends the packets directly using a socket interface.
>
> Together, we find this architecture far more compelling than the "plug-in"
> model, since failures or delays in a specific monitor (say, an application
> monitoring daemon) don't cause failures of the entire monitoring service for
> a node. It's also far simpler to deal with process-parallel monitors than
> thread-parallel ones.
>
> However, it's not forward-compatible; unmodified gmonds ignore all metrics
> after the first one in the packet (and even that one if it's timestamped).
> Also, we still have a mysterious leak in our gmond (which seems to be related
> to spoofed metrics, not the bundled-metrics), so I've never tried submitting
> the patches. And it wouldn't work for the new format in 3.1.x anyway. I've
> meant to get around to re-doing the patches for 3.1, but never have been able
> to allocate the time.
>
> I'm particularly interested in your 150K number. We don't have any single
> cluster that goes that high, but I have found that that's about as far as I
> can take a gmetad (grid), regardless of the number of underlying clusters
> (I've used ranges from four to 24). Somewhere in the 120 - 150K range, we
> start having gaps at the cluster metric & grid metric level; gaps usually
> form at the grid metric before the cluster metric. Since gmetad has a thread
> per cluster, I've always thought the gap was being caused by the mutex that
> controls grid-level metric updates; thanks for the clue about UDP issues.
>
> -- ReC
>
>
> On Feb 17, 2010, at 2:58 PM, Scott Dworkis wrote:
>
>>> packets together, if nothing else), and that would tend to reduce
>>> the packet rate slightly. Gmond could probably be a little more
>>
>> sure, and would save udp header overhead too.
>>
>> -scott
>>
>> On Wed, 17 Feb 2010, Jesse Becker wrote:
>>
>>> On Wed, Feb 17, 2010 at 16:35, Scott Dworkis <[email protected]> wrote:
>>>> i think i saw the Ganglia-Gmetric and it appeared fork-based, so i had
>>>> scale concerns.
>>>>
>>>> embeddedgmetric looks good... maybe it's spent 3 years in alpha though?
>>>> but if it works, who cares.
>>>
>>> Well, there isn't a whole lot to the protocol, so it's possible that
>>> there really isn't much more work that needs to be done. :)
>>>
>>>> one nice thing about the csv approach is it should scale for any language
>>>> that can do standard io... even a shell or sed script.
>>>
>>> Agreed, but neither sed nor awk[1] can send gmetric packets natively;
>>> Perl at least has modules to help with it.
>>>
>>>> i wonder if/how-much a packet is cheaper than a fork? but yeah ideally
>>>
>>> I would be shocked if sending a UDP packet wasn't substantially cheaper
>>> than a full-blown fork (which then needs to *also* send a UDP packet).
>>>
>>>> you'd consolidate packets. as far as i can tell i haven't bumped into
>>>> packet-rate issues as much as packet-byte-rate issues, which would remain
>>>> even in a consolidated packet scenario.
>>>
>>> Interesting point. However, consolidating packets does imply a bit
>>> more pre-processing of the data on the sender's side (to glom the
>>> packets together, if nothing else), and that would tend to reduce
>>> the packet rate slightly. Gmond could probably be a little more
>>> efficient about memory allocation as well if it could allocate a
>>> larger lump of space at once for multiple metrics instead of dealing
>>> with each one piecemeal.
>>>
>>>
>>> [1] gawk has some very odd built-in "device files" that you can use.
>>> Check out the manpage for "Special File Names", and then look at the
>>> /inet/udp/lport/rhost/rport "files".
>>>
>>>
>>> --
>>> Jesse Becker
>>> Every cloud has a silver lining, except for the mushroom-shaped ones,
>>> which come lined with strontium-90.
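
A rough sketch of the "while xdr_decode()..." receive loop described above,
using the standard Sun RPC/XDR routines (libtirpc on newer systems): keep
decoding records out of one UDP datagram until the buffer is exhausted.  The
record layout assumed here (a message key, then type/name/value/units strings,
then slope/tmax/dmax) is a guess at the 3.0-era gmetric wire format, the
timestamp field Rick's fork adds is omitted, and the helper names are made up;
this is not code from his patch.

#include <rpc/rpc.h>
#include <stdio.h>

#define GM_MAX_STR 256                 /* arbitrary cap for this sketch */

/* Decode a single gmetric-style record.  Field order is an assumption. */
static bool_t decode_one(XDR *x)
{
    u_int key, slope, tmax, dmax;
    char *type = NULL, *name = NULL, *value = NULL, *units = NULL;

    bool_t ok = xdr_u_int(x, &key) &&          /* assumed message-format id */
                xdr_string(x, &type,  GM_MAX_STR) &&
                xdr_string(x, &name,  GM_MAX_STR) &&
                xdr_string(x, &value, GM_MAX_STR) &&
                xdr_string(x, &units, GM_MAX_STR) &&
                xdr_u_int(x, &slope) &&
                xdr_u_int(x, &tmax) &&
                xdr_u_int(x, &dmax);

    if (ok)
        printf("metric %s = %s %s (type %s)\n", name, value, units, type);

    /* xdr_string() allocates when handed a NULL pointer; xdr_free() is safe
     * even for fields that never got filled in. */
    xdr_free((xdrproc_t)xdr_string, (char *)&type);
    xdr_free((xdrproc_t)xdr_string, (char *)&name);
    xdr_free((xdrproc_t)xdr_string, (char *)&value);
    xdr_free((xdrproc_t)xdr_string, (char *)&units);
    return ok;
}

/* Called once per received datagram: instead of stopping after the first
 * record, keep going until the buffer runs dry or a decode fails. */
void handle_datagram(char *buf, u_int len)
{
    XDR x;
    xdrmem_create(&x, buf, len, XDR_DECODE);
    while (xdr_getpos(&x) < len && decode_one(&x))
        ;
    xdr_destroy(&x);
}

A truncated or trailing partial record simply ends the loop, and a packet
carrying a single record decodes exactly once.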
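
And a similarly hedged sketch of the client side, i.e. filling one UDP packet
with several XDR-encoded metrics before transmitting, which is the job Rick's
Ruby class does over a plain socket.  The field order, the message-key value,
and the send_bundle()/gm_metric names are all illustrative assumptions; a real
client would have to match whatever layout the patched gmond actually expects.

#include <arpa/inet.h>
#include <rpc/rpc.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define GM_MAX_STR 256

struct gm_metric {
    const char *type, *name, *value, *units;
    u_int slope, tmax, dmax;
};

/* Encode one record with the same assumed field order as the decode sketch. */
static bool_t encode_one(XDR *x, const struct gm_metric *m)
{
    u_int key = 0;                     /* assumed "user-defined metric" id */
    char *type  = (char *)m->type,  *name  = (char *)m->name;
    char *value = (char *)m->value, *units = (char *)m->units;
    u_int slope = m->slope, tmax = m->tmax, dmax = m->dmax;

    return xdr_u_int(x, &key) &&
           xdr_string(x, &type,  GM_MAX_STR) &&
           xdr_string(x, &name,  GM_MAX_STR) &&
           xdr_string(x, &value, GM_MAX_STR) &&
           xdr_string(x, &units, GM_MAX_STR) &&
           xdr_u_int(x, &slope) &&
           xdr_u_int(x, &tmax) &&
           xdr_u_int(x, &dmax);
}

/* Pack as many metrics as fit into one datagram-sized buffer, then send them
 * in a single UDP packet.  Returns how many metrics were actually sent. */
int send_bundle(const char *host, int port, const struct gm_metric *m, int n)
{
    char buf[1400];                    /* stay under a typical ethernet MTU */
    XDR x;
    int i, packed = 0;

    xdrmem_create(&x, buf, sizeof buf, XDR_ENCODE);
    for (i = 0; i < n; i++) {
        u_int mark = xdr_getpos(&x);
        if (!encode_one(&x, &m[i])) {  /* out of room: roll back, stop here */
            xdr_setpos(&x, mark);
            break;
        }
        packed++;
    }
    if (packed == 0) {                 /* nothing fit; send nothing */
        xdr_destroy(&x);
        return 0;
    }

    struct sockaddr_in to;
    memset(&to, 0, sizeof to);
    to.sin_family = AF_INET;
    to.sin_port = htons(port);
    if (inet_pton(AF_INET, host, &to.sin_addr) != 1)
        return -1;

    int s = socket(AF_INET, SOCK_DGRAM, 0);
    if (s < 0)
        return -1;
    if (sendto(s, buf, xdr_getpos(&x), 0,
               (struct sockaddr *)&to, sizeof to) < 0)
        packed = -1;
    close(s);
    xdr_destroy(&x);
    return packed;
}

Pointing it at the stock multicast channel, e.g.
send_bundle("239.2.11.71", 8649, metrics, n), also illustrates Rick's
compatibility caveat: an unmodified gmond would only see the first metric in
each packet.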

