Worth a look at OpenTSDB ( http://opentsdb.net/ ), as it doesn't lose
precision on historical data.
It also has some neat tricks around the collection and display of data.

Another useful tool is 'collectl' ( http://collectl.sourceforge.net/ ),
a lightweight Perl script that
captures and compresses the metrics, manages its metrics data
files, and then filters and presents
the metrics as requested.
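For what it's worth, the sort of invocation I use looks roughly like the sketch below. The flags are from memory (the subsystem letters and the -f/-r/-D options in particular), so treat them as assumptions and check `man collectl` on your version; the commands are echoed rather than executed so the sketch is safe to paste anywhere:

```shell
# The invocations below are printed rather than executed so the sketch
# is harmless to run as-is; drop the `echo`s to use them for real.
# Flags are from memory -- verify against `man collectl` first.

# Record CPU, disk, network and memory stats as a background daemon,
# rolling the data file at midnight and keeping 7 days' worth:
echo collectl -scdnm -f /var/log/collectl -r00:00,7 -D

# Later, play a recorded file back, filtered to just the CPU subsystem:
echo collectl -sc -p /var/log/collectl/HOSTNAME-20110524.raw.gz
```

The playback mode (-p) is the part I lean on most: it means the capture can run dumb and cheap, and all the filtering happens after the event.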

I find collectl lightweight and useful enough that I set it up to
capture everything and
then leave it running in the background on most systems I build,
because by the time you need the measurement
data the event is usually in the past and difficult to reproduce.
With collectl running I have a week to
recognise the event and analyse/save the relevant data file(s); a data
file is approx. 21MB/node/day gzipped.

With a little bit of bash, awk or perl scripting you can convert the
collectl output into a form that is easily
loadable into Pig.  Pig also has User Defined Functions (UDFs) that
can import the Hadoop job history, so
with some Pig Latin you can marry your infrastructure metrics with
your job metrics; a bit like the cluster
eating its own dog food.
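To give a flavour of the conversion step: collectl's plot-format output is a '#'-prefixed header followed by whitespace-separated columns, so a one-line awk filter gets it into tab-separated form for Pig's default PigStorage loader. The column layout below (Date Time CPU MemUsed) is invented for the example, so check what your own recording actually contains:

```shell
# Fake a little collectl-style plot output: a '#' header line, then
# space-separated columns.  The columns here are made up for
# illustration -- yours depend on which subsystems you recorded.
cat > sample.tab <<'EOF'
#Date Time CPU MemUsed
20110524 00:00:01 12 1842
20110524 00:00:11 87 1901
EOF

# Drop comment/header lines and re-emit the fields tab-separated,
# ready for Pig's default PigStorage('\t') loader.
awk '!/^#/ { OFS="\t"; $1=$1; print }' sample.tab > sample.tsv

cat sample.tsv
```

On the Pig side you would then load it with something along the lines of `raw = LOAD 'sample.tsv' AS (date:chararray, time:chararray, cpu:int, mem:int);` and join it against whatever your job-history UDF gives you.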

BTW, watch out for a little gotcha with Ganglia.  It doesn't seem to
report the full JVM metrics via gmond,
although if you output the JVM metrics to file you get a record for
each JVM on the node.  I haven't looked
into it in detail yet, but it looks like Ganglia only reports the last
JVM record in each batch. Anyone else seen
this?

Chris

On 24 May 2011 01:48, Tom Melendez <[email protected]> wrote:
> Hi Folks,
>
> I'm looking for tips, tricks and tools to get at node utilization to
> optimize our cluster.  I want to answer questions like:
> - what nodes ran a particular job?
> - how long did it take for those nodes to run the tasks for that job?
> - how/why did Hadoop pick those nodes to begin with?
>
> More detailed questions like
> - how much memory did the task for the job use on that node?
> - average CPU load on that node during the task run
>
> And more aggregate questions like:
> - are some nodes favored more than others?
> - utilization averages (generally, how many cores on that node are in use, 
> etc.)
>
> There are plenty more that I'm not asking, but you get the point?  So,
> what are you guys using for this?
>
> I see some mentions of Ganglia, so I'll definitely look into that.
> Anything else?  Anything you're using to monitor in real-time (like a
> 'top' across the nodes or something like that)?
>
> Any info or war-stories greatly appreciated.
>
> Thanks,
>
> Tom
>
