Chris Smith <csmithx@...> writes:

> 
> Worth a look at OpenTSDB ( http://opentsdb.net/ ) as it doesn't lose
> precision on the historical data.  It also has some neat tricks around
> the collection and display of data.
> 
> Another useful tool is 'collectl' ( http://collectl.sourceforge.net/ ),
> a lightweight Perl script that both captures and compresses the
> metrics, manages its metrics data files, and then filters and presents
> the metrics as requested.
> 
> I find collectl lightweight and useful enough that I set it up to
> capture everything and then leave it running in the background on most
> systems I build, because when you need the measurement data the event
> is usually in the past and difficult to reproduce.  With collectl
> running I have a week to recognise the event and analyse/save the
> relevant data file(s); the data files run to approx. 21MB/node/day gzipped.
> 
> With a little bit of bash, awk, or perl scripting you can convert the
> collectl output into a form easily loadable into Pig.  Pig also has
> User Defined Functions (UDFs) that can import the Hadoop job history,
> so with some Pig Latin you can marry your infrastructure metrics with
> your job metrics; a bit like the cluster eating its own dog food.
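The conversion Chris describes can be sketched in a few lines of awk.  The sample data below is purely illustrative (made-up columns and values), since the real layout depends on which subsystems you tell collectl to capture:

```shell
# Sketch only: mimic collectl's space-separated plot output, strip the
# '#' header line, and emit tab-separated rows for Pig.  The column
# layout here (date, time, cpu user%, cpu sys%) is invented for
# illustration; your real files depend on the -s subsystems captured.
cat <<'EOF' > /tmp/collectl-sample.tab
#Date Time CPU-User% CPU-Sys%
20110524 00:00:01 12 3
20110524 00:00:02 45 7
EOF
awk '!/^#/ { OFS="\t"; print $1, $2, $3, $4 }' /tmp/collectl-sample.tab
```

Tab-separated rows like these can then be picked up by Pig's default PigStorage loader.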
> 
I'm the author of collectl and Chris gave a very good description.  While 
collectl can't do everything, it can do quite a lot.  One thing that is also 
very useful is its ability to generate output in a plottable, space-separated 
format, which is exactly what gnuplot wants.  So if you install collectl-utils, 
you can use the colplot tool, which is basically a web-based front end to 
gnuplot.  You click on a few buttons and out pop one or more plots of various 
types of data.
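As a sketch of that workflow (the switches below are from memory, so verify them against the man page for your collectl version, and the filename is hypothetical):

```shell
# Record cpu, disk and network stats (-s cdn) to a directory in the
# plottable space-separated format (-P), writing files under -f.
# Switches quoted from memory; check 'man collectl' on your version.
collectl -scdn -P -f /var/log/collectl

# Later, play a captured raw file back in the same plottable form
# (the filename here is a made-up example of collectl's naming):
collectl -p /var/log/collectl/node1-20110524.raw.gz -P -scdn > node1.tab
```

colplot then drives gnuplot against exactly this kind of file.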

There was also a question asked about a 'top' across nodes.  Another tool in 
the utils kit is 'colmux', the collectl multiplexer.  It will run instances of 
collectl on multiple systems (or against log files for historical data) and 
display the output in a refreshing screen much like top, sorted by any column 
you choose.  So you can look at top processes, top disks, top memory, etc.  
You can even reverse the sort to find idle systems, disks, etc.  Invaluable 
for cluster monitoring.
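An invocation might look like the following; treat the switches and host names as illustrative and check colmux's help output, since I'm quoting them from memory:

```shell
# Run collectl with cpu/memory switches (-scm) on three hypothetical
# nodes and sort the merged, top-like display by column 5.  The -addr,
# -command and -column switches should be verified against your
# installed colmux version.
colmux -addr node1,node2,node3 -command "-scm" -column 5
```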

If you have any specific questions, how about sending them to the collectl 
mailing list or posting them on sourceforge?  Just makes it easier for me to 
spot.

> BTW, watch out for a little gotcha with Ganglia.  It doesn't seem to
> report the full jvm metrics via gmond, although if you output the jvm
> metrics to file you get a record for each jvm on the node.  I haven't
> looked into it in detail yet but it looks like Ganglia only reports
> the last jvm record in each batch.  Anyone else seen this?

One thing you should be careful about when using ganglia is how it plots 
data.  It normalizes the output based on the number of data points you're 
trying to display and the width of your plot.  So if you have a lot of samples 
you won't necessarily see spikes or outliers.  Fine for a high-level view, but 
it can be really misleading if you're looking for short bursts of high values, 
as they just won't be there.
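To put a number on the effect (hypothetical figures, chosen only for illustration): one-second samples squeezed into a 600-pixel-wide daily plot means each pixel averages about 144 samples, so a single one-second burst to 100% on an otherwise 5% baseline plots at under 6%:

```shell
# Hypothetical example: 86400 one-second samples drawn at 600 pixels
# gives 86400/600 = 144 samples averaged per pixel.  One 100% spike
# among 143 samples at 5% averages to (100 + 143*5)/144, i.e. the
# spike all but vanishes from the plot.
awk 'BEGIN {
  spp = 86400 / 600                      # samples per pixel
  printf "%.1f%%\n", (100 + (spp - 1) * 5) / spp
}'
```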

-mark

> Chris
> 
> On 24 May 2011 01:48, Tom Melendez <tom@...> wrote:
> > Hi Folks,
> >
> > I'm looking for tips, tricks and tools to get at node utilization to
> > optimize our cluster.  I want to answer questions like:
> > - what nodes ran a particular job?
> > - how long did it take for those nodes to run the tasks for that job?
> > - how/why did Hadoop pick those nodes to begin with?
> >
> > More detailed questions like:
> > - how much memory did the task for the job use on that node?
> > - average CPU load on that node during the task run
> >
> > And more aggregate questions like:
> > - are some nodes favored more than others?
> > - utilization averages (generally, how many cores on that node are in use, etc.)
> >
> > There are plenty more that I'm not asking, but you get the point?  So,
> > what are you guys using for this?
> >
> > I see some mentions of Ganglia, so I'll definitely look into that.
> > Anything else?  Anything you're using to monitor in real-time (like a
> > 'top' across the nodes or something like that)?
> >
> > Any info or war-stories greatly appreciated.
> >
> > Thanks,
> >
> > Tom
> >
> 
> 
