Thanks Chris, these are quite helpful.

Thanks,
Tom

On Tue, May 24, 2011 at 11:13 AM, Chris Smith <[email protected]> wrote:
> Worth a look at OpenTSDB ( http://opentsdb.net/ ), as it doesn't lose
> precision on the historical data. It also has some neat tricks around
> the collection and display of data.
>
> Another useful tool is 'collectl' ( http://collectl.sourceforge.net/ ),
> a lightweight Perl script that both captures and compresses the
> metrics, manages its metrics data files, and then filters and presents
> the metrics as requested.
>
> I find collectl lightweight and useful enough that I set it up to
> capture everything and leave it running in the background on most
> systems I build, because when you need the measurement data the event
> is usually in the past and difficult to reproduce. With collectl
> running I have a week to recognise the event and analyse/save the
> relevant data file(s); the data files are approx. 21MB/node/day
> gzipped.
>
> With a little bit of bash, awk, or perl scripting you can convert the
> collectl output into a form easily loadable into Pig. Pig also has
> User Defined Functions (UDFs) that can import the Hadoop job history,
> so with some Pig Latin you can marry your infrastructure metrics with
> your job metrics; a bit like the cluster eating its own dog food.
>
> BTW, watch out for a little gotcha with Ganglia. It doesn't seem to
> report the full JVM metrics via gmond, although if you output the JVM
> metrics to file you get a record for each JVM on the node. I haven't
> looked into it in detail yet, but it looks like Ganglia only reports
> the last JVM record in each batch. Anyone else seen this?
>
> Chris
>
> On 24 May 2011 01:48, Tom Melendez <[email protected]> wrote:
>> Hi Folks,
>>
>> I'm looking for tips, tricks and tools to get at node utilization to
>> optimize our cluster. I want to answer questions like:
>> - what nodes ran a particular job?
>> - how long did it take for those nodes to run the tasks for that job?
>> - how/why did Hadoop pick those nodes to begin with?
>>
>> More detailed questions like:
>> - how much memory did the task for the job use on that node?
>> - average CPU load on that node during the task run
>>
>> And more aggregate questions like:
>> - are some nodes favored more than others?
>> - utilization averages (generally, how many cores on that node are in
>> use, etc.)
>>
>> There are plenty more that I'm not asking, but you get the point. So,
>> what are you guys using for this?
>>
>> I see some mentions of Ganglia, so I'll definitely look into that.
>> Anything else? Anything you're using to monitor in real-time (like a
>> 'top' across the nodes or something like that)?
>>
>> Any info or war-stories greatly appreciated.
>>
>> Thanks,
>>
>> Tom
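For anyone trying the collectl-to-Pig conversion Chris describes, a minimal sketch of the awk step. The sample file and its field layout here are hypothetical; real collectl plot-mode output (the -P flag) is space-separated with '#'-prefixed header lines, and the actual columns depend on which subsystems (-s flags) you record.

```shell
# Hypothetical sample of collectl plot-mode output; the column names
# are illustrative only, not collectl's exact header.
cat > sample.raw <<'EOF'
#Date Time CPU SysTime UserTime
20110524 11:13:01 12 3 9
20110524 11:13:02 15 4 11
EOF

# Drop the '#' header/comment lines and re-emit each record
# tab-separated, which Pig's default PigStorage('\t') loader reads
# directly. $1=$1 forces awk to rebuild the record using OFS.
awk '!/^#/ { OFS="\t"; $1=$1; print }' sample.raw > sample.tsv

cat sample.tsv
```

From Pig you could then load it with something along the lines of `LOAD 'sample.tsv' USING PigStorage('\t')`, declaring a schema that matches whichever collectl columns you kept.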
