Thanks Chris, these are quite helpful.

Thanks,

Tom

On Tue, May 24, 2011 at 11:13 AM, Chris Smith <[email protected]> wrote:
> Worth a look at OpenTSDB ( http://opentsdb.net/ ), as it doesn't lose
> precision on historical data. It also has some neat tricks around the
> collection and display of data.
>
> Another useful tool is 'collectl' ( http://collectl.sourceforge.net/ ),
> a lightweight Perl script that captures and compresses the metrics,
> manages its metrics data files, and then filters and presents the
> metrics as requested.
>
> I find collectl lightweight and useful enough that I set it up to
> capture everything and leave it running in the background on most
> systems I build, because by the time you need the measurement data the
> event is usually in the past and difficult to reproduce. With collectl
> running I have a week to recognise the event and analyse/save the
> relevant data file(s); the data files come to roughly 21MB per node
> per day, gzipped.
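[As a sketch of the kind of always-on capture described above. The flag names are from collectl's documentation as I remember them, and the file path and name are invented for illustration; check `collectl --help` on your version before relying on this.]

```shell
# Capture CPU, disk, memory and network subsystems (-s cdmn) and write
# the raw data to files under /var/log/collectl; collectl compresses
# its raw files by default.
collectl -scdmn -f /var/log/collectl

# Later, play back a recorded file (hypothetical filename) and filter
# the view down to just CPU and disk:
collectl -p /var/log/collectl/node01-20110524.raw.gz -scd
```

collectl also has its own log rollover/retention options, which is presumably how Chris gets the one-week window he mentions.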
>
> With a little bash, awk, or perl scripting you can convert the
> collectl output into a form that loads easily into Pig. Pig also has
> User Defined Functions (UDFs) that can import the Hadoop job history,
> so with some Pig Latin you can marry your infrastructure metrics with
> your job metrics; a bit like the cluster eating its own dog food.
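[A minimal sketch of that conversion step. The sample lines and field layout below are invented for illustration; real collectl plot-format output depends on which subsystems you record. The shape of the awk pass is the point:]

```shell
# Invented stand-in for collectl plot-format output: a '#' header line
# followed by whitespace-separated samples.
cat > sample.raw <<'EOF'
#Date Time CPU SysCPU User Nice
20110524 00:00:01 12 3 9 0
20110524 00:00:02 45 10 35 0
EOF

# Drop comment/header lines and emit tab-separated date, time and total
# CPU -- a shape Pig can load directly with PigStorage('\t').
awk '!/^#/ { print $1 "\t" $2 "\t" $3 }' sample.raw > sample.tsv
cat sample.tsv
```

The resulting TSV could then be loaded with something like `raw = LOAD 'sample.tsv' USING PigStorage('\t') AS (date:chararray, time:chararray, cpu:int);` and joined against the job-history relation from the UDFs.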
>
> BTW, watch out for a little gotcha with Ganglia. It doesn't seem to
> report the full JVM metrics via gmond, although if you output the JVM
> metrics to file you get a record for each JVM on the node. I haven't
> looked into it in detail yet, but it looks like Ganglia only reports
> the last JVM record in each batch. Has anyone else seen this?
>
> Chris
>
> On 24 May 2011 01:48, Tom Melendez <[email protected]> wrote:
>> Hi Folks,
>>
>> I'm looking for tips, tricks and tools to get at node utilization to
>> optimize our cluster.  I want to answer questions like:
>> - what nodes ran a particular job?
>> - how long did it take for those nodes to run the tasks for that job?
>> - how/why did Hadoop pick those nodes to begin with?
>>
>> More detailed questions like:
>> - how much memory did the task for the job use on that node?
>> - average CPU load on that node during the task run
>>
>> And more aggregate questions like:
>> - are some nodes favored more than others?
>> - utilization averages (generally, how many cores on a node are in
>> use, etc.)
>>
>> There are plenty more that I'm not asking, but you get the point. So,
>> what are you guys using for this?
>>
>> I see some mentions of Ganglia, so I'll definitely look into that.
>> Anything else?  Anything you're using to monitor in real-time (like a
>> 'top' across the nodes or something like that)?
>>
>> Any info or war-stories greatly appreciated.
>>
>> Thanks,
>>
>> Tom
>>
>
