There are several ways to get this information, but all of them seem
suboptimal and Hadoop version dependent.  The Job History subsystem
has fragmented into several different collection methods, and there is
currently no standard one.  I will list the known methods below, but
Chukwa needs to be updated in order to collect Job History
information.

1. For Hadoop 0.20.[0-2], Chukwa has a hook into
JobTrackerInstrumentation that lets the JobTracker dispatch a command
to the Chukwa Agent to stream the JobHistory file when a job starts (a
rough sketch of the dispatch follows).  Unfortunately, the
JobTracker's internal code for listing job history files in the log
directory was poorly written: when many jobs have been written,
looking up a job history file name can stall the JobTracker.  The
JobTrackerInstrumentation API is not carried forward to Hadoop 0.23,
so this method should be considered deprecated.
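
For reference, the hook amounts to opening a socket to the agent's
control port and issuing an "add adaptor" command when a job starts.
Below is a minimal sketch of that dispatch; the adaptor class, the
"JobHistory" data type, and the default control port 9093 reflect a
typical Chukwa 0.4 setup and are my assumptions, not the exact code in
the hook:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.net.Socket;

    public class JobHistoryDispatch {
        // Ask a local Chukwa agent to start tailing a job history file.
        public static void notifyAgent(String historyFile) throws Exception {
            Socket sock = new Socket("localhost", 9093); // agent control port
            try {
                Writer out =
                    new OutputStreamWriter(sock.getOutputStream(), "UTF-8");
                // "add <adaptor class> <data type> <file> <offset>" registers
                // a file-tailing adaptor; offset 0 streams the whole file.
                out.write("add org.apache.hadoop.chukwa.datacollection"
                    + ".adaptor.filetailer.CharFileTailingAdaptorUTF8"
                    + "NewLineEscaped JobHistory " + historyFile + " 0\n");
                out.flush();
                BufferedReader in = new BufferedReader(
                    new InputStreamReader(sock.getInputStream(), "UTF-8"));
                System.out.println("agent replied: " + in.readLine());
            } finally {
                sock.close();
            }
        }
    }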

2. With the capacity scheduler, the job history file is loaded into
the user's log directory after the job is done.  A secondary process
could monitor that directory, detect newly added log files, and load
the data into Chukwa (sketched below).  This is more of a hack than a
working method, and I don't recommend implementing this approach.
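
If someone did want to prototype the watcher, it is just a polling
loop that diffs directory listings and hands new files to Chukwa
(e.g., via the agent "add" command sketched above).  The directory
layout and polling interval here are assumptions:

    import java.io.File;
    import java.util.HashSet;
    import java.util.Set;

    public class HistoryDirWatcher {
        // Poll a per-user job history directory and report files that
        // have appeared since the last scan.
        public static void watch(File dir, long intervalMs)
                throws InterruptedException {
            Set<String> seen = new HashSet<String>();
            while (true) {
                File[] files = dir.listFiles();
                if (files != null) {
                    for (File f : files) {
                        if (seen.add(f.getPath())) {
                            // New since last scan; load it into Chukwa here.
                            System.out.println("new history file: " + f);
                        }
                    }
                }
                Thread.sleep(intervalMs);
            }
        }
    }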

3. Aggregate the MapReduce client trace log through a log4j
SocketAppender.  The Chukwa conf directory contains a
hadoop-log4j.properties file that is designed to be deployed with
Hadoop (an illustrative excerpt follows below).  With this
configuration, the task tracker writes out mr_clienttrace.log and
streams the data to Chukwa.  The MapReduce client trace log records
the activity of each task attempt; it can be aggregated by task, then
by job id, to form the basis of a job utilization analysis.  However,
mr_clienttrace does not indicate when a job starts or finishes, so the
JobTracker would still need to send out the JobHistory summary by some
means that I have not fully explored in the current MapReduce
framework.
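
For illustration, the relevant part of such a configuration routes the
task tracker's client trace logger through a log4j SocketAppender to a
local agent.  The logger name matches Hadoop 0.20's TaskTracker client
trace log; the appender name and port are assumptions, so check
conf/hadoop-log4j.properties for the shipped values:

    # Send TaskTracker client trace events to a local Chukwa agent.
    log4j.logger.org.apache.hadoop.mapred.TaskTracker.clienttrace=INFO,CLIENTTRACE
    log4j.additivity.org.apache.hadoop.mapred.TaskTracker.clienttrace=false

    log4j.appender.CLIENTTRACE=org.apache.log4j.net.SocketAppender
    log4j.appender.CLIENTTRACE.RemoteHost=localhost
    log4j.appender.CLIENTTRACE.Port=9096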

The third approach has existed in Hadoop for a while, and it might be
a good starting point for showing job utilization.  It would be nice
if a couple of developers would volunteer to help put the puzzle
together.

regards,
Eric

On Wed, Jul 6, 2011 at 9:41 AM, Preetam Patil <[email protected]> wrote:
> Thanks Eric.
> I was mainly interested in the per-job/task info, similar to that provided
> by UserDailySummary.pig
> because per-job/task info seems to be missing from the HBase info.
> Any pointers on how to get a table of jobs into HBase are welcome :-)
> -preetam
>
> On Wed, Jul 6, 2011 at 9:12 PM, Eric Yang <[email protected]> wrote:
>>
>> Hi Preetam,
>>
>> ClusterSummary.pig is the only one that works with HBase.  Other pig
>> scripts are designed to work on sequence files for Chukwa 0.4.  The
>> scripts are thrown together at crunch time.  There is no
>> documentation.  The ChukwaLoader/ChukwaStore function needs to be
>> revised to use HBase to bring the scripts up to date with Chukwa 0.5.
>> Hadoop_*.pig scripts are for down-sampling data from raw resolution
>> into a specified time resolution, e.g., 30-minute or 180-minute
>> averages.  UserDailySummary.pig is designed to aggregate data
>> from JobHistory log files to generate a user usage report.  However,
>> it was designed to work on JobHistory files from Hadoop 0.18.  I don't
>> think it works with Hadoop 0.20+ because JobHistory format changed in
>> Hadoop 0.20.
>>
>> regards,
>> Eric
>>
>> On Wed, Jul 6, 2011 at 5:04 AM, Preetam Patil <[email protected]>
>> wrote:
>> > Hi,
>> > I notice that there are a bunch of Pig scripts in the scripts/pig
>> > directory, but only ClusterSummary.pig seems to be mentioned in the
>> > documentation.  The other scripts seem to be based on a storage model
>> > other than HBase, yet they provide more info (e.g., per-job/task
>> > stats) than what is stored in HBase.
>> > Are these compatible with 0.5, and if not, what needs to be done to get
>> > them working and
>> > where can I find any API info for them?
>> > Thanks,
>> > -preetam
>> >
>
>
