[jira] Commented: (MAPREDUCE-323) Improve the way job history files are managed

Doug Cutting (JIRA) Fri, 30 Oct 2009 09:40:23 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771987#action_12771987
 ]


Doug Cutting commented on MAPREDUCE-323:
----------------------------------------

> Options for the directory structure of the history files are

Nick Rettinghouse, Tim Williamson, and Rajiv Chittajallu all suggested a 
preference for per-hour directories, in particular, USER/YYYY/MM/DD/HH, an 
option you did not list.  Should we perhaps err on the side of a deeper 
structure, to ensure that we don't have to re-structure things again?

I like the idea of a cache of recent jobs in the JobTracker.  This can be 
initialized by walking this directory tree, then maintained incrementally.  
However, implementing Cluster.getJobHistoryUrl() would be expensive for 
archived jobs, since the JobTracker would have to search the entire directory 
tree.
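
To make the cache idea concrete, here is a rough sketch in Python rather than 
the JobTracker's Java.  Everything named here is hypothetical: the class 
RecentJobCache, its capacity, and the assumptions that each history file is 
named after its job ID and that file names sort in submission order.  It only 
illustrates "initialize by walking the tree, then maintain incrementally".

```python
import os
from collections import OrderedDict


class RecentJobCache:
    """Bounded in-memory map from job ID to history file path (hypothetical)."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self._jobs = OrderedDict()  # oldest entry first

    def add(self, job_id, path):
        """Incremental maintenance: record a job, evicting the oldest if full."""
        self._jobs[job_id] = path
        self._jobs.move_to_end(job_id)
        while len(self._jobs) > self.capacity:
            self._jobs.popitem(last=False)

    def get(self, job_id):
        """Return the cached path, or None if the job is not in the cache."""
        return self._jobs.get(job_id)

    def __len__(self):
        return len(self._jobs)

    @classmethod
    def from_tree(cls, root, capacity=1000):
        """One-time initialization by walking the history directory tree."""
        found = []
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                # Assumption: the file name is the job ID.
                found.append((name, os.path.join(dirpath, name)))
        found.sort()  # assumption: names sort in job-submission order
        cache = cls(capacity)
        for job_id, path in found[-capacity:]:
            cache.add(job_id, path)
        return cache
```

A miss in get() is exactly the expensive case: the job is archived and the 
only way to find it is another walk of the tree.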

Perhaps the directory structure should instead be based purely on the job ID?  
E.g., something like:
  jobtrackerstarttime/00/00/00
  jobtrackerstarttime/00/00/01
  ...
  jobtrackerstarttime/00/00/99
  jobtrackerstarttime/00/01/00
etc.
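
As a sketch of how such a path could be derived from a job ID alone (Python 
for brevity; the function names are hypothetical, and I am assuming IDs of the 
form job_<jobtrackerstarttime>_<number> with the number split into two-digit 
components):

```python
def split_job_id(job_id):
    """Split an ID like 'job_200910300940_0123' into (start_time, number).

    The ID layout is an assumption made for this sketch.
    """
    prefix, start_time, number = job_id.split("_")
    if prefix != "job":
        raise ValueError("not a job ID: %r" % job_id)
    return start_time, int(number)


def history_path(job_id):
    """Map a job ID to jobtrackerstarttime/NN/NN/NN using base-100 digits."""
    start_time, number = split_job_id(job_id)
    high, rest = divmod(number, 10_000)  # top-level entry
    mid, low = divmod(rest, 100)         # second level, then last component
    return "%s/%02d/%02d/%02d" % (start_time, high, mid, low)
```

Under these assumptions job number 99 maps to .../00/00/99 and job number 100 
to .../00/01/00, matching the listing above, and no directory search is needed 
to locate a job.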

Only if a jobtracker ran more than 1M jobs would its top-level directory have 
more than 100 entries.  Constructing the cache of recent jobs would be fast, as 
would Cluster.getJobHistoryUrl(JobID).  Access to jobs in the cache by user id, 
date, etc. could be fast, since the cache is in memory.  Access to older jobs 
by user id, date, etc. would not be supported.

As an enhancement, we might later place index files in the higher-level 
directories, listing job ids sorted by username, date, etc.  These might be 
written to leaf directories after the 100th job is added to a directory, and to 
non-leaves after the 10,000th job is added, etc.  They could be generated from 
the cache.  With such indexes, user and time-based queries to the archives 
could be resolved in O(log n) time.
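
A minimal sketch of such a lookup, assuming an index that has already been 
loaded into memory as a list of (username, job id) pairs sorted by username. 
The function name and the index layout are hypothetical; nothing here reflects 
an existing file format.

```python
from bisect import bisect_left


def jobs_for_user(index, user):
    """Return the job IDs for one user from a sorted (username, job_id) list.

    Binary search finds the first entry for the user in O(log n); the loop
    then reads only that user's entries.
    """
    # (user,) sorts before every (user, job_id) pair, so this is the first
    # position at which an entry for this user could appear.
    start = bisect_left(index, (user,))
    result = []
    for name, job_id in index[start:]:
        if name != user:
            break
        result.append(job_id)
    return result
```

The same approach would work for a date-sorted index by swapping the key.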

> Improve the way job history files are managed
> ---------------------------------------------
>
>                 Key: MAPREDUCE-323
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-323
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker
>    Affects Versions: 0.21.0, 0.22.0
>            Reporter: Amar Kamat
>            Assignee: Amareshwari Sriramadasu
>            Priority: Critical
>
> Today all the jobhistory files are dumped in one _job-history_ folder. This 
> can cause problems when there is a need to search the history folder 
> (job-recovery etc). It would be nice if we group all the jobs under a _user_ 
> folder. So all the jobs for user _amar_ will go in _history-folder/amar/_. 
> Jobs can be categorized using various features like _jobid, date, jobname_ 
> etc., but using _username_ will make the search much more efficient and 
> will not result in namespace explosion. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
