[jira] Commented: (MAPREDUCE-323) Improve the way job history files are managed

Amareshwari Sriramadasu (JIRA) Thu, 29 Oct 2009 22:59:27 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771818#action_12771818
 ]


Amareshwari Sriramadasu commented on MAPREDUCE-323:
---------------------------------------------------

bq. I would also request for removing jobname from the history filename. 
This is done as part of MAPREDUCE-157. I will port the change to the Yahoo! 
distribution with this patch.

Options for the directory structure of the history files are:
# {$hadoop.log.dir}/history/done/YYYY-MM-DD/
# {$hadoop.log.dir}/history/done/YYYY-MM-DD/USER
# {$hadoop.log.dir}/history/done/USER/YYYY-MM-DD/
# {$hadoop.log.dir}/history/done/USER/YYYY/MM/DD
# {$hadoop.log.dir}/history/done/YYYY/MM/DD/USER

For the directory structure, I would go with option 1, because it is easy to 
maintain. We can add more when needed.
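
To illustrate option 1, here is a minimal sketch of how the done directory for a given day could be derived. The class and method names (HistoryDirLayout, doneDirFor) and the sample log directory are assumptions made for illustration; they are not part of Hadoop or of the patch.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, for illustration only; not part of Hadoop.
final class HistoryDirLayout {
    // ISO_LOCAL_DATE renders a date as YYYY-MM-DD, matching option 1.
    private static final DateTimeFormatter DAY = DateTimeFormatter.ISO_LOCAL_DATE;

    // Option 1: <logDir>/history/done/YYYY-MM-DD/
    static String doneDirFor(String logDir, LocalDate day) {
        return logDir + "/history/done/" + DAY.format(day) + "/";
    }

    public static void main(String[] args) {
        // "/var/log/hadoop" stands in for the value of hadoop.log.dir.
        System.out.println(doneDirFor("/var/log/hadoop", LocalDate.of(2009, 10, 29)));
        // prints /var/log/hadoop/history/done/2009-10-29/
    }
}
```

With one directory per day, dropping old history amounts to deleting whole directories older than the retention window.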

We can have a cache in the JobTracker to look up the history location for each 
jobid (it can be moved to the HistoryServer when we move history to a separate 
server). The JT can maintain the cache for the last 20 days of history 
(configurable).
Now, the file name of the history log file is <jobid>_<user>.log. A job id is 
about 20 characters long, and if the user name is about 25 characters, the 
jobhistory file name is about 50 bytes long. For a given jobid, the cache 
entry in the JT will be at most 100 bytes, so 50,000 such entries would take 
about 5MB. 
We can have a configuration to limit the number of entries in the cache, with 
a default value of 50,000.
Thus, the cache is bounded by the number of days for which it is maintained 
and is also capped by the number of entries.
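
A rough sketch of a cache bounded both ways (by age and by entry count) follows. It is only one possible shape, not the proposed implementation: the names HistoryLocationCache and CachedLocation, the use of LinkedHashMap, and passing the clock value as a parameter are all assumptions made for illustration. The sample sizes in main are deliberately tiny; the defaults discussed above would be 50,000 entries and 20 days.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, for illustration only; names are not from Hadoop.
final class HistoryLocationCache {
    private static final class CachedLocation {
        final String location;
        final long addedAtMillis;
        CachedLocation(String location, long addedAtMillis) {
            this.location = location;
            this.addedAtMillis = addedAtMillis;
        }
    }

    private final long maxAgeMillis;
    // Insertion-ordered map, so the eldest entry is the one added first.
    private final LinkedHashMap<String, CachedLocation> entries;

    HistoryLocationCache(int maxEntries, long maxAgeMillis) {
        this.maxAgeMillis = maxAgeMillis;
        this.entries = new LinkedHashMap<String, CachedLocation>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, CachedLocation> eldest) {
                // Cap on entry count: drop the oldest entry once over the limit.
                return size() > maxEntries;
            }
        };
    }

    synchronized void put(String jobId, String location, long nowMillis) {
        entries.remove(jobId); // re-insert so ordering follows the latest add
        entries.put(jobId, new CachedLocation(location, nowMillis));
    }

    // Returns the cached location, or null if absent or older than maxAgeMillis.
    synchronized String get(String jobId, long nowMillis) {
        CachedLocation e = entries.get(jobId);
        if (e == null) {
            return null;
        }
        if (nowMillis - e.addedAtMillis > maxAgeMillis) {
            entries.remove(jobId); // age limit: expire lazily on lookup
            return null;
        }
        return e.location;
    }

    synchronized int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        long day = 24L * 60 * 60 * 1000;
        HistoryLocationCache cache = new HistoryLocationCache(2, 20 * day);
        cache.put("job_1", "2009-10-01/job_1_amar.log", 0L);
        cache.put("job_2", "2009-10-02/job_2_amar.log", day);
        cache.put("job_3", "2009-10-03/job_3_amar.log", 2 * day); // evicts job_1
        System.out.println(cache.get("job_1", 2 * day));  // null (evicted by the cap)
        System.out.println(cache.get("job_2", 2 * day));  // 2009-10-02/job_2_amar.log
        System.out.println(cache.get("job_2", 22 * day)); // null (older than 20 days)
    }
}
```

Expiry here happens lazily on lookup; a periodic sweep would be another option if stale entries should not count against the cap.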

If the history location is not present in the JT cache, the JT history web UI 
does not show the page. 
An interested user can call the API Cluster.getJobHistoryUrl(JobID, boolean 
getFromDFS) to get the URL from the DFS if it is not present in the JT.
We can add *bin/hadoop job -historyurl <jobid>* to get the history URL for the 
jobid from the JT cache. We can add another argument to the command to get the 
history URL from the DFS if it is not present in the JT cache.
Then, HistoryViewer can be used to view the history on the command line. 
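
The lookup order described above (JT cache first, DFS only when asked) could be sketched as below. This is not the actual Cluster API: the class HistoryUrlLookup, the plain String job id, the Map standing in for the JT cache, and the dfsSearch function standing in for a DFS scan are all assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch, for illustration only; not the real Cluster API.
final class HistoryUrlLookup {
    private final Map<String, String> jtCache;                     // stand-in for the JT cache
    private final Function<String, Optional<String>> dfsSearch;    // stand-in for a DFS scan

    HistoryUrlLookup(Map<String, String> jtCache,
                     Function<String, Optional<String>> dfsSearch) {
        this.jtCache = jtCache;
        this.dfsSearch = dfsSearch;
    }

    // Mirrors the proposed getJobHistoryUrl(JobID, boolean getFromDFS) semantics:
    // answer from the cache when possible, and search the DFS only if requested.
    Optional<String> getJobHistoryUrl(String jobId, boolean getFromDFS) {
        String cached = jtCache.get(jobId);
        if (cached != null) {
            return Optional.of(cached);
        }
        return getFromDFS ? dfsSearch.apply(jobId) : Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, String> cache = new HashMap<>();
        cache.put("job_1", "history/done/2009-10-29/job_1_amar.log");
        HistoryUrlLookup lookup = new HistoryUrlLookup(cache, id ->
            id.equals("job_2")
                ? Optional.of("history/done/2009-09-01/job_2_amar.log")
                : Optional.empty());

        System.out.println(lookup.getJobHistoryUrl("job_1", false).isPresent()); // true
        System.out.println(lookup.getJobHistoryUrl("job_2", false).isPresent()); // false
        System.out.println(lookup.getJobHistoryUrl("job_2", true).isPresent());  // true
    }
}
```

Keeping the DFS search behind an explicit flag means the default path never triggers a potentially expensive directory scan.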

Thoughts?

> Improve the way job history files are managed
> ---------------------------------------------
>
>                 Key: MAPREDUCE-323
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-323
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker
>    Affects Versions: 0.21.0, 0.22.0
>            Reporter: Amar Kamat
>            Assignee: Amar Kamat
>            Priority: Critical
>
> Today all the jobhistory files are dumped in one _job-history_ folder. This 
> can cause problems when there is a need to search the history folder 
> (job-recovery etc). It would be nice if we group all the jobs under a _user_ 
> folder. So all the jobs for user _amar_ will go in _history-folder/amar/_. 
> Jobs can be categorized using various features like _jobid, date, jobname_ 
> etc but using _username_ will make the search much more efficient and also 
> will not result into namespace explosion. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
