[
https://issues.apache.org/jira/browse/MAPREDUCE-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879567#action_12879567
]
Dick King commented on MAPREDUCE-323:
-------------------------------------
I believe that it is agreed that we need a directory structure other than a
single directory holding all of the history files.
That being said, the question is how the directory tree should be organized.
The use cases are:
1: There is a job history web API, implemented by {{jobhistory.jsp}}, that
allows users to search the job history files to retrieve information on single
or multiple jobs meeting certain criteria. In particular, web users can search
for jobs with a certain user, and jobs whose job name contains a certain
substring.
After a search, the current API allows the user to page through the data. They
get told the total number of matching jobs, and they can browse pages of data,
with 100 jobs per page. They can access the first and last page from any page,
and from any page they can access any of the previous or following five pages
[if there are that many].
2: During restart, we perform searches for specific quadruples of jobtracker
ID, job ID, username and jobname. This may be redundant, but that's what we do
in the current code base.
3: I understand that some installations archive tranches of job history files
periodically, usually by date.
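The paging behaviour in use case 1 can be sketched roughly as follows. This
is only an illustration of the rules stated above (100 jobs per page, links to
the first and last page, and up to five pages on either side); the function
names are hypothetical and are not taken from {{jobhistory.jsp}}.

```python
from math import ceil

JOBS_PER_PAGE = 100   # page size stated in use case 1
NEIGHBOUR_PAGES = 5   # previous/following pages reachable from any page


def page_links(total_jobs, current_page):
    """Return the 1-based page numbers reachable from current_page."""
    last_page = max(1, ceil(total_jobs / JOBS_PER_PAGE))
    if not 1 <= current_page <= last_page:
        raise ValueError("current_page out of range")
    low = max(1, current_page - NEIGHBOUR_PAGES)
    high = min(last_page, current_page + NEIGHBOUR_PAGES)
    neighbours = [p for p in range(low, high + 1) if p != current_page]
    return {"first": 1, "last": last_page, "neighbours": neighbours}


def page_slice(jobs, current_page):
    """Return the jobs shown on current_page (hypothetical helper)."""
    start = (current_page - 1) * JOBS_PER_PAGE
    return jobs[start:start + JOBS_PER_PAGE]
```

Note that computing the last page requires the total number of matching jobs,
which in turn means finding every match before the first page can be shown.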
Here is how the proposed structure supports these use cases, with
considerable scaling and responsiveness improvements:
1: If I use a subdirectory structure based on jobtracker IDs and then dates and
then high order digits of the jobid serial number, then the performance of each
of these three use cases can be improved. I described potential improvements
of use case 1 on 14/Jun/10 at 09:38 PM. To summarize, you will be able to
browse by dates and time ranges as well as by the other criteria, and
performance will be improved as we only search the subset of the directories we
need to satisfy the query or to present the first page of the results.
If we make changes along these lines we will no longer present to the user the
total number of matching jobs. One of the complaints that led to this jira
is, after all, the possibility of a scaling problem if there are too many jobs.
2: Because searches are restricted to specific directories, the namenode will
have to generate a lot less data, and there will be a lot less client-side
filtering as well if you have directories consisting of only 1000 jobs
[2000 files].
3: We could archive a day's results by harchiving a date subdirectory.
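As a rough sketch of the layout described above, the snippet below builds a
subdirectory from the jobtracker ID, the date, and the high-order digits of
the job serial number, and lists the date directories a date-range query would
need to visit. The exact path format, the zero padding, and the function names
are assumptions made for illustration only; they are not part of any patch on
this issue.

```python
from datetime import date, timedelta

JOBS_PER_DIR = 1000  # from the comment: about 1000 jobs [2000 files] per directory


def history_subdir(jobtracker_id, submit_date, job_serial):
    """Hypothetical layout: <jobtracker-id>/<yyyy>/<mm>/<dd>/<serial-prefix>."""
    # High-order digits of the serial number: drop the low three digits.
    serial_prefix = job_serial // JOBS_PER_DIR
    return "{}/{:04d}/{:02d}/{:02d}/{:06d}".format(
        jobtracker_id, submit_date.year, submit_date.month,
        submit_date.day, serial_prefix)


def date_dirs(jobtracker_id, start, end):
    """Date directories a query for the inclusive range [start, end] must list."""
    days = (end - start).days
    return ["{}/{:04d}/{:02d}/{:02d}".format(jobtracker_id, d.year, d.month, d.day)
            for d in (start + timedelta(days=n) for n in range(days + 1))]
```

With a layout like this, a query limited to a date range only lists the
matching date directories, and archiving a day amounts to archiving one of
those directories.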
> Improve the way job history files are managed
> ---------------------------------------------
>
> Key: MAPREDUCE-323
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-323
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: jobtracker
> Affects Versions: 0.21.0, 0.22.0
> Reporter: Amar Kamat
> Assignee: Dick King
> Priority: Critical
>
> Today all the jobhistory files are dumped in one _job-history_ folder. This
> can cause problems when there is a need to search the history folder
> (job-recovery etc). It would be nice if we group all the jobs under a _user_
> folder. So all the jobs for user _amar_ will go in _history-folder/amar/_.
> Jobs can be categorized using various features like _jobid, date, jobname_
> etc but using _username_ will make the search much more efficient and also
> will not result into namespace explosion.