[
https://issues.apache.org/jira/browse/MAPREDUCE-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877614#action_12877614
]
Chris Douglas commented on MAPREDUCE-323:
-----------------------------------------
The scope of this issue has not been well defined. The designs are arguing
about the correct subset of a database to implement for JobHistory, leaving a
wide range of known (and as Allen points out, unknown) use cases ill served.
This will not converge quickly.
For purposes of consensus, this issue is a bug; the _existing_ functionality is
not handled efficiently. It should go without saying that the design should not
be over-specific to today's use cases, but the issue's focus should remain on
solving the problems cited and servicing the use cases already in the system.
This is a misbehaving component, not a project implementing a small database in
HDFS. Perhaps the title should change to reflect this.
There are 3 operations to support (please amend as necessary):
# Lookup by JobID. This should not be worse than O(log n) (and should be
O(1)), as it is a frequent operation.
# Find a set of jobs run by a particular user
# Find a set of jobs with names matching a regex
(2) and (3) can require a scan, but the cost should be bounded. If there are
common operator activities (such as archiving old history), then the layout
should support them, but arbitrary queries are out of scope.
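To make the three operations concrete, here is a minimal in-memory sketch. The JobHistoryIndex class, its Entry type, and the scanLimit parameter are hypothetical names for illustration only; none of them exist in Hadoop, and a real fix would work against the history directory layout rather than a map held in memory. The sketch also assumes that "matching a regex" means the pattern is found anywhere in the job name.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical illustration only; not a Hadoop class.
class JobHistoryIndex {
    static final class Entry {
        final String jobId;
        final String user;
        final String jobName;

        Entry(String jobId, String user, String jobName) {
            this.jobId = jobId;
            this.user = user;
            this.jobName = jobName;
        }
    }

    // Operation (1): hash lookup by JobID, expected O(1).
    private final Map<String, Entry> byId = new HashMap<>();
    // Entries kept newest first so that scans see recent jobs before old ones.
    private final Deque<Entry> newestFirst = new ArrayDeque<>();

    void add(Entry e) {
        byId.put(e.jobId, e);
        newestFirst.addFirst(e);
    }

    Entry lookup(String jobId) {
        return byId.get(jobId);
    }

    // Operation (2): scan for a user's jobs, examining at most scanLimit entries.
    List<Entry> findByUser(String user, int scanLimit) {
        return scan(scanLimit, e -> e.user.equals(user));
    }

    // Operation (3): scan for job names containing a match for the regex.
    List<Entry> findByName(String regex, int scanLimit) {
        Pattern p = Pattern.compile(regex);
        return scan(scanLimit, e -> p.matcher(e.jobName).find());
    }

    // Bounded scan: the cost depends on scanLimit, not on total history size.
    private List<Entry> scan(int scanLimit, Predicate<Entry> keep) {
        List<Entry> out = new ArrayList<>();
        int examined = 0;
        for (Entry e : newestFirst) {
            if (examined++ >= scanLimit) {
                break;
            }
            if (keep.test(e)) {
                out.add(e);
            }
        }
        return out;
    }
}
```

The point of the sketch is only that (1) never scans, while (2) and (3) do scan but stop after a fixed number of entries.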
The problem with the flat hierarchy is, obviously, the cost of listing files
in both the JobTracker and the NameNode. This can be ameliorated, somewhat, by
HDFS-1091 and HDFS-985, but further optimizations/caching are possible if one
can assume that recent entries are more relevant.
Dick/[Doug|https://issues.apache.org/jira/browse/MAPREDUCE-323?focusedCommentId=12771987&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12771987]'s
format looks sound to me. Amar identified many complexities in implementing
the configurable-schema, mini-database proposal. In my opinion, while the
solutions are feasible, they are not worth their cost here; a simpler fix
serves this issue better.
I particularly like the idea of bounding scans of JobHistory to _n_ entries,
unless the user requests a deeper search. Caching recent entries, metadata
about which subdirectories are sufficient for _n_ entries, etc. are all
reasonable optimizations, but adopting the new layout should be sufficient for
this issue. Agreed?
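As a rough sketch of the "which subdirectories are sufficient for _n_ entries" metadata: assume the new layout yields subdirectories that can be ordered newest first, and that a per-directory entry count is cached somewhere. Both are assumptions of this example, as are the ScanPlanner name and the date-style directory names in the usage note. Under those assumptions, the directories a bounded scan must list can be chosen without touching the rest of the history:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not a Hadoop class.
class ScanPlanner {
    /**
     * Returns the shortest newest-first prefix of directories whose cached
     * entry counts add up to at least n. The map must iterate newest first
     * (for example, a LinkedHashMap filled in that order). If the whole
     * history holds fewer than n entries, every directory is returned.
     */
    static List<String> directoriesFor(Map<String, Integer> countsNewestFirst, int n) {
        List<String> chosen = new ArrayList<>();
        int covered = 0;
        for (Map.Entry<String, Integer> dir : countsNewestFirst.entrySet()) {
            if (covered >= n) {
                break; // enough entries already covered; skip older directories
            }
            chosen.add(dir.getKey());
            covered += dir.getValue();
        }
        return chosen;
    }
}
```

For example, with cached counts of 40, 70 and 90 for three directories ordered newest first, a request for 100 entries needs only the first two directories to be listed; a deeper search simply passes a larger _n_.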
> Improve the way job history files are managed
> ---------------------------------------------
>
> Key: MAPREDUCE-323
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-323
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: jobtracker
> Affects Versions: 0.21.0, 0.22.0
> Reporter: Amar Kamat
> Assignee: Dick King
> Priority: Critical
>
> Today all the jobhistory files are dumped in one _job-history_ folder. This
> can cause problems when there is a need to search the history folder
> (job-recovery etc). It would be nice if we group all the jobs under a _user_
> folder. So all the jobs for user _amar_ will go in _history-folder/amar/_.
> Jobs can be categorized using various features like _jobid, date, jobname_
> etc but using _username_ will make the search much more efficient and also
> will not result in a namespace explosion.