[ 
https://issues.apache.org/jira/browse/MAPREDUCE-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877614#action_12877614
 ] 

Chris Douglas commented on MAPREDUCE-323:
-----------------------------------------

The scope of this issue has not been well defined. The designs are arguing 
about the correct subset of a database to implement for JobHistory, leaving a 
wide range of known (and as Allen points out, unknown) use cases ill served. 
This will not converge quickly.

For purposes of consensus, this issue is a bug; the _existing_ functionality is 
not handled efficiently. It should go without saying that the design should not 
be over-specific to today's use cases, but the issue's focus should remain on 
solving the problems cited and servicing the use cases already in the system. 
This is a misbehaving component, not a project implementing a small database in 
HDFS. Perhaps the title should change to reflect this.

There are 3 operations to support (please amend as necessary):
# Lookup by JobID. This should not be worse than O(log n) (and should be 
O(1)), as it is a frequent operation.
# Find a set of jobs run by a particular user.
# Find a set of jobs with names matching a regex.

(2) and (3) can require a scan, but the cost should be bounded. If there are 
common operator activities (like archiving old history, etc.), then the layout 
should support those, but arbitrary queries are out of scope.
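
A minimal sketch of the three operations above, assuming an in-memory list of 
records ordered newest first. The names JobRecord, HistoryIndex, and 
max_scanned are hypothetical and are not part of JobHistory; the sketch only 
illustrates an O(1) lookup next to scans whose cost is capped.

```python
import re
from collections import namedtuple

# Hypothetical record type; the field names are illustrative only.
JobRecord = namedtuple("JobRecord", ["job_id", "user", "name"])


class HistoryIndex:
    """Toy in-memory index over job history records (assumed newest first)."""

    def __init__(self, records):
        self._records = list(records)
        # Dictionary keyed by job ID for constant-time lookup on average.
        self._by_id = {r.job_id: r for r in self._records}

    def lookup(self, job_id):
        # Operation 1: lookup by JobID, O(1) on average.
        return self._by_id.get(job_id)

    def _scan(self, predicate, max_scanned):
        # Examine at most max_scanned of the newest records, so the cost
        # of a scan is bounded regardless of how much history exists.
        return [r for r in self._records[:max_scanned] if predicate(r)]

    def jobs_by_user(self, user, max_scanned=1000):
        # Operation 2: jobs run by a particular user (bounded scan).
        return self._scan(lambda r: r.user == user, max_scanned)

    def jobs_by_name(self, pattern, max_scanned=1000):
        # Operation 3: jobs whose name matches a regex (bounded scan).
        regex = re.compile(pattern)
        return self._scan(lambda r: regex.search(r.name) is not None,
                          max_scanned)
```

A caller wanting a deeper search would pass a larger max_scanned; the default 
of 1000 is an arbitrary placeholder.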

The problem with the flat hierarchy is, obviously, the cost of listing files 
in both the JobTracker and the NameNode. This can be ameliorated, somewhat, by 
HDFS-1091 and HDFS-985, but further optimizations/caching are possible if one 
can assume that recent entries are more relevant.

Dick/[Doug|https://issues.apache.org/jira/browse/MAPREDUCE-323?focusedCommentId=12771987&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12771987]'s 
format looks sound to me. Amar identified many complexities in implementing 
the configurable-schema, mini-database proposal. In my opinion, while the 
solutions are feasible, a simpler fix for this issue is preferable to paying 
the costs of solving those problems.

I particularly like the idea of bounding scans of JobHistory to _n_ entries, 
unless the user requests a deeper search. Caching recent entries, metadata 
about which subdirectories are sufficient for _n_ entries, etc. are all 
reasonable optimizations, but adopting the new layout should be sufficient for 
this issue. Agreed?
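
To illustrate bounding a scan to _n_ entries, here is a rough sketch that 
assumes history files are grouped into subdirectories whose names sort 
chronologically. The date-style names and the recent_entries helper are 
hypothetical and are not taken from the proposed format.

```python
def recent_entries(subdirs, n):
    """Collect up to n entries, visiting the newest subdirectories first.

    subdirs maps a sortable subdirectory name (assumed here to be a date
    string) to the list of history entries it holds, newest first.
    Returns (entries, visited), where visited lists the subdirectories
    that had to be read; a caller could cache that list as metadata.
    """
    entries, visited = [], []
    for name in sorted(subdirs, reverse=True):
        if len(entries) >= n:
            break  # enough entries found; older directories are untouched
        visited.append(name)
        # Take only as many entries as are still needed to reach n.
        entries.extend(subdirs[name][: n - len(entries)])
    return entries, visited
```

Under this assumption only the newest subdirectories are listed for a typical 
query, and a deeper search is simply a call with a larger n.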

> Improve the way job history files are managed
> ---------------------------------------------
>
>                 Key: MAPREDUCE-323
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-323
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker
>    Affects Versions: 0.21.0, 0.22.0
>            Reporter: Amar Kamat
>            Assignee: Dick King
>            Priority: Critical
>
> Today all the jobhistory files are dumped in one _job-history_ folder. This 
> can cause problems when there is a need to search the history folder 
> (job-recovery etc). It would be nice if we group all the jobs under a _user_ 
> folder. So all the jobs for user _amar_ will go in _history-folder/amar/_. 
> Jobs can be categorized using various features like _jobid, date, jobname_ 
> etc., but using _username_ will make the search much more efficient and 
> will not result in a namespace explosion. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
