[ https://issues.apache.org/jira/browse/MAPREDUCE-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771987#action_12771987 ]
Doug Cutting commented on MAPREDUCE-323:
----------------------------------------

> Options for the directory structure of the history files are

Nick Rettinghouse, Tim Williamson, and Rajiv Chittajallu all suggested a preference for per-hour directories, in particular USER/YYYY/MM/DD/HH, an option you did not list. Should we perhaps err on the side of a deeper structure, to ensure that we don't have to restructure things again?

I like the idea of a cache of recent jobs in the JobTracker. This can be initialized by walking this directory tree, then maintained incrementally. However, implementing Cluster.getJobHistoryUrl() would be expensive for archived jobs, since the jobtracker would have to search the entire directory tree.

Perhaps the directory structure should instead be based purely on the job ID? E.g., something like:

  jobtrackerstarttime/00/00/00
  jobtrackerstarttime/00/00/01
  ...
  jobtrackerstarttime/00/00/99
  jobtrackerstarttime/00/01/00
  etc.

Only if a jobtracker ran more than 1M jobs would its top-level directory have more than 100 entries. Constructing the cache of recent jobs would be fast, as would Cluster.getJobHistoryUrl(JobID). Access to jobs in the cache by user id, date, etc. could be fast, since the cache is in memory. Access to older jobs by user id, date, etc. would not be supported.

As an enhancement, we might later place index files in the higher-level directories, listing job ids sorted by username, date, etc. These might be written to leaf directories after the 100th job is added to a directory, and to non-leaves after the 10,000th job is added, etc. They could be generated from the cache. With such indexes, user and time-based queries to the archives could be resolved in O(log n) time.
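
To make the job-ID-based layout above concrete, here is a minimal sketch of how a job ID could be mapped to its history path. It assumes job IDs have the form job_<jobtracker start time>_<sequence number>, and it splits the sequence number into three two-digit levels as in the listing above. The function name history_path is hypothetical and is not part of any Hadoop API.

```python
def history_path(job_id: str) -> str:
    """Map a job ID to a history path of the form starttime/AA/BB/CC.

    Assumes the ID looks like 'job_<jobtracker start time>_<sequence number>'.
    """
    parts = job_id.split("_")
    if len(parts) != 3 or parts[0] != "job":
        raise ValueError(f"unexpected job ID format: {job_id!r}")
    jt_start, seq = parts[1], int(parts[2])

    # Split the sequence number into base-100 digits: 10,000 jobs per
    # top-level entry, 100 jobs per second-level (leaf) directory.
    top, rest = divmod(seq, 10_000)
    mid, leaf = divmod(rest, 100)

    # The top level grows past two digits only after 1M jobs.
    return f"{jt_start}/{top:02d}/{mid:02d}/{leaf:02d}"


if __name__ == "__main__":
    print(history_path("job_200910301234_0001"))
    print(history_path("job_200910301234_0100"))
```

Because the path is computed from the job ID alone, a lookup such as Cluster.getJobHistoryUrl(JobID) would need no directory search under this scheme.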

> Improve the way job history files are managed
> ---------------------------------------------
>
>                 Key: MAPREDUCE-323
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-323
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker
>   Affects Versions: 0.21.0, 0.22.0
>            Reporter: Amar Kamat
>            Assignee: Amareshwari Sriramadasu
>            Priority: Critical
>
> Today all the jobhistory files are dumped in one _job-history_ folder. This
> can cause problems when there is a need to search the history folder
> (job-recovery etc). It would be nice if we group all the jobs under a _user_
> folder. So all the jobs for user _amar_ will go in _history-folder/amar/_.
> Jobs can be categorized using various features like _jobid, date, jobname_
> etc but using _username_ will make the search much more efficient and also
> will not result in namespace explosion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.