[ 
https://issues.apache.org/jira/browse/MAPREDUCE-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876594#action_12876594
 ] 

Amar Kamat commented on MAPREDUCE-323:
--------------------------------------

A few comments:
# W.r.t. your [comment | http://tinyurl.com/2aado36], we could very well use the 
finishtime of the job. It is already published in the job summary, stored 
in the job status cache within the jobtracker, and later archived to the 
completed-job-status-store. Maybe we can reuse these features (i.e. the job 
status cache and status store).
# We should log jobhistory activities like 
  ## jobhistory folder regex used
  ## jobid to foldername mappings
  Logging will help in debugging and post mortem analysis.
# Formats can change across runs. How do we plan to take care of that? One 
thing we can do is to have a unique folder per pattern for storing the files. 
The (unique) folder-name should be based on the jobhistory structure pattern. 
This mapping of jobhistory folder regex to the foldername should be logged. 
  Clients that need really old jobhistory files analyzed will dig up the 
jobhistory folder format, map it to the folder, and provide the _username_, _jobid_ 
and _finishtime_ to get the file. The client can get the _username_ and 
_finishtime_ by querying the JobTracker for the job status (via 
completed-jobstatus-store). See _Future Steps #1_.
# How about keeping _N_ items in the top level directory and moving them to the 
appropriate place only when the total item count crosses _N_? 
  Example (assume /done/%user/%jobid as the format and N=5)
  ## The first job gets added to /done/job1
  ## 5th job gets added to /done/job5
  ## 6th job gets added to /done/job6 and /done/job1 gets moved to 
/done/user1/job1
  ## and so on
So the movement happens only on overflow. The benefit of this change is that 
without any indexing, we can show the recent N jobs on the jobhistory webui. 
This pattern can be enabled for all subfolders also. So if the jobhistory 
format specified is %user/ then queries like '_give the recent 5 items for all 
the users_' can also be answered quickly.
# Webui should provide 2 views
   ## top/recent few (show jobs from the topmost level folder)
   ## browse-able view where YYYY/MM/DD etc is shown as it is. This can be 
configurable and turned off for complicated structures like 00/00/00-99 etc, 
which the users might not be able to make sense of. Also there should be some kind 
of widget in JobHistory that, given _username_, _jobid_ and _finishtime_, 
provides the complete jobhistory filename. See _Future steps #2_.
# bq. .... He raised the issue that a practical cluster has more distinct users 
than we would want to create DFS directories, especially if the directory 
structure is further split on timestamps.
I would prefer username to be one of the configuration options. Since it's 
configurable, it can be turned off for clusters having lots of users.
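
To make the overflow idea in the fourth comment concrete, here is a minimal in-memory sketch. It is only an illustration, not proposed JobTracker code: the class name RecentJobsSketch and its methods are hypothetical, the format is assumed to be /done/%user/%jobid as in the example, and the "move" is modelled by returning the target path, where a real implementation would presumably rename the file on the DFS.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

/** Hypothetical model of "keep N items at the top level, move only on overflow". */
class RecentJobsSketch {
    private final String root;
    private final int n;
    // Oldest job at the head, newest at the tail; each entry is {user, jobid}.
    private final Deque<String[]> topLevel = new ArrayDeque<>();

    RecentJobsSketch(String root, int n) {
        this.root = root;
        this.n = n;
    }

    /**
     * Adds a finished job to the top-level folder. If that pushes the item
     * count past N, the oldest job leaves the top level and its destination
     * (root/user/jobid) is returned; otherwise null is returned.
     */
    String add(String user, String jobId) {
        topLevel.addLast(new String[] {user, jobId});
        if (topLevel.size() <= n) {
            return null; // no overflow, nothing moves
        }
        String[] oldest = topLevel.removeFirst();
        return root + "/" + oldest[0] + "/" + oldest[1];
    }

    /** Top-level paths, newest first; this is what a "recent N" view would list. */
    List<String> recent() {
        List<String> paths = new ArrayList<>();
        Iterator<String[]> it = topLevel.descendingIterator();
        while (it.hasNext()) {
            paths.add(root + "/" + it.next()[1]);
        }
        return paths;
    }

    public static void main(String[] args) {
        RecentJobsSketch done = new RecentJobsSketch("/done", 5);
        for (int i = 1; i <= 6; i++) {
            String moved = done.add("user1", "job" + i);
            System.out.println("added job" + i
                + (moved == null ? "" : ", moved oldest to " + moved));
        }
        System.out.println("recent: " + done.recent());
    }
}
```

With N=5, the sixth add is the first one that returns a path (/done/user1/job1), which matches the example above. The same structure could be kept per subfolder if the pattern is enabled below the top level.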

Future steps :
# As of today, we have jobhistory files directly dumped in the done folder. We 
might want to move these files into the format we want (for a good user 
experience). Maybe some kind of offline admin tool can help here (maybe under 
mradmin?). It might make sense to name the final jobhistory file (leaf-level) 
as $username_$jobid_$finishtime. This will enable us to restructure job 
history files across various formats. 
# There should be some way to find out which regex/format was used given the 
jobtracker start time (which is one of the components in jobid). To make it 
easier for clients, maybe the log files related to jobhistory updates can be 
published or the JobTracker should be in a position to answer this.
Thoughts? 
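
As a rough illustration of Future step #1, the sketch below builds and parses a leaf name of the form $username_$jobid_$finishtime and derives the path for a given format. Everything here is an assumption made for the example: the class HistoryFileNameSketch is hypothetical, the jobid is assumed to look like job_<jobtracker start time>_<sequence number>, and the finishtime is assumed to be a plain number (e.g. epoch milliseconds).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for leaf names of the form $username_$jobid_$finishtime. */
class HistoryFileNameSketch {
    // Assumption: jobid is job_<digits>_<digits> and finishtime is all digits.
    // The username may itself contain underscores, so it is matched greedily.
    private static final Pattern NAME =
        Pattern.compile("^(.+)_(job_\\d+_\\d+)_(\\d+)$");

    static String format(String user, String jobId, long finishTime) {
        return user + "_" + jobId + "_" + finishTime;
    }

    /** Returns {username, jobid, finishtime}; throws if the name does not match. */
    static String[] parse(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a history file name: " + fileName);
        }
        return new String[] {m.group(1), m.group(2), m.group(3)};
    }

    /** Expands %user and %jobid in a folder format and appends the leaf name. */
    static String pathFor(String root, String folderFormat, String fileName) {
        String[] parts = parse(fileName);
        String dir = folderFormat.replace("%user", parts[0]).replace("%jobid", parts[1]);
        return root + "/" + dir + "/" + fileName;
    }

    public static void main(String[] args) {
        String name = format("amar", "job_201006081234_0001", 1276000000000L);
        System.out.println(name);
        System.out.println(pathFor("/done", "%user", name));
        System.out.println(pathFor("/done", "%user/%jobid", name));
    }
}
```

Because the leaf name carries all three fields, an offline tool could recompute the target folder for any new format from the file name alone, without consulting the JobTracker.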

> Improve the way job history files are managed
> ---------------------------------------------
>
>                 Key: MAPREDUCE-323
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-323
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker
>    Affects Versions: 0.21.0, 0.22.0
>            Reporter: Amar Kamat
>            Assignee: Dick King
>            Priority: Critical
>
> Today all the jobhistory files are dumped in one _job-history_ folder. This 
> can cause problems when there is a need to search the history folder 
> (job-recovery etc). It would be nice if we group all the jobs under a _user_ 
> folder. So all the jobs for user _amar_ will go in _history-folder/amar/_. 
> Jobs can be categorized using various features like _jobid, date, jobname_ 
> etc but using _username_ will make the search much more efficient and also 
> will not result in namespace explosion. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
