[
https://issues.apache.org/jira/browse/MAPREDUCE-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876594#action_12876594
]
Amar Kamat commented on MAPREDUCE-323:
--------------------------------------
A few comments:
# W.r.t. your [comment | http://tinyurl.com/2aado36], we could very well use the
finishtime of the job. It is already published in the job summary, stored
in the job status cache within the jobtracker and later archived to the
completed-job-status-store. Maybe we can reuse these features (i.e. the job
status cache and status store).
# We should log jobhistory activities like
## jobhistory folder regex used
## jobid to foldername mappings
Logging will help in debugging and post mortem analysis.
# Formats can change across runs. How do we plan to take care of that? One
thing we can do is to have a unique folder per pattern for storing the files.
The (unique) folder-name should be based on the jobhistory structure pattern.
This mapping of jobhistory folder regex to the foldername should be logged.
Clients that need really old jobhistory files analyzed will dig up the
jobhistory folder format, map it to the folder, and provide the _username_,
_jobid_ and _finishtime_ to get the file. The client can get the _username_ and
_finishtime_ by querying the JobTracker for the job status (via
completed-jobstatus-store). See _Future Steps #1_.
# How about keeping _N_ items in the top level directory and moving them to the
appropriate place only when the total item count crosses _N_?
Example (assume /done/%user/%jobid as the format and N=5):
## The first job gets added to /done/job1
## 5th job gets added to /done/job5
## 6th job gets added to /done/job6 and /done/job1 gets moved to
/done/user1/job1
## and so on
So the movement happens only on overflow. The benefit of this change is that
without any indexing, we can show the recent N jobs on the jobhistory webui.
This pattern can be enabled for all subfolders also. So if the jobhistory
format specified is %user/ then queries like '_give the recent 5 items for all
the users_' can also be answered quickly.
# Webui should provide 2 views
## top/recent few (show jobs from the topmost level folder)
## browse-able view where YYYY/MM/DD etc. is shown as is. This can be
configurable and turned off for complicated structures like 00/00/00-99 etc.,
which the users might not be able to make sense of. Also there should be some
kind of widget in JobHistory that, given _username_, _jobid_ and _finishtime_,
provides the complete jobhistory filename. See _Future steps #2_.
# bq. .... He raised the issue that a practical cluster has more distinct users
than we would want to create DFS directories, especially if the directory
structure is further split on timestamps.
I would prefer username to be one of the configuration options. Since it's
configurable, it can be turned off for clusters having lots of users.
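The move-on-overflow idea from the fourth point above could look roughly like the Python sketch below. It is only a toy in-memory model, not JobTracker or DFS code: the class name DoneFolder, its methods and the dict standing in for the per-user folders are all made up for illustration, and it assumes the /done/%user/%jobid format with N=5 from the example.

```python
from collections import OrderedDict


class DoneFolder:
    """Toy in-memory model of the done folder (hypothetical, not DFS code)."""

    def __init__(self, n=5):
        self.n = n                # max number of items kept at the top level
        self.top = OrderedDict()  # jobid -> user, oldest entry first
        self.archived = {}        # final path -> jobid

    def add(self, user, jobid):
        """Add a finished job; move the oldest jobs out only on overflow.

        Returns the list of final paths that jobs were moved to.
        """
        self.top[jobid] = user
        moved = []
        while len(self.top) > self.n:
            # Pop the oldest top-level entry and file it under its user.
            old_jobid, old_user = self.top.popitem(last=False)
            path = "/done/%s/%s" % (old_user, old_jobid)
            self.archived[path] = old_jobid
            moved.append(path)
        return moved

    def recent(self):
        """Most recent jobs first, read straight from the top level."""
        return list(reversed(self.top))
```

With N=5, adding job1 through job5 moves nothing; adding job6 moves job1 to /done/user1/job1, and recent() still lists the latest five jobs without any index.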
Future steps :
# As of today, we have jobhistory files directly dumped in the done folder. We
might want to move these files in the format we want (for a good user
experience). Maybe some kind of offline admin tool can help here (maybe under
mradmin?). It might make sense to name the final jobhistory file (leaf-level)
as $username_$jobid_$finishtime. This will enable us to restructure job
history files across various formats.
# There should be some way to find out which regex/format was used given the
jobtracker start time (which is one of the components in jobid). To make it
easier for clients, maybe the log files related to jobhistory updates can be
published or the JobTracker should be in a position to answer this.
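A rough Python sketch of the two future steps above, under stated assumptions: the leaf name is $username_$jobid_$finishtime, the jobid looks like job_<jobtracker start id>_<sequence>, and the format log is a hypothetical list of (jobtracker start id, format) pairs sorted by start id. None of the function names below exist in Hadoop; they are for illustration only.

```python
import bisect
import re

# Assumed leaf name layout: <username>_<jobid>_<finishtime>, where the jobid
# looks like job_<jobtracker start id>_<sequence number>.
_NAME_RE = re.compile(r"^(?P<user>.+)_(?P<jobid>job_\d+_\d+)_(?P<finish>\d+)$")


def history_filename(username, jobid, finishtime):
    """Build the leaf-level jobhistory file name."""
    return "%s_%s_%d" % (username, jobid, finishtime)


def parse_history_filename(name):
    """Split a leaf name back into (username, jobid, finishtime)."""
    m = _NAME_RE.match(name)
    if m is None:
        raise ValueError("not a jobhistory file name: %r" % name)
    return m.group("user"), m.group("jobid"), int(m.group("finish"))


def format_for_job(jobid, format_log):
    """Find the folder format in use when the job's jobtracker started.

    format_log is a hypothetical list of (jobtracker_start_id, format)
    pairs, sorted by start id; the latest entry not after the job's
    jobtracker start id wins.
    """
    jtstart = int(jobid.split("_")[1])
    starts = [start for start, _ in format_log]
    i = bisect.bisect_right(starts, jtstart) - 1
    if i < 0:
        raise LookupError("no format logged for %s" % jobid)
    return format_log[i][1]
```

Because the jobid part is matched explicitly, a username that itself contains underscores still parses, and the name alone carries enough information to re-file a history file under a different format.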
Thoughts?
> Improve the way job history files are managed
> ---------------------------------------------
>
> Key: MAPREDUCE-323
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-323
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: jobtracker
> Affects Versions: 0.21.0, 0.22.0
> Reporter: Amar Kamat
> Assignee: Dick King
> Priority: Critical
>
> Today all the jobhistory files are dumped in one _job-history_ folder. This
> can cause problems when there is a need to search the history folder
> (job-recovery etc). It would be nice if we group all the jobs under a _user_
> folder. So all the jobs for user _amar_ will go in _history-folder/amar/_.
> Jobs can be categorized using various features like _jobid, date, jobname_
> etc. but using _username_ will make the search much more efficient and also
> will not result in a namespace explosion.