[
https://issues.apache.org/jira/browse/MAPREDUCE-323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Allen Wittenauer updated MAPREDUCE-323:
---------------------------------------
Release Note:
This patch does four things:
* it changes the directory structure of the done directory that holds the
history logs of completed jobs,
* it builds toy databases for completed jobs, so we no longer have to scan 2N
files on DFS to find out facts about the N jobs that have completed since the
job tracker started [which can be hundreds of thousands of files in practical
cases],
* it changes the job history browser to display more information and allow more
filtering criteria, and
* it creates a new programmatic interface for finding history files that match
user-chosen criteria. This frees users from depending on how we store those
files, which in turn lets us change the storage scheme at will.
The new API mentioned in the last bullet, which can be used to programmatically
obtain history file Paths given search criteria, looks like this:
package org.apache.hadoop.mapreduce.jobhistory;

...

// This class is nested within O.A.H.mapreduce.jobhistory.JobHistory.
// It holds information about one job history log in the done
// job history logs.
public static class JobHistoryJobRecord {
  public Path getPath() { ... }
  public String getJobIDString() { ... }
  public long getSubmitTime() { ... }
  public String getUserName() { ... }
  public String getJobName() { ... }
}

public class JobHistoryRecordRetriever implements Iterator<JobHistoryJobRecord> {
  // the usual Iterator methods -- remove() throws UnsupportedOperationException

  // returns the number of calls to next() that will succeed
  public int numMatches() { ... }
}

// Returns a JobHistoryRecordRetriever that delivers the Paths of all matching
// job history files, in no particular order. Any criterion that is null or the
// empty string does not constrain. All specified criteria are applied
// conjunctively, except that when more than one date is given, Paths matching
// ANY of the dates are retrieved.
// soughtUser and soughtJobid must match exactly.
// soughtJobName can match the entire job name or any substring of it.
// Dates must be in exactly the format MM/DD/YYYY.
// Dates' leading digits must be 2's. We're incubating a Y3K problem.
public JobHistoryRecordRetriever getMatchingJob
    (String soughtUser, String soughtJobName, String[] dateStrings,
     String soughtJobid)
  throws IOException
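
For illustration only, here is a minimal usage sketch against the skeleton
above. It assumes getMatchingJob is an instance method of JobHistory, that
JobHistoryRecordRetriever is a top-level class in the same package as the
skeleton suggests, and that a configured JobHistory instance is already at
hand; the surrounding names (HistorySearchExample, listJobsForUser, history)
are hypothetical and not part of the patch.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.jobhistory.JobHistory;
import org.apache.hadoop.mapreduce.jobhistory.JobHistory.JobHistoryJobRecord;
import org.apache.hadoop.mapreduce.jobhistory.JobHistoryRecordRetriever;

public class HistorySearchExample {
  // Print every history log for the given user submitted on 08/20/2010.
  // Job name and job id are left empty, so they do not constrain the search.
  static void listJobsForUser(JobHistory history, String user) throws IOException {
    JobHistoryRecordRetriever matches =
        history.getMatchingJob(user, "", new String[] { "08/20/2010" }, "");
    System.out.println(matches.numMatches() + " matching job(s) for " + user);
    while (matches.hasNext()) {
      JobHistoryJobRecord record = matches.next();
      Path logPath = record.getPath();
      System.out.println(record.getJobIDString() + " \"" + record.getJobName()
          + "\" submitted at " + record.getSubmitTime() + " -> " + logPath);
    }
  }
}

Because callers go through the retriever rather than reading the done directory
directly, code written this way should keep working if the storage layout
changes again.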
> Improve the way job history files are managed
> ---------------------------------------------
>
> Key: MAPREDUCE-323
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-323
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: jobtracker
> Affects Versions: 0.21.0, 0.22.0
> Reporter: Amar Kamat
> Assignee: Dick King
> Priority: Critical
> Fix For: 0.20.203.0
>
> Attachments: MR323--2010-08-20--1533.patch,
> MR323--2010-08-25--1632.patch, MR323--2010-08-27--1359.patch,
> MR323--2010-08-27--1613.patch, MR323--2010-09-07--1636.patch
>
>
> Today all the jobhistory files are dumped in one _job-history_ folder. This
> can cause problems when there is a need to search the history folder
> (job-recovery etc). It would be nice if we grouped all the jobs under a _user_
> folder, so all the jobs for user _amar_ would go in _history-folder/amar/_.
> Jobs can be categorized by various features like _jobid_, _date_, _jobname_,
> etc., but using _username_ will make the search much more efficient and also
> will not result in a namespace explosion.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)