[ https://issues.apache.org/jira/browse/HADOOP-1876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alejandro Abdelnur updated HADOOP-1876:
---------------------------------------

    Attachment: patch1876.txt

The CompletedJobStatusStore class performs three tasks:

* persists in DFS the status/profile/counters/completion-events of a JobInProgress instance
* reads from DFS, given a job ID, the status/profile/counters/completion-events of a job
* runs a daemon thread that once an hour cleans up persisted jobs that have exceeded their storage time

It is configured with two properties:

* the DFS directory where to persist the jobs (default: '/jobtracker/jobsInfo')
* the storage time job info must be kept in DFS before it is cleaned up (default: 0, which means no storage at all)

(Rough sketches of the store, the JobTracker wiring and the configuration follow the quoted issue below.)

---

Changes to the JobTracker:

The JobTracker creates a CompletedJobStatusStore at initialization time. When the finalizeJob() method is called, the CompletedJobStatusStore store() method is invoked to persist the job information. The getJobProfile()/getJobStatus()/getCounters()/getTaskCompletionEvents() methods call the corresponding get*() method of the CompletedJobStatusStore when the in-memory queues have no information about the requested job ID.

---

It includes a test case that verifies the persistence of job info across JobTracker restarts.

---

As job IDs include the JobTracker startup timestamp, there is no risk of job ID collision. And as the directory for persisting job info is configurable, the same DFS can be used by multiple JobTrackers without any risk of collision between job IDs from different JobTrackers in the unlikely case that their startup timestamps are identical.

---

The default behavior is a persistence time of 0 for job info, so in the default configuration nothing is written to DFS: the store()/read() methods are NOPs and the daemon thread is not started.

> Persisting completed jobs status
> --------------------------------
>
>                 Key: HADOOP-1876
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1876
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: patch1876.txt
>
>
> Currently the JobTracker keeps information about completed jobs in memory.
> This information is flushed from the cache when it has outlived #RETIRE_JOB_INTERVAL or when the limit of completed jobs in memory has been reached (#MAX_COMPLETE_USER_JOBS_IN_MEMORY).
> Also, if the JobTracker is restarted (due to being recycled or due to a crash), information about completed jobs is lost.
> If any of the above scenarios happens before the job information is queried by a Hadoop client (normally the job submitter or a monitoring component), there is no way to obtain such information.
> A way to avoid this is for the JobTracker to persist the completed job information in DFS upon job completion. This would be done at the time the job is moved to the completed jobs queue. Then, when the JobTracker is queried for information about a completed job and it is not found in the memory queue, a lookup in DFS would be done to retrieve the completed job information.
> A directory in DFS (under mapred/system) would be used to persist completed job information; for each completed job there would be a directory named with the job ID, and within that directory all the information about the job: status, job profile, counters and completion events.
> A configuration property will indicate how long persisted job information should be kept in DFS. After that period it will be cleaned up automatically.
> This improvement would not introduce API changes.
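A minimal sketch of the store's shape, written against the current org.apache.hadoop.fs.FileSystem API rather than the 0.16-era code; apart from the responsibilities described above, the property keys, method signatures and byte[] serialization are illustrative assumptions, not the patch's actual code:

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CompletedJobStatusStoreSketch implements Runnable {

  private final FileSystem fs;
  private final Path storeDir;     // where job info files live in DFS
  private final long retainMillis; // how long persisted info is kept
  private final boolean active;    // false when retention is 0 (the default)

  public CompletedJobStatusStoreSketch(Configuration conf) throws IOException {
    // Hypothetical property keys, mirroring the two configuration knobs.
    String dir = conf.get("jobstatus.store.dir", "/jobtracker/jobsInfo");
    long hours = conf.getLong("jobstatus.store.hours", 0);
    this.retainMillis = hours * 60L * 60L * 1000L;
    this.active = retainMillis > 0;
    this.fs = FileSystem.get(conf);
    this.storeDir = new Path(dir);
    if (active) {
      fs.mkdirs(storeDir);
    }
  }

  /** Persists a job's serialized status/profile/counters/completion events. */
  public void store(String jobId, byte[] jobInfo) throws IOException {
    if (!active) {
      return; // NOP in the default configuration
    }
    try (FSDataOutputStream out = fs.create(new Path(storeDir, jobId))) {
      out.write(jobInfo);
    }
  }

  /** Reads persisted job info back, or returns null if absent. */
  public byte[] read(String jobId) throws IOException {
    if (!active) {
      return null;
    }
    Path jobFile = new Path(storeDir, jobId);
    if (!fs.exists(jobFile)) {
      return null;
    }
    byte[] buf = new byte[(int) fs.getFileStatus(jobFile).getLen()];
    try (FSDataInputStream in = fs.open(jobFile)) {
      in.readFully(buf);
    }
    return buf;
  }

  /** Hourly cleanup loop: deletes entries older than the storage time. */
  @Override
  public void run() {
    if (!active) {
      return; // the daemon thread is never started in this case
    }
    while (!Thread.currentThread().isInterrupted()) {
      try {
        long cutoff = System.currentTimeMillis() - retainMillis;
        for (FileStatus st : fs.listStatus(storeDir)) {
          if (st.getModificationTime() < cutoff) {
            fs.delete(st.getPath(), true);
          }
        }
        Thread.sleep(60L * 60L * 1000L); // once an hour
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
      } catch (IOException ioe) {
        // a real implementation would log and keep running
      }
    }
  }
}
{code}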
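The JobTracker-side wiring described above could then look like the following hypothetical fragment: finalizeJob() both caches in memory and persists to DFS, and lookups fall back to the store when the in-memory queue no longer holds the job. It builds on the sketch class above; none of these names are the JobTracker's real members:

{code:java}
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class JobTrackerLookupSketch {

  private final Map<String, byte[]> completedJobs = new ConcurrentHashMap<>();
  private final CompletedJobStatusStoreSketch store;

  public JobTrackerLookupSketch(CompletedJobStatusStoreSketch store) {
    this.store = store;
  }

  /** On job completion: keep it in memory and persist it to DFS. */
  public void finalizeJob(String jobId, byte[] jobInfo) throws IOException {
    completedJobs.put(jobId, jobInfo);
    store.store(jobId, jobInfo);
  }

  /** In-memory queue first; DFS fallback if the job was already retired. */
  public byte[] getJobInfo(String jobId) throws IOException {
    byte[] info = completedJobs.get(jobId);
    return (info != null) ? info : store.read(jobId);
  }
}
{code}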
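Finally, a sketch of how a deployment might turn the store on, under the same assumed property keys. A non-zero retention activates persistence; the default of 0 leaves store()/read() as NOPs and the cleanup thread unstarted. Distinct directories are what let several JobTrackers share one DFS safely:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class StoreConfigExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Keep completed-job info for 24 hours (0, the default, disables the store).
    conf.setLong("jobstatus.store.hours", 24);
    // Give each JobTracker its own directory so two trackers sharing a DFS
    // cannot collide even if their startup timestamps happen to match.
    conf.set("jobstatus.store.dir", "/jobtracker/jobsInfo-jtA");

    CompletedJobStatusStoreSketch store = new CompletedJobStatusStoreSketch(conf);
    new Thread(store, "job-status-cleanup").start(); // hourly cleanup daemon
  }
}
{code}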