[
https://issues.apache.org/jira/browse/MAPREDUCE-4680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13463143#comment-13463143
]
Sandy Ryza commented on MAPREDUCE-4680:
---------------------------------------
I just looked at the code again, and I think I misunderstood it the first
time, so I want to make sure we're on the same page. Currently, all the
yyyy/mm/dd directories are gathered and sorted in ascending order by time.
We then go through them, deleting old files, until we reach a directory
that is young enough, at which point we halt. I had thought that job
history files inside dd/ directories that were too young were being
examined, but they are not.
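To make sure we're describing the same pass, here is roughly my
understanding of the current behavior (only a sketch against the
FileSystem API; the class and method names and the doneRoot/maxAgeMs
parameters are made up for illustration, not the actual cleaner code):

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CurrentCleanerSketch {
  /** Delete history files older than maxAgeMs under doneRoot/yyyy/mm/dd. */
  static void clean(FileSystem fs, Path doneRoot, long maxAgeMs)
      throws IOException {
    long cutoff = System.currentTimeMillis() - maxAgeMs;

    // Gather every yyyy/mm/dd leaf directory: one listStatus per level,
    // regardless of how old or young the directory is.
    List<FileStatus> dayDirs = new ArrayList<FileStatus>();
    for (FileStatus year : fs.listStatus(doneRoot)) {
      for (FileStatus month : fs.listStatus(year.getPath())) {
        for (FileStatus day : fs.listStatus(month.getPath())) {
          dayDirs.add(day);
        }
      }
    }

    // Sort ascending by time, oldest directories first.
    Collections.sort(dayDirs, new Comparator<FileStatus>() {
      public int compare(FileStatus a, FileStatus b) {
        long diff = a.getModificationTime() - b.getModificationTime();
        return diff < 0 ? -1 : (diff > 0 ? 1 : 0);
      }
    });

    // Walk oldest-first and delete old files; halt at the first directory
    // that is young enough -- its contents are never examined.
    for (FileStatus day : dayDirs) {
      if (day.getModificationTime() >= cutoff) {
        break;
      }
      for (FileStatus file : fs.listStatus(day.getPath())) {
        if (file.getModificationTime() < cutoff) {
          fs.delete(file.getPath(), false);
        }
      }
    }
  }
}
{code}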
The load on HDFS could be decreased further by pruning on the directory
names themselves: if the max age is 2 years and it's currently 2012,
nothing under the 2011 dir can be past the cutoff, so we would never need
to look any deeper than its name (and likewise for month directories). But
would this be worthwhile? It would only make a difference if the max
history age were greater than a month (the default is a week), in which
case it could save a listStatus for each month of age.
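Concretely, the pruning I have in mind would look something like the
following (again just a sketch; parsing the directory names with
Integer.parseInt is an assumption, and real code would have to guard
against non-numeric names):

{code:java}
import java.io.IOException;
import java.util.Calendar;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PruningCleanerSketch {
  /** Skip whole yyyy/ and yyyy/mm/ subtrees that cannot hold old files. */
  static void clean(FileSystem fs, Path doneRoot, long maxAgeMs)
      throws IOException {
    long cutoffMs = System.currentTimeMillis() - maxAgeMs;
    Calendar cutoff = Calendar.getInstance();
    cutoff.setTimeInMillis(cutoffMs);
    int cy = cutoff.get(Calendar.YEAR);
    int cm = cutoff.get(Calendar.MONTH) + 1; // Calendar months are 0-based
    int cd = cutoff.get(Calendar.DAY_OF_MONTH);

    for (FileStatus year : fs.listStatus(doneRoot)) {
      int y = Integer.parseInt(year.getPath().getName());
      // e.g. with a 2-year max age in 2012 the cutoff falls in 2010, so
      // the 2011 and 2012 dirs are skipped with no listStatus below them.
      if (y > cy) {
        continue;
      }
      for (FileStatus month : fs.listStatus(year.getPath())) {
        int m = Integer.parseInt(month.getPath().getName());
        if (y == cy && m > cm) {
          continue; // the whole month is younger than the cutoff
        }
        for (FileStatus day : fs.listStatus(month.getPath())) {
          int d = Integer.parseInt(day.getPath().getName());
          if (y == cy && m == cm && d > cd) {
            continue; // the whole day is younger than the cutoff
          }
          // Only now pay for a listStatus and per-file timestamp checks.
          for (FileStatus file : fs.listStatus(day.getPath())) {
            if (file.getModificationTime() < cutoffMs) {
              fs.delete(file.getPath(), false);
            }
          }
        }
      }
    }
  }
}
{code}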
If that isn't worthwhile, I could still make the cleaner delete the old
folders themselves.
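As a sketch of that fallback (this is one reading of "delete the old
folders"; the helper name and the emptiness check are mine, not the
existing code):

{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OldFolderCleanup {
  // After the per-file pass, drop any dd/ directory left empty so the
  // yyyy/mm/dd tree does not accumulate stale empty directories.
  static void deleteIfEmpty(FileSystem fs, Path dayDir) throws IOException {
    if (fs.listStatus(dayDir).length == 0) {
      // Non-recursive delete: it refuses, rather than destroys, a
      // directory that picked up new files between check and delete.
      fs.delete(dayDir, false);
    }
  }
}
{code}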
> Job history cleaner should only check timestamps of files in old enough
> directories
> -----------------------------------------------------------------------------------
>
> Key: MAPREDUCE-4680
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4680
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: jobhistoryserver
> Affects Versions: 2.0.0-alpha
> Reporter: Sandy Ryza
>
> Job history files are stored in yyyy/mm/dd folders. Currently, the job
> history cleaner checks the modification date of each file in every one of
> these folders to see whether it's past the maximum age. The load on HDFS
> could be reduced by only checking the ages of files in directories that are
> old enough, as determined by their name.