[ 
https://issues.apache.org/jira/browse/HDFS-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014335#comment-13014335
 ] 

Matt Foley commented on HDFS-1780:
----------------------------------

I reviewed the logic currently in place that decides whether it is necessary to 
write out a new FSImage.  Basically, within FSImage.recoverTransitionRead(), 
one of three methods is called:
* doUpgrade() - always writes out a new FSImage before renaming "tmp" to 
"previous"
* doImportCheckpoint() - always writes out a new FSImage, using the imported 
checkpointTime
* loadFSImage() - requests saveNamespace under any of these conditions:
** if the version file is missing, indicating the directory was just formatted
** if checkpointTime <= 0, indicating an invalid or missing checkpoint
** if more than one checkpointTime was recorded
** if a previously interrupted checkpoint is detected
** if the read-in ImageVersion != the current LAYOUT_VERSION for this code base
** if latestNameCheckpointTime > latestEditsCheckpointTime, indicating the 
edits should be discarded by saving a new image
** if loadFSEdits() > 0, indicating that "edits" or "edits.new" existed and 
had ANY edit records, or had logVersion != the current LAYOUT_VERSION for this 
code base
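The conditions above could be sketched as a single predicate. This is a 
hypothetical simplification, not the actual FSImage code; the method and 
parameter names are invented for illustration, and the LAYOUT_VERSION value 
is an example only:

```java
// Hypothetical condensation of the saveNamespace decision in
// FSImage.loadFSImage(); all names here are illustrative, not the real API.
public class ImageSaveDecision {

    static final int LAYOUT_VERSION = -24; // example value, not authoritative

    static boolean needsResave(boolean versionFileMissing,
                               long checkpointTime,
                               int distinctCheckpointTimes,
                               boolean interruptedCheckpointFound,
                               int imageVersion,
                               long nameCheckpointTime,
                               long editsCheckpointTime,
                               long editsApplied) {
        if (versionFileMissing) return true;             // directory just formatted
        if (checkpointTime <= 0) return true;            // invalid/missing checkpoint
        if (distinctCheckpointTimes > 1) return true;    // inconsistent checkpoint times
        if (interruptedCheckpointFound) return true;     // prior checkpoint died midway
        if (imageVersion != LAYOUT_VERSION) return true; // layout version mismatch
        if (nameCheckpointTime > editsCheckpointTime) return true; // stale edits
        if (editsApplied > 0) return true;               // ANY edit records at all
        return false;
    }

    public static void main(String[] args) {
        // Clean startup with zero edit records: no resave needed.
        System.out.println(needsResave(false, 1L, 1, false, LAYOUT_VERSION, 1L, 1L, 0L));
        // A single edit record is enough to force a full image rewrite.
        System.out.println(needsResave(false, 1L, 1, false, LAYOUT_VERSION, 1L, 1L, 1L));
    }
}
```

The last clause is the one the rest of this comment argues about: one edit 
record flips the whole decision.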

It seems to me that only the last item is a problem.  Just because there were 
SOME edit records doesn't mean it is worth delaying startup to write a new 
checkpoint.  However, the current code tolerates only a single roll-over of 
the edits log (from "edits" to "edits.new") and cannot combine two edit logs 
into one, so we can't simply accumulate edits files across multiple startups.

> reduce need to rewrite fsimage on startup
> ------------------------------------------
>
>                 Key: HDFS-1780
>                 URL: https://issues.apache.org/jira/browse/HDFS-1780
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Daryn Sharp
>
> On startup, the namenode will read the fs image, apply edits, then rewrite 
> the fs image.  This takes a non-trivial amount of time for very large 
> directory structures.  Perhaps the namenode should employ some logic to 
> decide that the edits are simple enough that rewriting the image back out 
> to disk isn't warranted.
> A few ideas:
> * Use the size of the edit logs: if the size is below a threshold, assume 
> it's cheaper to reprocess the edit log than to write the image back out.
> * Time the processing of the edits: if the time is below a defined 
> threshold, the image isn't rewritten.
> * Time the reading of the image and the processing of the edits.  Base the 
> decision on the time it would take to write the image (a multiplier 
> applied to the read time?) versus the time it would take to reprocess the 
> edits.  If a certain threshold (perhaps a percentage or the expected time 
> to rewrite) is exceeded, rewrite the image.
> Something along the lines of the last suggestion may allow for defaults 
> that adapt to any cluster size, eliminating the need to keep tweaking a 
> cluster's settings based on its size.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira