[ 
https://issues.apache.org/jira/browse/HDFS-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044129#comment-13044129
 ] 

Todd Lipcon commented on HDFS-2003:
-----------------------------------

h3. Test setup
I ran performance tests with an fsimage/edits pair I had from a real-life 
cluster. The fsimage is about 2GB and has 12.5M files, and the edit log is 
exactly 2GB (I truncated it with dd to that length). I ran the NN with the 
following JVM options: -Xms14g -Xmx14g -XX:+UseCompressedOops. 

h3. With Parallel (default) GC:
I loaded the edit log 3 times each with the patch and without the patch from a 
local SATA disk.

Without the patch, the log loaded in 84 seconds (consistent across the three 
runs). With the patch, it loaded in 87 seconds, also consistent across the 
three runs.

h3. With CMS GC:
I then added the JVM option: -XX:+UseConcMarkSweepGC, since that's more likely 
the GC in use on most large clusters.

With the patch: Loaded in 86 seconds and incurred 213 young generation 
collections while loading the edit log, which added up to a total of 2.208 
seconds in young gen GC.
Without the patch: 84 seconds, 211 young gen GCs, adding up to 2.174 seconds.
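
As a rough cross-check of the CMS numbers above, the following Python sketch derives the average young-gen pause and the share of load time spent in young-gen GC for each run. The variable and function names are illustrative only; the counts and times are the ones reported in this comment.

```python
# Figures from the CMS runs reported above; all names here are illustrative.
runs = {
    "with patch": {"wall_s": 86.0, "young_gcs": 213, "young_gc_s": 2.208},
    "without patch": {"wall_s": 84.0, "young_gcs": 211, "young_gc_s": 2.174},
}


def gc_summary(run):
    """Return (average pause in ms, young-gen GC share of wall-clock time in percent)."""
    avg_pause_ms = 1000.0 * run["young_gc_s"] / run["young_gcs"]
    gc_share_pct = 100.0 * run["young_gc_s"] / run["wall_s"]
    return avg_pause_ms, gc_share_pct


for name, run in runs.items():
    avg_ms, share = gc_summary(run)
    print(f"{name}: {avg_ms:.1f} ms per young GC, {share:.1f}% of load time in young GC")
```

Both runs come out at roughly 10 ms per collection and about 2.6% of load time, so the GC cost per collection is essentially unchanged by the patch.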

h3. Summary

The patch seems to have a very marginal impact on the amount of time spent in 
GC, which makes sense since the objects are very short-lived and 
young-generation GC time is proportional to live object size, not garbage 
size. The patch seems to have about a 2-4% negative impact on the overall 
wall-clock time of loading the log (84s to 87s with Parallel GC, 84s to 86s 
with CMS).
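
For reference, the wall-clock slowdown can be worked out directly from the load times reported above. This is only a small arithmetic sketch; the helper name `slowdown_pct` is made up for illustration.

```python
def slowdown_pct(baseline_s, patched_s):
    """Relative increase in load time, in percent, versus the unpatched baseline."""
    return 100.0 * (patched_s - baseline_s) / baseline_s


# Load times in seconds as reported above: (without patch, with patch).
timings = {
    "Parallel GC": (84.0, 87.0),
    "CMS GC": (84.0, 86.0),
}

for gc_name, (baseline, patched) in timings.items():
    print(f"{gc_name}: {slowdown_pct(baseline, patched):.1f}% slower with the patch")
```

This gives about 3.6% with the Parallel collector and about 2.4% with CMS.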


Do you guys think this is acceptable? In most of the clusters I see, edit logs 
tend to be much smaller than this, and startup time is dominated by loading the 
image and collecting block reports, not edits replay. So, I tend to think the 
improved code cleanliness of this patch is worth the perf hit.

> Separate FSEditLog reading logic from editLog memory state building logic
> -------------------------------------------------------------------------
>
>                 Key: HDFS-2003
>                 URL: https://issues.apache.org/jira/browse/HDFS-2003
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: Edit log branch (HDFS-1073)
>            Reporter: Ivan Kelly
>            Assignee: Ivan Kelly
>             Fix For: Edit log branch (HDFS-1073)
>
>         Attachments: HDFS-2003.diff, HDFS-2003.diff, HDFS-2003.diff
>
>
> Currently FSEditLogLoader has code for reading from an InputStream 
> interleaved with code which updates the FSNameSystem and FSDirectory. This 
> makes it difficult to read an edit log without having a whole load of other 
> object initialised, which is problematic if you want to do things like count 
> how many transactions are in a file etc. 
> This patch separates the reading of the stream and the building of the memory 
> state. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
