[ https://issues.apache.org/jira/browse/HDFS-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13044129#comment-13044129 ]
Todd Lipcon commented on HDFS-2003:
-----------------------------------

h3. Test setup

I ran performance tests with an fsimage/edits pair I had from a real-life cluster. The fsimage is about 2GB and has 12.5M files, and the edit log is exactly 2GB (I truncated it with dd to that length). I ran the NN with the following JVM options: -Xms14g -Xmx14g -XX:+UseCompressedOops.

h3. With Parallel (default) GC:

I loaded the edit log from a local SATA disk 3 times each with and without the patch. Without the patch, the logs loaded in 84 seconds (consistent across the 3 runs). With the patch, they loaded in 87 seconds, also consistent across the 3 runs.

h3. With CMS GC:

I then added the JVM option -XX:+UseConcMarkSweepGC, since that is more likely the GC in use on most large clusters.

With the patch: loaded in 86 seconds and incurred 213 young generation collections while loading the edit log, which added up to a total of 2.208 seconds in young gen GC.

Without the patch: 84 seconds, 211 young gen GCs, adding up to 2.174 seconds.

h3. Summary

The patch seems to have a very marginal impact on the amount of time spent in GC, which makes sense since the objects are very short-lived and young-generation GC time is proportional to live object size, not garbage size. The patch seems to have about a 3-4% negative impact on the overall wall clock time of loading the log.

Do you guys think this is acceptable? In most of the clusters I see, edit logs tend to be much smaller than this, and startup time is dominated by loading the image and collecting block reports, not edits replay. So, I tend to think the improved code cleanliness of this patch is worth the perf hit.
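
For reference, the overhead figures can be recomputed from the timings quoted above. The following is only an illustrative sketch: the class and method names (EditLogOverhead, slowdownPercent, avgPauseMillis) are made up for this note and are not part of the patch or of HDFS.

```java
import java.util.Locale;

public final class EditLogOverhead {

    /** Relative slowdown of the patched run versus the baseline, in percent. */
    static double slowdownPercent(double baselineSeconds, double patchedSeconds) {
        return (patchedSeconds - baselineSeconds) / baselineSeconds * 100.0;
    }

    /** Average pause per young-generation collection, in milliseconds. */
    static double avgPauseMillis(double totalGcSeconds, int collections) {
        return totalGcSeconds / collections * 1000.0;
    }

    public static void main(String[] args) {
        // Wall clock times from the comment: 84s baseline, 87s (Parallel) and 86s (CMS) patched.
        System.out.printf(Locale.ROOT, "Parallel GC slowdown: %.1f%%%n", slowdownPercent(84, 87));
        System.out.printf(Locale.ROOT, "CMS GC slowdown:      %.1f%%%n", slowdownPercent(84, 86));

        // Young gen totals under CMS: 2.208s over 213 GCs (patched), 2.174s over 211 GCs (baseline).
        System.out.printf(Locale.ROOT, "Avg pause, patched:   %.2f ms%n", avgPauseMillis(2.208, 213));
        System.out.printf(Locale.ROOT, "Avg pause, baseline:  %.2f ms%n", avgPauseMillis(2.174, 211));
        System.out.printf(Locale.ROOT, "Extra young gen time: %.3f s%n", 2.208 - 2.174);
    }
}
```

With these inputs the slowdown works out to about 3.6% for the Parallel GC runs and about 2.4% for the CMS runs, while the extra young gen GC time is 0.034 seconds, so almost none of the 2-3 second difference is attributable to GC.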
> Separate FSEditLog reading logic from editLog memory state building logic
> -------------------------------------------------------------------------
>
>                 Key: HDFS-2003
>                 URL: https://issues.apache.org/jira/browse/HDFS-2003
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: Edit log branch (HDFS-1073)
>            Reporter: Ivan Kelly
>            Assignee: Ivan Kelly
>             Fix For: Edit log branch (HDFS-1073)
>
>         Attachments: HDFS-2003.diff, HDFS-2003.diff, HDFS-2003.diff
>
>
> Currently FSEditLogLoader has code for reading from an InputStream
> interleaved with code which updates the FSNameSystem and FSDirectory. This
> makes it difficult to read an edit log without having a whole load of other
> objects initialised, which is problematic if you want to do things like count
> how many transactions are in a file etc.
> This patch separates the reading of the stream and the building of the memory
> state.
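
The separation described in the issue can be illustrated with a small sketch. Everything here is hypothetical: the record layout (one opcode byte followed by a long transaction id) and the names EditLogSplitSketch, Op, OpVisitor, readOps and countTransactions are invented for illustration and do not reflect the actual edit log format or the classes touched by HDFS-2003.diff. The point is only that once decoding is isolated from state building, a caller can count transactions without initialising any namespace objects.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;

public final class EditLogSplitSketch {

    /** Hypothetical decoded record: an opcode and its transaction id. */
    static final class Op {
        final byte opcode;
        final long txid;

        Op(byte opcode, long txid) {
            this.opcode = opcode;
            this.txid = txid;
        }
    }

    /** Consumer of decoded ops; a real loader would update in-memory state here. */
    interface OpVisitor {
        void visit(Op op) throws IOException;
    }

    /** Reading side only: decodes ops from the stream and hands each to the visitor. */
    static void readOps(InputStream in, OpVisitor visitor) throws IOException {
        DataInputStream data = new DataInputStream(in);
        while (true) {
            int opcode = data.read(); // -1 signals a clean end of stream
            if (opcode < 0) {
                return;
            }
            long txid = data.readLong();
            visitor.visit(new Op((byte) opcode, txid));
        }
    }

    /** Counts transactions using only the reader, with no namespace state involved. */
    static long countTransactions(InputStream in) throws IOException {
        long[] count = {0};
        readOps(in, op -> count[0]++);
        return count[0];
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny in-memory "edit log" of three records in the hypothetical format.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        for (long txid = 1; txid <= 3; txid++) {
            out.writeByte(0);
            out.writeLong(txid);
        }
        System.out.println(countTransactions(new ByteArrayInputStream(buf.toByteArray()))); // prints 3
    }
}
```

A state-building loader would be just another OpVisitor, which is the kind of extra indirection and short-lived object allocation that the performance numbers in the comment are measuring.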