Stephen O'Donnell created HDFS-14617:
----------------------------------------

             Summary: Improve fsimage load time by writing sub-sections to the 
fsimage index
                 Key: HDFS-14617
                 URL: https://issues.apache.org/jira/browse/HDFS-14617
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: namenode
            Reporter: Stephen O'Donnell
            Assignee: Stephen O'Donnell


Loading an fsimage is basically a single threaded process. The current fsimage 
is written out in sections, eg iNode, iNode_Directory, Snapshots, Snapshot_Diff 
etc. Then at the end of the file, an index is written that contains the offset 
and length of each section. The image loader code uses this index to initialize 
an input stream to read and process each section. It is important that one 
section is fully loaded before another is started, as the next section depends 
on the results of the previous one.

What I would like to propose is the following:

1. When writing the image, we can optionally output sub_sections to the index. 
That way, a given section would effectively be split into several sections, eg:
{code:java}
   inode_section offset 10 length 1000
     inode_sub_section offset 10 length 500
     inode_sub_section offset 510 length 500
     
   inode_dir_section offset 1010 length 1000
     inode_dir_sub_section offset 1010 length 500
     inode_dir_sub_section offset 1010 length 500
{code}
Here you can see we still have the original section index, but then we also 
have sub-section entries that cover the entire section. Then a processor can 
either read the full section in serial, or read each sub-section in parallel.

2. In the Image Writer code, we should set a target number of sub-sections, and 
then based on the total inodes in memory, it will create that many sub-sections 
per major image section. I think the only sections worth doing this for are 
inode, inode_reference, inode_dir and snapshot_diff. All others tend to be 
fairly small in practice.

3. If there are under some threshold of inodes (eg 10M) then don't bother with 
the sub-sections as a serial load only takes a few seconds at that scale.

4. The image loading code can then have a switch to enable 'parallel loading' 
and a 'number of threads' where it uses the sub-sections, or if not enabled 
falls back to the existing logic to read the entire section in serial.

Working with a large image of 316M inodes and 35GB on disk, I have a proof of 
concept of this change working, allowing just inode and inode_dir to be loaded 
in parallel, but I believe inode_reference and snapshot_diff can be make 
parallel with the same technique.

Some benchmarks I have are as follows:
{code:java}
Threads   1     2     3     4 
--------------------------------
inodes    448   290   226   189 
inode_dir 326   211   170   161 
Total     927   651   535   488 (MD5 calculation about 100 seconds)
{code}
The above table shows the time in seconds to load the inode section and the 
inode_directory section, and then the total load time of the image.

With 4 threads using the above technique, we are able to better than half the 
load time of the two sections. With the patch in HDFS-13694 it would take a 
further 100 seconds off the run time, going from 927 seconds to 388, which is a 
significant improvement. Adding more threads beyond 4 has diminishing returns 
as there are some synchronized points in the loading code to protect the in 
memory structures.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org

Reply via email to