[ 
https://issues.apache.org/jira/browse/HDFS-14617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16883604#comment-16883604
 ] 

Stephen O'Donnell commented on HDFS-14617:
------------------------------------------

This image is about 30GB in size, and the md5 calculation, which must read the 
entire 30GB, took about 90 seconds.

Previously, I tested the single-threaded inode load time, but with the code 
that loads the actual inode into the map structure commented out - it just 
reads the image, parses the protobuf and forms the inode object in memory. 
This took about 4 minutes.

So, as a very crude estimate, if the md5 calculation were IO bound (and it is 
usually CPU bound) at 90 seconds, then doing close to the same amount of disk 
reading, but adding in the protobuf parsing and object creation, goes from 90s 
to about 240s - i.e. most of the single-threaded load time is parsing and 
object creation rather than IO.

I will also attach a flame chart that shows the single threaded inode load 
time. Unfortunately I only sampled this with jstack and not the async profiler, 
which would give better output.

In it, you can see the method 
"org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader$2.run", 
which is the root of the image loading, accounting for about 50% of the 
runtime. GC is attributed the other 50%, but I am not sure how accurate that 
is.

Then if you check the line for method 
"org.apache.hadoop.hdfs.server.namenode.FsImageProto$INodeSection$INode.parseDelimitedFrom", 
it is 43.47% of the 50%.

org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINode 
is only 5.32% in comparison.

If you then check the top right of the chart, you can see the time it is 
blocked on Java IO seems to be about 11% of the original 50%. From this, it 
does seem as though protobuf parsing dominates for the inode section.

For inode directory there are more lookups and structures modified after the 
protobuf is loaded, so it is not as constrained by the protobuf loading, but 
it still gets a good speed-up, per my testing above, by doing much of that 
work in parallel (see inode_dir.svg for a single-threaded flame chart of inode 
directory loading).

> Improve fsimage load time by writing sub-sections to the fsimage index
> ----------------------------------------------------------------------
>
>                 Key: HDFS-14617
>                 URL: https://issues.apache.org/jira/browse/HDFS-14617
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>         Attachments: HDFS-14617.001.patch
>
>
> Loading an fsimage is basically a single threaded process. The current 
> fsimage is written out in sections, eg iNode, iNode_Directory, Snapshots, 
> Snapshot_Diff etc. Then at the end of the file, an index is written that 
> contains the offset and length of each section. The image loader code uses 
> this index to initialize an input stream to read and process each section. It 
> is important that one section is fully loaded before another is started, as 
> the next section depends on the results of the previous one.
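> As a rough illustration of that index-driven model (the class and method names 
> here are made up for the sketch, not the actual FSImageFormatProtobuf code), 
> loading one section amounts to seeking to its offset from the index and reading 
> exactly its length:
> {code:java}
> import java.io.IOException;
> import java.io.RandomAccessFile;
> 
> /** Illustrative sketch only: a minimal "read one section" helper, not the real loader. */
> class SectionReader {
>   /** Reads 'length' bytes starting at 'offset' of the image file (int length for simplicity). */
>   static byte[] readSection(RandomAccessFile image, long offset, int length)
>       throws IOException {
>     byte[] buf = new byte[length];
>     image.seek(offset);    // position at the section's offset taken from the index
>     image.readFully(buf);  // read exactly 'length' bytes belonging to this section
>     return buf;
>   }
> }
> {code}
> The real loader streams and parses each section rather than buffering it whole, 
> but the per-section (offset, length) entries in the index are what make any 
> such slicing possible.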
> What I would like to propose is the following:
> 1. When writing the image, we can optionally output sub_sections to the 
> index. That way, a given section would effectively be split into several 
> sections, eg:
> {code:java}
>    inode_section offset 10 length 1000
>      inode_sub_section offset 10 length 500
>      inode_sub_section offset 510 length 500
>      
>    inode_dir_section offset 1010 length 1000
>      inode_dir_sub_section offset 1010 length 500
>      inode_dir_sub_section offset 1510 length 500
> {code}
> Here you can see we still have the original section index, but then we also 
> have sub-section entries that cover the entire section. Then a processor can 
> either read the full section in serial, or read each sub-section in parallel.
> 2. In the Image Writer code, we should set a target number of sub-sections, 
> and then based on the total inodes in memory, the writer will create that many 
> sub-sections per major image section (a rough sketch of how such boundaries 
> could be tracked follows this list). I think the only sections worth doing 
> this for are inode, inode_reference, inode_dir and snapshot_diff. All others 
> tend to be fairly small in practice.
> 3. If there are under some threshold of inodes (eg 10M) then don't bother 
> with the sub-sections as a serial load only takes a few seconds at that scale.
> 4. The image loading code can then have a switch to enable 'parallel loading' 
> and a 'number of threads' setting; when enabled it uses the sub-sections, and 
> otherwise it falls back to the existing logic of reading the entire section in 
> serial.
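> To make points 2 and 3 a little more concrete, here is a minimal sketch (the 
> helper class, its names and the exact index representation are assumptions, 
> not the actual patch) of how the writer could record sub-section boundaries 
> while streaming a section out:
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
> 
> /** Sketch: track sub-section boundaries while one image section is written out. */
> class SubSectionTracker {
>   /** Simple (offset, length) pair standing in for a sub-section index entry. */
>   static class Entry {
>     final long offset, length;
>     Entry(long offset, long length) { this.offset = offset; this.length = length; }
>   }
> 
>   private final long inodesPerSubSection;
>   private final List<Entry> entries = new ArrayList<>();
>   private long inodesWritten = 0;
>   private long subSectionStart;  // file offset where the current sub-section began
> 
>   // Per point 3, the writer would not create a tracker at all when the total
>   // inode count is below the threshold (e.g. 10M) and would just write the
>   // plain section with no sub-section entries.
>   SubSectionTracker(long totalInodes, int targetSubSections, long sectionStartOffset) {
>     this.inodesPerSubSection = Math.max(1, totalInodes / targetSubSections);
>     this.subSectionStart = sectionStartOffset;
>   }
> 
>   /** Called after each inode is serialized, with the current file offset. */
>   void onInodeWritten(long currentOffset) {
>     if (++inodesWritten % inodesPerSubSection == 0) {
>       entries.add(new Entry(subSectionStart, currentOffset - subSectionStart));
>       subSectionStart = currentOffset;
>     }
>   }
> 
>   /** Called once the section is complete, to close out the final sub-section. */
>   List<Entry> finish(long sectionEndOffset) {
>     if (sectionEndOffset > subSectionStart) {
>       entries.add(new Entry(subSectionStart, sectionEndOffset - subSectionStart));
>     }
>     return entries;
>   }
> }
> {code}
> Entries produced this way would sit under the parent section entry in the 
> index, so a serial loader can simply read the full section and ignore them.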
> Working with a large image of 316M inodes and 35GB on disk, I have a proof of 
> concept of this change working, allowing just inode and inode_dir to be 
> loaded in parallel, but I believe inode_reference and snapshot_diff can be 
> made parallel with the same technique.
> Some benchmarks I have are as follows:
> {code:java}
> Threads   1     2     3     4 
> --------------------------------
> inodes    448   290   226   189 
> inode_dir 326   211   170   161 
> Total     927   651   535   488 (MD5 calculation about 100 seconds)
> {code}
> The above table shows the time in seconds to load the inode section and the 
> inode_directory section, and then the total load time of the image.
> With 4 threads using the above technique, we are able to more than halve the 
> load time of the two sections. With the patch in HDFS-13694 it would take a 
> further 100 seconds off the run time, going from 927 seconds to 388, which is 
> a significant improvement. Adding more threads beyond 4 has diminishing 
> returns as there are some synchronized points in the loading code to protect 
> the in-memory structures.
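> For the loading side, a minimal sketch of what a sub-section-aware load loop 
> could look like (the executor setup, interfaces and names below are 
> illustrative assumptions, not the patch itself); the synchronized points 
> mentioned above are what cap the benefit beyond about 4 threads:
> {code:java}
> import java.util.List;
> import java.util.concurrent.CountDownLatch;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> 
> /** Sketch: load each sub-section of one image section on its own thread. */
> class ParallelSectionLoader {
>   /** Hypothetical stand-in for one (offset, length) sub-section index entry. */
>   interface SubSection {
>     long offset();
>     long length();
>   }
> 
>   /** Hypothetical per-range loader; the real code would parse inode protobufs here. */
>   interface RangeLoader {
>     void load(long offset, long length) throws Exception;
>   }
> 
>   static void loadInParallel(List<? extends SubSection> subSections,
>                              RangeLoader loader, int numThreads)
>       throws InterruptedException {
>     ExecutorService pool = Executors.newFixedThreadPool(numThreads);
>     CountDownLatch done = new CountDownLatch(subSections.size());
>     for (SubSection s : subSections) {
>       pool.submit(() -> {
>         try {
>           // Each task opens its own stream at s.offset() and parses s.length()
>           // bytes; inserts into the shared inode map still have to be synchronized.
>           loader.load(s.offset(), s.length());
>         } catch (Exception e) {
>           throw new RuntimeException(e);
>         } finally {
>           done.countDown();
>         }
>       });
>     }
>     done.await();   // the next section must not start until this one fully loads
>     pool.shutdown();
>   }
> }
> {code}
> The latch preserves the ordering requirement from the description: the next 
> major section does not start until every sub-section of the current one has 
> finished loading.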


