[ https://issues.apache.org/jira/browse/HDFS-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881489#action_12881489 ]

Dmytro Molkov commented on HDFS-1071:
-------------------------------------

Well, what I mean by the parent thread holding the lock is the following:

the saveNamespace method is synchronized in FSNamesystem, and currently the 
handler thread, while holding this lock, walks the tree N times and writes N 
files. So in effect we already assume that the tree is guarded from all 
modifications by the FSNamesystem lock.

The same is true for the patch, except that in this case the tree is walked by 
N different threads. They operate under the same assumption: while the 
FSNamesystem lock is held the tree is not being modified, and the handler 
thread waits for all worker threads to finish writing their files before 
returning from the section synchronized on FSNamesystem.
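To make that concrete, here is a minimal sketch of the scheme (the names ImageWriter and saveCurrent are illustrative stand-ins, not the actual HDFS code): the handler thread, still inside the synchronized section, forks one worker per storage directory and joins them all before returning.

```java
import java.util.ArrayList;
import java.util.List;

public class ParallelSaveSketch {
    // Illustrative stand-in for the per-directory image writer.
    interface ImageWriter {
        void saveCurrent(String nameDir);
    }

    // Called with the FSNamesystem lock held: the tree cannot change while
    // the workers below walk it, and the lock is not released until every
    // worker has finished writing its copy of the image.
    public static void saveNamespace(ImageWriter writer, List<String> nameDirs)
            throws InterruptedException {
        List<Thread> workers = new ArrayList<>();
        for (String dir : nameDirs) {
            Thread t = new Thread(() -> writer.saveCurrent(dir));
            t.start();
            workers.add(t);
        }
        for (Thread t : workers) {
            t.join(); // handler thread blocks until this copy is on disk
        }
    }
}
```

Since join() is called on every worker before saveNamespace returns, the overall elapsed time is that of the slowest directory rather than the sum over all directories.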

We just deployed this patch internally to our production cluster:

2010-06-22 10:12:59,714 INFO org.apache.hadoop.hdfs.server.common.Storage: 
Image file of size 11906663754 saved in 140 seconds.
2010-06-22 10:13:50,626 INFO org.apache.hadoop.hdfs.server.common.Storage: 
Image file of size 11906663754 saved in 191 seconds.

This saved us 140 seconds on the current image: the two copies were written 
concurrently, so the whole save took 191 seconds instead of the 140 + 191 = 
331 seconds a sequential save would have taken.

As far as both copies being on the same drive is concerned - I guess this patch 
will not give much of an improvement there.
However, I am not sure there is much value in storing two copies of the image 
on the same drive in the first place.
Please correct me if I am wrong, but I thought that multiple copies of the 
image should be stored on different drives to survive a drive failure (or on a 
filer to protect against the machine dying). Storing two copies on the same 
drive only helps with file corruption or accidental deletion, and that seems 
like a weak argument for keeping multiple copies on one physical drive.

I like your approach with one thread doing serialization and others doing 
writes, but it seems a lot more complicated than the one in this patch.
I am simply executing one call in a newborn thread, while the 
serializer-writer approach raises more implementation questions, such as what 
to do with multiple writers that consume their queues at different speeds: you 
cannot grow a queue indefinitely, since the namenode would simply run out of 
memory, but on the other hand you want to feed the faster consumers as quickly 
as possible.
The main benefit I see is doing the serialization of the tree only once, but 
since we are holding the FSNamesystem lock at that time the NameNode doesn't do 
much anyway. The patch is also no worse than what was in place before 
(serialization happened once per image location).
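For what it's worth, a bounded queue is the standard answer to the memory concern above: with java.util.concurrent.ArrayBlockingQueue, put() blocks once a queue is full, so the single serializer is paced by the slowest writer instead of growing queues without bound. A hypothetical sketch of that design (all names here are made up, not code from either patch):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

public class SerializerWriterSketch {
    private static final byte[] POISON = new byte[0]; // end-of-stream marker

    // One serializer thread (the caller) feeds N writer threads through
    // bounded queues; put() applies backpressure when a writer falls behind.
    public static void run(byte[][] records, List<Consumer<byte[]>> sinks,
                           int capacity) throws InterruptedException {
        List<BlockingQueue<byte[]>> queues = new ArrayList<>();
        List<Thread> writers = new ArrayList<>();
        for (Consumer<byte[]> sink : sinks) {
            BlockingQueue<byte[]> q = new ArrayBlockingQueue<>(capacity);
            queues.add(q);
            Thread w = new Thread(() -> {
                try {
                    byte[] chunk;
                    while ((chunk = q.take()) != POISON) {
                        sink.accept(chunk); // e.g. write to this image file
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            w.start();
            writers.add(w);
        }
        // Each record is serialized once, then handed to every writer.
        for (byte[] rec : records) {
            for (BlockingQueue<byte[]> q : queues) {
                q.put(rec); // blocks when this writer's queue is full
            }
        }
        for (BlockingQueue<byte[]> q : queues) {
            q.put(POISON);
        }
        for (Thread w : writers) {
            w.join();
        }
    }
}
```

Note that this simple version hands records to all queues in lockstep, so one slow writer also stalls the fast ones; how to decouple them is exactly the open design question raised above.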

> savenamespace should write the fsimage to all configured fs.name.dir in 
> parallel
> --------------------------------------------------------------------------------
>
>                 Key: HDFS-1071
>                 URL: https://issues.apache.org/jira/browse/HDFS-1071
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: name-node
>            Reporter: dhruba borthakur
>            Assignee: Dmytro Molkov
>         Attachments: HDFS-1071.2.patch, HDFS-1071.3.patch, HDFS-1071.4.patch, 
> HDFS-1071.5.patch, HDFS-1071.patch
>
>
> If you have a large number of files in HDFS, the fsimage file is very big. 
> When the namenode restarts, it writes a copy of the fsimage to all 
> directories configured in fs.name.dir. This takes a long time, especially if 
> there are many directories in fs.name.dir. Make the NN write the fsimage to 
> all these directories in parallel.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
