[ https://issues.apache.org/jira/browse/HADOOP-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594019#action_12594019 ]

shv edited comment on HADOOP-3022 at 5/8/08 2:56 PM:
---------------------------------------------------------------------

h2. Estimates.
The name-node startup consists of 4 major steps:
# load fsimage
# load edits
# save new merged fsimage
# process block report

I am estimating the name-node startup time based on an image with 10 million objects.
The estimates are based on my experiments with a real cluster;
the measured numbers are scaled proportionally (by a factor of 4.3) to 10 mln objects.

Under the assumption that each file on average has 1.5 blocks, 10 mln objects
translates into 4 mln files and directories, and 6 mln blocks.
Other cluster parameters are summarized in the following table.

|objects|10 mln|
|files & dirs| 4 mln|
|blocks| 6 mln|
|heap size| 3.275 GB|
|image size| 0.6 GB|
|image load time| 132 sec|
|image save time| 70 sec|
|edits size per day| 0.27 GB|
|edits load time| 84 sec|
|data-nodes| 500|
|blocks per node| 36,000|
|block report processing| 320 sec|
|total startup time| 10 min|

This leads to a total startup time of 10 minutes, out of which
|load fsimage| 22%|
|load edits| 14%|
|save new fsimage| 11%|
|process block reports| 53%|
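
For reference, the breakdown follows directly from the per-step times in the first table. A quick throwaway check of the arithmetic (values copied from the table above; the percentages round to roughly the 22/14/11/53% split):

{code:java}
// Back-of-the-envelope check of the startup breakdown (values from the table above).
public class StartupEstimate {
  public static void main(String[] args) {
    int loadImage = 132, loadEdits = 84, saveImage = 70, blockReports = 320;  // seconds
    int total = loadImage + loadEdits + saveImage + blockReports;             // 606 sec
    System.out.printf("total = %d sec = %.1f min%n", total, total / 60.0);    // ~10 min
    System.out.printf("load fsimage    %.1f%%%n", 100.0 * loadImage    / total); // 21.8%
    System.out.printf("load edits      %.1f%%%n", 100.0 * loadEdits    / total); // 13.9%
    System.out.printf("save fsimage    %.1f%%%n", 100.0 * saveImage    / total); // 11.6%
    System.out.printf("block reports   %.1f%%%n", 100.0 * blockReports / total); // 52.8%
  }
}
{code}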

h2. Optimization.
# VM memory heap parameters play a key role in the startup process, as discussed
in HADOOP-3248.
It is highly recommended to set the initial heap size -Xms close to the maximum
heap size -Xmx, because all of that memory will be used by the name-node anyway
(see the example after this list).
# Optimizing loading and saving is largely a matter of reducing object allocation.
In the case of loading, object allocations are unavoidable, so we need to make
sure there are no intermediate allocations of temporary objects.
In the case of saving, all object allocations should be eliminated, which is done
in HADOOP-3248 (see the sketch after this list).
# During loading, for each file or directory the name-node performs a lookup in
the namespace tree, starting from the root, in order to find the object's parent
directory and then to insert the object into it.
We can take advantage of the fact that in the image the children of a directory
are not interleaved with the children of other directories.
So we can find a parent once and then include all of its children without
repeating the search (see the sketch after this list).
# Block reporting is the single most expensive part of the startup process.
According to my experiments, most of the processing time here goes into first
adding all blocks to the needed-replication queue and then removing them from it.
At the beginning of startup all blocks are guaranteed to be under-replicated,
but most of them, and in the regular case all of them, will later be removed
from that list.
So in a sense we dynamically create a huge temporary structure just to make sure
that all blocks have enough replicas.
In addition, in pre-HADOOP-2606 versions (before release 0.17) the replication
monitor would start processing those under-replicated blocks and try to assign
nodes for copying them.
The structure works fine during regular operation because then it contains only
those blocks that are in fact under-replicated.
Block report processing goes 5 times faster if the addition to the
needed-replication queue is removed (see the sketch after this list).
# We should also reduce the number of block lookups in the blocksMap. I counted
5 lookups just in addStoredBlock(), while only one lookup is necessary, because
the namespace is locked and there is only one thread that modifies it
(see the sketch after this list).
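
For item 1, a minimal illustration of the recommended heap settings, assuming the stock conf/hadoop-env.sh layout; the 3400m value is only an example picked to roughly match the 3.275 GB heap in the table and should be adjusted to the actual cluster:

{code}
# conf/hadoop-env.sh -- make the initial heap equal to the maximum heap for the name-node
export HADOOP_NAMENODE_OPTS="-Xms3400m -Xmx3400m $HADOOP_NAMENODE_OPTS"
{code}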
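
For item 2, a minimal sketch of the allocation pattern the saving path should follow; the class and method names are illustrative only, not the actual FSImage code from HADOOP-3248:

{code:java}
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch: reuse one encoding buffer for all entries instead of allocating a
// temporary byte[] or String per file while saving the image.
class ImageWriterSketch {
  private final byte[] nameBuf = new byte[8 * 1024];   // shared across entries

  void writeName(DataOutputStream out, String path) throws IOException {
    int len = 0;
    for (int i = 0; i < path.length() && len < nameBuf.length; i++) {
      nameBuf[len++] = (byte) path.charAt(i);   // assumes ASCII paths for the sketch
    }
    out.writeShort(len);
    out.write(nameBuf, 0, len);
  }
}
{code}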
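
For item 3, a sketch of the parent-caching idea, assuming children of a directory appear consecutively in the image; the types and helper names below are simplified stand-ins for the real FSDirectory code:

{code:java}
// Sketch: resolve the parent directory once per group of siblings instead of
// walking the namespace tree from the root for every entry.
class LoaderSketch {
  interface Dir { void addChild(String name, Object node); }
  interface Namespace { Dir getDirectory(String path); }   // full lookup from the root

  static class ImageEntry {
    String parentPath;
    String name;
    Object node;
  }

  void load(Namespace ns, ImageEntry[] entries) {
    String cachedParentPath = null;
    Dir cachedParent = null;
    for (ImageEntry e : entries) {                    // entries are grouped by parent
      if (!e.parentPath.equals(cachedParentPath)) {
        cachedParent = ns.getDirectory(e.parentPath); // root-to-leaf lookup, once per dir
        cachedParentPath = e.parentPath;
      }
      cachedParent.addChild(e.name, e.node);          // siblings reuse the cached parent
    }
  }
}
{code}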
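
For item 4, a sketch of deferring the needed-replication bookkeeping while the name-node is still starting up (in safe mode); the names here are hypothetical stand-ins, not the actual FSNamesystem code:

{code:java}
import java.util.LinkedList;
import java.util.Queue;

// Sketch: during startup almost every block briefly looks under-replicated, so
// adding each one to the needed-replication queue only to remove it later is
// wasted work. Defer the check until safe mode is left.
class BlockReportSketch {
  private final Queue<Long> neededReplications = new LinkedList<Long>();
  private volatile boolean inStartupSafeMode = true;

  void blockReceived(long blockId, int liveReplicas, int expectedReplicas) {
    if (liveReplicas >= expectedReplicas) {
      return;                        // fully replicated, nothing to do
    }
    if (inStartupSafeMode) {
      return;                        // defer: more reports for this block are coming
    }
    neededReplications.add(blockId); // normal-operation path
  }

  void leaveSafeMode() {
    inStartupSafeMode = false;
    // A single pass over the blocks map can now queue the few blocks that are
    // still genuinely under-replicated.
  }
}
{code}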
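
For item 5, a sketch of the single-lookup pattern: fetch the block record once at the top of the method and pass it to the helpers, instead of querying the map again at each step; BlockInfo and blocksMap below are simplified placeholders:

{code:java}
import java.util.HashMap;
import java.util.Map;

// Sketch: one blocksMap lookup per reported block. With the namespace lock held
// and a single modifying thread, the fetched record cannot go stale in the method.
class BlocksMapSketch {
  static class BlockInfo { int replicas; }

  private final Map<Long, BlockInfo> blocksMap = new HashMap<Long, BlockInfo>();

  void addStoredBlock(long blockId) {
    BlockInfo info = blocksMap.get(blockId);  // the only lookup
    if (info == null) {
      info = new BlockInfo();
      blocksMap.put(blockId, info);
    }
    info.replicas++;                          // reuse 'info' below...
    updateReplicationState(info);             // ...instead of looking the block up again
  }

  private void updateReplicationState(BlockInfo info) {
    // placeholder for the follow-up bookkeeping that previously re-queried the map
  }
}
{code}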

> Fast Cluster Restart
> --------------------
>
>                 Key: HADOOP-3022
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3022
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>            Reporter: Robert Chansler
>            Assignee: Konstantin Shvachko
>             Fix For: 0.18.0
>
>
> This item introduces a discussion of how to reduce the time necessary to 
> start a large cluster from tens of minutes to a handful of minutes.
