[
https://issues.apache.org/jira/browse/HADOOP-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594019#action_12594019
]
shv edited comment on HADOOP-3022 at 5/8/08 5:01 PM:
---------------------------------------------------------------------
h2. Estimates.
The name-node startup consists of 4 major steps:
# load fsimage
# load edits
# save new merged fsimage
# process block report
I am estimating the name-node startup time for an image containing 10 million objects.
The estimates are based on my experiments with a real cluster;
the measured numbers are scaled proportionally (by a factor of 4.3) to 10 mln objects.
Under the assumption that each file on average has 1.5 blocks, 10 mln objects
translates into 4 mln files and directories, and 6 mln blocks.
Other cluster parameters are summarized in the following table.
|objects|10 mln|
|files & dirs| 4 mln|
|blocks| 6 mln|
|heap size| 3.275 GB|
|image size| 0.6 GB|
|image load time| 132 sec|
|image save time| 70 sec|
|edits size per day| 0.27 GB|
|edits load time| 84 sec|
|data-nodes| 500|
|blocks per node| 36,000|
|block report processing| 320 sec|
|total startup time| 10 min|
This leads to a total startup time of about 10 minutes, of which
|load fsimage| 22%|
|load edits| 14%|
|save new fsimage| 11%|
|process block reports| 53%|
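As a sanity check, the four component times from the table can be summed directly. The small program below just redoes that arithmetic (all numbers are copied from the table above); the computed shares match the percentage breakdown up to rounding (the table's 11% for image saving is 11.6% rounded down).

```java
// Sanity check of the startup-time estimates: the four component times
// from the table above sum to 606 seconds, i.e. roughly 10 minutes.
public class StartupEstimates {
    public static void main(String[] args) {
        int imageLoad = 132;     // load fsimage, seconds
        int editsLoad = 84;      // load edits
        int imageSave = 70;      // save new merged fsimage
        int blockReports = 320;  // process block reports
        int total = imageLoad + editsLoad + imageSave + blockReports;
        System.out.println("total: " + total + " s = ~" + Math.round(total / 60.0) + " min");
        for (int t : new int[] { imageLoad, editsLoad, imageSave, blockReports }) {
            System.out.printf("%3d s -> %.1f%%%n", t, 100.0 * t / total);
        }
    }
}
```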
h2. Optimization.
# VM memory heap parameters play a key role in the startup process, as discussed
in HADOOP-3248.
It is highly recommended to set the initial heap size (-Xms) close to the maximum
heap size (-Xmx), because all that memory will be used by the name-node anyway.
# Optimizing loading and saving is largely a matter of reducing object allocation.
In the case of loading, object allocations are unavoidable, so we need to make sure
there are no intermediate allocations of temporary objects.
In the case of saving, all object allocations should be eliminated, which is done
in HADOOP-3248.
# During loading, for each file or directory the name-node performs a lookup in the
namespace tree starting from the root in order to find the object's parent
directory and then insert the object into it.
We can take advantage of the fact that a directory's children are not interleaved
with the children of other directories in the image.
So we can find a parent once and then include all of its children without
repeating the search.
# Block reporting is the single most expensive part of the startup process.
According to my experiments, most of the processing time goes into first
adding all blocks to the needed-replication queue and then removing them from it.
During startup all blocks are initially under-replicated,
but most of them, and in the regular case all of them, will eventually be removed
from that list.
So in a sense we dynamically create a huge temporary structure just to verify
that all blocks have enough replicas.
In addition, in pre-HADOOP-2606 versions (before release 0.17) the
replication monitor would start processing
those under-replicated blocks and try to assign nodes for copying them.
The structure works fine during regular operation because it then contains only
blocks that are in fact under-replicated.
Block report processing goes 5 times faster if the addition to the
needed-replication queue is removed.
# We should also reduce the number of block lookups in the blocksMap. I counted
5 lookups in addStoredBlock() alone, while
only one lookup is necessary, because the namespace is locked and there is only
one thread that modifies it.
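Item 1's heap recommendation can be applied in conf/hadoop-env.sh. HADOOP_NAMENODE_OPTS is the standard hook for name-node JVM flags; the 3400m value below is only an illustration sized to the 3.275 GB heap estimate above, not a recommendation for every cluster:

```shell
# Illustrative sizing only: start the name-node with the initial heap (-Xms)
# equal to the maximum heap (-Xmx), per HADOOP-3248, so the JVM does not
# grow the heap incrementally while the image is being loaded.
export HADOOP_NAMENODE_OPTS="-Xms3400m -Xmx3400m $HADOOP_NAMENODE_OPTS"
```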
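The allocation-free saving idea in item 2 can be sketched as follows: every entry is written through one preallocated buffer instead of creating temporary objects per entry. The names and the on-disk layout here are purely illustrative, not the real fsimage format.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch of item 2: reuse a single buffer for all entries while saving,
// so the save loop performs no per-entry allocations.
public class SaveSketch {
    private static final byte[] buf = new byte[8 * 1024];  // reused for every entry

    // Encodes an ASCII name into the shared buffer and writes it out,
    // allocating nothing per call (ASCII-only to keep the sketch short).
    static void writeName(String name, DataOutputStream out) throws IOException {
        int len = name.length();
        for (int i = 0; i < len; i++) {
            buf[i] = (byte) name.charAt(i);
        }
        out.writeShort(len);
        out.write(buf, 0, len);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (String name : new String[] { "/a/x", "/a/y", "/b/u" }) {
            writeName(name, out);
        }
        // 3 entries * (2-byte length + 4-byte name) = 18 bytes
        System.out.println("bytes written: " + bytes.size());
    }
}
```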
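The sibling-locality optimization in item 3 amounts to caching the last resolved parent while reading entries. Because children of one directory are contiguous in the image, the expensive root-to-parent traversal runs only when the parent path changes. This is a hedged sketch with illustrative names, not the actual FSImage loading code:

```java
// Sketch of item 3: count how many root-to-parent traversals loading needs
// when the last resolved parent is cached and image entries list the
// children of each directory contiguously.
public class ImageLoadSketch {
    static int countLookups(String[] entries) {
        int lookups = 0;
        String cachedParent = null;
        for (String path : entries) {
            String parent = path.substring(0, path.lastIndexOf('/'));
            if (parent.isEmpty()) parent = "/";
            if (!parent.equals(cachedParent)) {
                lookups++;              // full traversal only when the parent changes
                cachedParent = parent;
            }
            // ...insert the entry into the cached parent here...
        }
        return lookups;
    }

    public static void main(String[] args) {
        String[] entries = { "/a/x", "/a/y", "/a/z", "/b/u", "/b/v" };
        System.out.println("traversals: " + countLookups(entries)
                + " instead of " + entries.length);  // 2 instead of 5
    }
}
```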
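Item 4's fast path can be sketched as a simple guard: while the name-node is still starting up, every reported block is transiently under-replicated, so enqueueing each one on the needed-replication queue only to remove it again is wasted work. The class and field names below are hypothetical stand-ins, not the real FSNamesystem code:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of item 4: skip the needed-replication queue during startup and
// defer the replication check until all block reports are in.
public class BlockReportSketch {
    static final Set<Long> neededReplications = new HashSet<>();
    static boolean inStartup = true;   // e.g. name-node safe mode

    static void reportBlock(long blockId, int live, int expected) {
        if (live >= expected) {
            neededReplications.remove(blockId);
        } else if (!inStartup) {
            // Only queue under-replicated blocks during regular operation,
            // when the queue really does contain just the deficient blocks.
            neededReplications.add(blockId);
        }
    }

    public static void main(String[] args) {
        for (long b = 1; b <= 1000; b++) {
            reportBlock(b, 1, 3);      // first replica of each block arrives
        }
        System.out.println("queued during startup: " + neededReplications.size()); // 0
    }
}
```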
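Finally, the lookup reduction in item 5 is the usual look-up-once pattern: since the namespace lock guarantees a single writer, the result of one blocksMap lookup can be threaded through the helpers instead of each helper re-querying the map. Again, all names here are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of item 5: one blocksMap lookup per block is enough when the
// namespace lock guarantees no concurrent modification.
public class BlocksMapSketch {
    static final Map<Long, String> blocksMap = new HashMap<>();
    static int lookups = 0;

    static String lookup(long id) { lookups++; return blocksMap.get(id); }

    // Before: each processing step does its own lookup (5 in addStoredBlock()).
    static void addStoredBlockNaive(long id) {
        for (int step = 0; step < 5; step++) lookup(id);
    }

    // After: look up once and pass the result to every step.
    static void addStoredBlockOnce(long id) {
        String info = lookup(id);
        // ...validity check, replica-count update, etc. all reuse 'info'...
    }

    public static void main(String[] args) {
        blocksMap.put(42L, "block-info");
        addStoredBlockNaive(42L);
        int naive = lookups;
        lookups = 0;
        addStoredBlockOnce(42L);
        System.out.println(naive + " lookups reduced to " + lookups); // 5 reduced to 1
    }
}
```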
> Fast Cluster Restart
> --------------------
>
> Key: HADOOP-3022
> URL: https://issues.apache.org/jira/browse/HADOOP-3022
> Project: Hadoop Core
> Issue Type: New Feature
> Components: dfs
> Reporter: Robert Chansler
> Assignee: Konstantin Shvachko
> Fix For: 0.18.0
>
>
> This item introduces a discussion of how to reduce the time necessary to
> start a large cluster from tens of minutes to a handful of minutes.