[ 
https://issues.apache.org/jira/browse/HADOOP-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12530618
 ] 

Konstantin Shvachko commented on HADOOP-1687:
---------------------------------------------

h3. P.S.

The benchmarks I ran confirm that this change *doubled* the name-node space 
utilization (objects stored per MB of heap) compared to hadoop-0.14.1, 
and *quadrupled* it compared to hadoop-0.13.1. The latter is because crc files 
are no longer part of the name space.

My benchmark runs the name-node with a small, fixed maximum heap size.
For each hadoop version I ran it 3 times, separately for the heap size values 
of -Xmx = 1, 4, and 8 MB.
The data-nodes never send block reports, since report processing requires 
extra memory.
The client part of the benchmark generates new files with small blocks until 
the name-node runs out of memory (OutOfMemoryError).
The benchmark is written in such a way that it keeps the ratio of #files / 
#blocks / #directories close to a real cluster distribution.
The file name length is also set to match the average name length on the real 
cluster.

Table 1 summarizes the benchmark results.
Note that for hadoop-0.13.1 the number of files includes crc files, as reported 
by fsck. To get the "real" number of files one should divide that number by 2.
|| 1 MB  ||hadoop 0.13.1||hadoop 0.14.1||hadoop 0.15||
| files    |  1016  |  816 | 1327 |
| blocks| 1268 | 1205 | 1478 |
| dirs    |     19 |     28 |    46 |
| total objects |2303| 2049 | 2851 |
|| 4 MB  ||hadoop 0.13.1||hadoop 0.14.1||hadoop 0.15||
| files    |   4914 |  4230 |   8265 |
| blocks|  6146 | 6310 | 12287 |
| dirs    |       83 |  137 |     262 |
| total objects | 11143 | 10677 | 20814 |
|| 8 MB  ||hadoop 0.13.1||hadoop 0.14.1||hadoop 0.15||
| files    | 12818 | 11241 | 22039 |
| blocks| 16003 | 16893 | 32969 |
| dirs    |     206 |     358 |     694 |
| total objects | 29027 | 28492 | 55702 |

hadoop14 holds slightly fewer files than hadoop13, which is consistent with 
the introduction of modification time.
The 1 MB heap is not representative because it is too small. The 4 MB and 8 MB 
heaps show the advantage of hadoop15 in all components.
The numbers are consistent with my estimates and expectations.

I also checked the performance of file creation on all three versions.
I ran the same benchmark, but the name-node has no memory restrictions, and 
the client creates exactly 2048 files in 64 directories and measures the 
elapsed time.

Table 2 summarizes the benchmark results:
|| ||hadoop 0.13.1||hadoop 0.14.1||hadoop 0.15||
|time in sec |  354 | 186 | 269 |
|  files per sec |   6 |  11 |   8 |
| blocks per sec |  8 | 16 | 12 |
| objects per sec | 14 | 27 | 20 |
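The per-second rates above follow from the fixed workload of 2048 files; a quick arithmetic sketch, using only the numbers in the table:

```java
// Quick check of the files-per-second row: 2048 files divided by elapsed time.
public class ThroughputCheck {
    public static void main(String[] args) {
        int files = 2048;
        int[] elapsed = {354, 186, 269}; // hadoop-0.13.1, 0.14.1, 0.15
        for (int t : elapsed) {
            // Rounded rate, matching the table: 6, 11, 8 files/sec.
            System.out.println(t + " sec -> "
                    + Math.round((double) files / t) + " files/sec");
        }
    }
}
```

The blocks- and objects-per-second rows scale the same way with the block-to-file ratio of the workload.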

hadoop15 is faster than hadoop13, but slower than hadoop14.
There is room for optimization here.

> Name-node memory size estimates and optimization proposal.
> ----------------------------------------------------------
>
>                 Key: HADOOP-1687
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1687
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>    Affects Versions: 0.13.1
>            Reporter: Konstantin Shvachko
>            Assignee: Konstantin Shvachko
>             Fix For: 0.15.0
>
>         Attachments: DNBlockList.patch
>
>
> I've done some estimates on how much space our data structures take on the 
> name-node per block, file and directory.
> Brief overview of the data structures:
> Directory tree (FSDirectory) is built of inodes. Each INode points either to 
> an array of blocks
> if it corresponds to a file or to a TreeMap<String, INode> of children INodes 
> if it is a directory.
> [Note: these estimates were made before Dhruba replaced the children TreeMap 
> with an ArrayList.]
> Each block also participates in at least 2 more data structures.
> BlocksMap contains a HashMap<Block, BlockInfo> of all blocks mapping a Block 
> into a BlockInfo.
> DatanodeDescriptor contains a TreeMap<Block, Block> of all blocks belonging 
> to this data-node.
> A block may or may not be contained also in other data-structures, like
> {code}
> UnderReplicatedBlocks
> PendingReplicationBlocks
> recentInvalidateSets
> excessReplicateMap
> {code}
> Presence of a block in any of these structures is temporary and therefore I 
> do not count them in my estimates.
> The estimates can be viewed as lower bounds.
> These are the main classes we are looking at here:
> {code}
> class INode {
>    String name;
>    INode parent;
>    TreeMap<String, INode> children;
>    Block blocks[];
>    short blockReplication;
>    long modificationTime;
> }
> class Block {
>    long blkid;
>    long len;
> }
> class BlockInfo {
>    FSDirectory.INode    inode;
>    DatanodeDescriptor[] nodes;
>    Block                block;
> }
> {code}
> The calculations are made for a 64-bit Java VM, assuming:
> Reference size = 8 bytes
> Object header size = 16 bytes
> Array header size = 24 bytes
> Commonly used objects:
> TreeMap.Entry = 64 bytes. It has 5 reference fields
> HashMap.Entry = 48 bytes. It has 3 reference fields
> String header = 64 bytes.
> The size of a file includes:
> # Size of an empty file INode: INode.children = null, INode.blocks is a 
> 0-length array, and file name is empty. (152 bytes)
> # A directory entry of the parent INode that points to this file, which is a 
> TreeMap.Entry. (64 bytes)
> # File name length times 2, because String represents each name character 
> with 2 bytes.
> # Reference to the outer FSDirectory class (8 bytes)
> The total: 224 + 2 * fileName.length
> The size of a directory includes:
> # Size of an empty directory INode: INode.children is an empty TreeMap, 
> INode.blocks = null, and file name is empty. (192 bytes)
> # A directory entry of the parent INode that points to this directory, which 
> is a TreeMap.Entry. (64 bytes)
> # File name length times 2.
> # Reference to the outer FSDirectory class (8 bytes)
> The total: 264 + 2 * fileName.length
> The size of a block includes:
> # Size of Block. (32 bytes)
> # Size of BlockInfo. (64 + 8*replication bytes)
> # Reference to the block from INode.blocks (8 bytes)
> # HashMap.Entry referencing the block from BlocksMap. (48 bytes)
> # References to the block from all DatanodeDescriptors it belongs to.
> This is a TreeMap.Entry size times block replication. (64 * replication)
> The total: 152 + 72 * replication
> Typical object sizes:
> Taking into account that typical file name is 10-15 bytes and our default 
> replication is 3 we can say that typical sizes are
> File size: 250
> Directory size: 290
> Block size: 368
> ||Object||size estimate (bytes)||typical size (bytes)||
> |File|224 + 2 * fileName.length|250|
> |Directory|264 + 2 * fileName.length|290|
> |Block|152 + 72 * replication|368|
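The typical sizes in the table follow directly from the formulas, assuming a 13-character file name and the default replication of 3; a quick sketch to verify:

```java
// Per-object size formulas from the table above (64-bit VM assumptions).
public class SizeEstimates {
    static long fileSize(int nameLen)      { return 224 + 2L * nameLen; }
    static long dirSize(int nameLen)       { return 264 + 2L * nameLen; }
    static long blockSize(int replication) { return 152 + 72L * replication; }

    public static void main(String[] args) {
        int nameLen = 13;    // a typical 10-15 byte file name
        int replication = 3; // default block replication
        System.out.println("file: "  + fileSize(nameLen));      // 250 bytes
        System.out.println("dir: "   + dirSize(nameLen));       // 290 bytes
        System.out.println("block: " + blockSize(replication)); // 368 bytes
    }
}
```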
> One of our clusters has
> # Files:  10 600 000
> # Dirs:      310 000
> # Blocks: 13 300 000
> Total Size (estimate): 7.63 GB
> Memory used on the name-node (actual reported by jconsole after gc): 9 GB
> This means that other data structures like NetworkTopology, heartbeats, 
> datanodeMap, Host2NodesMap,
> leases, sortedLeases, and multiple block replication maps occupy ~1.4 GB, 
> which seems pretty high
> and needs to be investigated as well.
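The 7.63 GB figure can be reproduced from the cluster counts and the typical per-object sizes above (GB here meaning 10^9 bytes):

```java
// Reproduce the total name-node estimate from counts x typical sizes.
public class ClusterEstimate {
    public static void main(String[] args) {
        long files = 10_600_000L, dirs = 310_000L, blocks = 13_300_000L;
        long total = files * 250 + dirs * 290 + blocks * 368; // bytes
        System.out.printf("estimate: %.2f GB%n", total / 1e9);          // 7.63 GB
        System.out.printf("unaccounted: %.2f GB%n", 9.0 - total / 1e9); // ~1.4 GB
    }
}
```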
> Based on the above estimates blocks should be the main focus of the name-node 
> memory reduction effort.
> Space used by a block is 50% larger compared to a file, and there are more 
> blocks than files or directories.
> Some ideas on how we can reduce the name-node size without substantially 
> changing the data structures.
> # INode.children should be an ArrayList instead of a TreeMap. Already done in 
> HADOOP-1565. (-48 bytes)
> # Factor the INode class out of FSDirectory into a separate (non-inner) class 
> (-8 bytes)
> # Create base INode class and derive file inode and directory inode classes 
> from the base.
> Directory inodes do not need to contain blocks and replication fields (-16 
> bytes)
> File inodes do not need to contain children field (-8 bytes)
> # String name should be replaced by a mere byte[]. (-(40 + fileName.length) ~ 
> -50 bytes)
> # Eliminate the Block object.
> We should move the Block fields into BlockInfo and completely get rid of the 
> Block object. (-16 bytes)
> # Block object is referenced at least 5 times in our structures for each 
> physical block.
> The number of references should be reduced to just 2. (-24)
> # Remove the name field from INode. The file or directory name is stored in 
> the corresponding directory
> entry and does not need to be duplicated in the INode (-8 bytes)
> # Eliminate INode.parent field. INodes are accessed through the directory 
> tree, and the parent can
> be remembered in a local variable while browsing the tree. There is no need 
> to persistently store
> the parent reference for each object. (-8 bytes)
> # Need to optimize data-node to block map. Currently each DatanodeDescriptor 
> holds a TreeMap of
> blocks contained in the node, and we have an overhead of one TreeMap.Entry 
> per block replica.
> I expect we can reorganize datanodeMap in a way that it stores only 1 or 2 
> references per replica
> instead of an entire TreeMap.Entry. (-48 * replication)
> Note: In general TreeMaps turned out to be very expensive; we should avoid 
> using them where possible,
> or implement a custom map structure that avoids allocating an object for each 
> map entry.
> This is what we will have after all the optimizations
> ||Object||size estimate (bytes)||typical size (bytes)||current typical size 
> (bytes)||
> |File|112 + fileName.length|125|250|
> |Directory|144 + fileName.length|155|290|
> |Block|112 + 24 * replication|184|368|
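Applying the optimized typical sizes to the same cluster counts suggests the overall saving; a sketch using only the two tables above:

```java
// Compare current vs. optimized typical sizes on the same cluster counts.
public class OptimizedEstimate {
    public static void main(String[] args) {
        long files = 10_600_000L, dirs = 310_000L, blocks = 13_300_000L;
        long before = files * 250 + dirs * 290 + blocks * 368;
        long after  = files * 125 + dirs * 155 + blocks * 184;
        System.out.printf("before: %.2f GB, after: %.2f GB, saved: %.0f%%%n",
                before / 1e9, after / 1e9,
                100.0 * (before - after) / before);
    }
}
```

By this estimate the optimizations roughly halve the name-node's per-object memory footprint, consistent with the benchmark at the top of this comment.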

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
