[jira] Created: (HADOOP-1687) Name-node memory size estimates and optimization proposal.

Konstantin Shvachko (JIRA) Mon, 06 Aug 2007 14:44:22 -0700

Name-node memory size estimates and optimization proposal.
----------------------------------------------------------


                 Key: HADOOP-1687
                 URL: https://issues.apache.org/jira/browse/HADOOP-1687
             Project: Hadoop
          Issue Type: Improvement
          Components: dfs
    Affects Versions: 0.13.1
            Reporter: Konstantin Shvachko
            Assignee: Konstantin Shvachko
             Fix For: 0.15.0


I've done some estimates on how much space our data structures take on the 
name-node per block, file and directory.

Brief overview of the data structures:
Directory tree (FSDirectory) is built of inodes. Each INode points either to an 
array of blocks
if it corresponds to a file or to a TreeMap<String, INode> of children INodes 
if it is a directory.
[Note: this estimates were made before Dhruba replaced the children TreeMap by 
ArrayList.]
Each block participates also in at least 2 more data structures.
BlocksMap contains a HashMap<Block, BlockInfo> of all blocks mapping a Block 
into a BlockInfo.
DatanodeDescriptor contains a TreeMap<Block, Block> of all blocks belonging to 
this data-node.
A block may or may not be contained also in other data-structures, like
{code}
UnderReplicatedBlocks
PendingReplicationBlocks
recentInvalidateSets
excessReplicateMap
{code}
Presence of a block in any of these structures is temporary and therefore I do 
not count them in my estimates.
The estimates can be viewed as lower bounds.

These are some classes that we are looking at here
{code}
class INode {
   String name;
   INode parent;
   TreeMap<String, INode> children;
   Block blocks[];
   short blockReplication;
   long modificationTime;
}

class Block {
   long blkid;
   long len;
}

class BlockInfo {
   FSDirectory.INode       inode;
   DatanodeDescriptor[]   nodes;
   Block                          block;
}
{code}

The calculations are made for a 64-bit java vm based on that
Reference size = 8 bytes
Object header size = 16 bytes
Array header size = 24 bytes

Commonly used objects:
TreeMap.Entry = 64 bytes. It has 5 reference fields
HashMap.Entry = 48 bytes. It has 3 reference fields
String header = 64 bytes.

The size of a file includes:
# Size of an empty file INode: INode.children = null, INode.blocks is a 
0-length array, and file name is empty. (152 bytes)
# A directory entry of the parent INode that points to this file, which is a 
TreeMap.Entry. (64 bytes)
# file name length times 2, because String represents each name character by 2 
bytes.
# Reference to the outer FSDirectory class (8 bytes)
The total: 224 + 2 * fileName.length

The size of a directory includes:
# Size of an empty directory INode: INode.children is an empty TreeMap, 
INode.blocks = null, and file name is empty. (192 bytes)
# A directory entry of the parent INode that points to this file, which is a 
TreeMap.Entry. (64 bytes)
# file name length times 2
# Reference to the outer FSDirectory class (8 bytes)
The total: 264 + 2 * fileName.length

The size of a block includes:
# Size of Block. (32 bytes)
# Size of BlockInfo. (64 + 8*replication bytes)
# Reference to the block from INode.blocks (8 bytes)
# HashMap.Entry referencing the block from BlocksMap. (48 bytes)
# References to the block from all DatanodeDescriptors it belongs to.
This is a TreeMap.Entry size times block replication. (64 * replication)
The total: 152 + 72 * replication

Typical object sizes:
Taking into account that typical file name is 10-15 bytes and our default 
replication is 3 we can say that typical sizes are
File size: 250
Directory size: 290
Block size: 368

||Object||size estimate (bytes)||typical size (bytes)||
|File|224 + 2 * fileName.length|250|
|Directory|264 + 2 * fileName.length|290|
|Block|152 + 72 * replication|368|

One of our clusters has
# Files:  10 600 000
# Dirs:      310 000
# Blocks: 13 300 000
Total Size (estimate): 7,63 GB
Memory used on the name-node (actual reported by jconsole after gc): 9 GB

This means that other data structures like NetworkTopology, heartbeats, 
datanodeMap, Host2NodesMap,
leases, sortedLeases, and multiple block replication maps occupy ~1.4 GB, which 
seems to be pretty high
and need to be investigated as well.

Based on the above estimates blocks should be the main focus of the name-node 
memory reduction effort.
Space used by a block is 50% larger compared to a file, and there is more 
blocks than files or directories.

Some ideas on how we can reduce the name-node size without substantially 
changing the data structures.
# INode.children should be an ArrayList instead of a TreeMap. Already done 
HADOOP-1565. (-48 bytes)
# Factor out the INode class into a separate class (-8 bytes)
# Create base INode class and derive file inode and directory inode classes 
from the base.
Directory inodes do not need to contain blocks and replication fields (-16 
bytes)
File inodes do not need to contain children field (-8 bytes)
# String name should be replaced by a mere byte[]. (-(40 + fileName.length) ~ 
-50 bytes)
# Eliminate the Block object.
We should move Block fields into into BlockInfo and completely get rid of the 
Block object. (-16 bytes)
# Block object is referenced at least 5 times in our structures for each 
physical block.
The number of references should be reduced to just 2. (-24)
# Remove name field from INode. File or directory name is stored in the 
corresponding directory
entry and does need to be duplicated in the INode (-8 bytes)
# Eliminate INode.parent field. INodes are accessed through the directory tree, 
and the parent can
be remembered in a local variable while browsing the tree. There is no need to 
persistently store
the parent reference for each object. (-8 bytes)
# Need to optimize data-node to block map. Currently each DatanodeDescriptor 
holds a TreeMap of
blocks contained in the node, and we have an overhead of one TreeMap.Entry per 
block replica.
I expect we can reorganize datanodeMap in a way that it stores only 1 or 2 
references per replica
instead of an entire TreeMap.Entry. (-48 * replication)

Note: In general TreeMaps turned out to be very expensive, we should avoid 
using them if possible.
Or implement a custom map structure, which would avoid using objects for each 
map entry.


This is what we will have after all the optimizations
||Object||size estimate (bytes)||typical size (bytes)||current typical size 
(bytes)||
|File|112 + fileName.length|125|250|
|Directory|144 + fileName.length|155|290|
|Block|112 + 24 * replication|184|368|


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Created: (HADOOP-1687) Name-node memory size estimates and optimization proposal.

Reply via email to