[
https://issues.apache.org/jira/browse/HDFS-12051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340176#comment-16340176
]
Misha Dmitriev commented on HDFS-12051:
---------------------------------------
Thank you for the review, [~manojg]. See my responses inline below.
{{NameCache.java}}
* _line 97: {{cache = new byte[cacheSize][];}} Since this will take up a
contiguous space, we need to restrict the cache size to much lesser size than
your current MAX size of 1 << 30. Your thoughts?_
As you can see from line 78, the cache size is always set from the
configuration, which provides a reasonable default that is much, much smaller
than 1<<30. It's up to the customer to increase this value if they need to. If
they have a huge heap, like 120GB (I've already heard of users approaching
this!), then a cache size of 1<<30 will result in an 8GB contiguous array. With
a heap that big there is nothing wrong with that, if they really need it. And
anyway, if they decide to change this number, I think it's reasonable to expect
them to understand what they are doing.
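Just to make the 8GB figure concrete, here is a quick back-of-the-envelope
sketch (assuming 8-byte object references, i.e. a 64-bit JVM with compressed
oops disabled, which is the case for heaps above ~32GB; the class and variable
names below are made up for illustration):
{code:java}
public class CacheArraySizeEstimate {
  public static void main(String[] args) {
    int maxSlots = 1 << 30;    // hypothetical maximum cache size
    long bytesPerSlot = 8;     // one object reference per slot, 8 bytes without compressed oops
    long arrayBytes = (long) maxSlots * bytesPerSlot;
    // Prints "8 GB": the top-level byte[cacheSize][] array alone would take 8GB
    System.out.println(arrayBytes / (1024L * 1024 * 1024) + " GB");
  }
}
{code}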
_{{#cache}} is now following the {{open addressing}} model. Any reasons why you
moved to this model compared to your initial design?_
My own design for this cache has always been open addressing. The reason is
that this is the most economical model in terms of memory: it uses just one
pointer per cache entry, which is 8 bytes at most. If you use a cache with
collision chains, like in java.util.HashMap, then each entry requires a pointer
plus a separate object (HashMap$Entry). This separate object takes at least 32
bytes, so you end up with at least 40 bytes per entry - five times more!
Now, for a real HashMap, which needs to hold a potentially very large number of
objects, and needs to hold all of them, the collision-chain design may be
justified in some cases. But for our specialized fixed-size cache, which strives
to minimize its own memory overhead, the open addressing design is more
appropriate.
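For reference, the whole idea fits in a few lines. This is only an illustrative
sketch (the class name, hashing and other details are made up, not the exact
patch code), but it shows why open addressing needs just one reference per
entry:
{code:java}
import java.util.Arrays;

/** Illustrative sketch of a fixed-size, open-addressing byte[] intern cache. */
public class ByteArrayCacheSketch {
  // A single flat array of references: ~8 bytes per slot, no per-entry objects.
  private final byte[][] cache;

  public ByteArrayCacheSketch(int size) {
    cache = new byte[size][];   // size is assumed to be a power of two
  }

  /** Returns the cached array equal to {@code name}, or stores {@code name} and returns it. */
  public byte[] put(byte[] name) {
    int slot = (Arrays.hashCode(name) & 0x7fffffff) & (cache.length - 1);
    byte[] cached = cache[slot];
    if (cached != null && Arrays.equals(cached, name)) {
      return cached;            // duplicate: hand back the existing copy
    }
    cache[slot] = name;         // empty slot or collision: (over)write the slot with the new name
    return name;
  }
}
{code}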
* _{{#put()}}_
** _line 118: the first time cache fill .. shouldn't it be a new byte array
name constructed from the passed in name? Why use the same caller passed in
name?_
The goal of this cache is to _avoid_ object duplication as much as possible. If
the caller gives us a name X for which we don't have an existing copy, we just
remember X and return it. If on the next invocation they give us Y and it turns
out to have the same contents as X, we return X again, and Y is effectively
dropped and will soon be GCed.
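Concretely, using the hypothetical ByteArrayCacheSketch from the previous
snippet (the file name string is made up):
{code:java}
import java.nio.charset.StandardCharsets;

public class InternDemo {
  public static void main(String[] args) {
    ByteArrayCacheSketch cache = new ByteArrayCacheSketch(1 << 10);
    byte[] x = "part-m-00042".getBytes(StandardCharsets.UTF_8);
    byte[] y = "part-m-00042".getBytes(StandardCharsets.UTF_8);  // same contents, different object

    byte[] first = cache.put(x);    // X not seen before: cached and returned as-is
    byte[] second = cache.put(y);   // equal contents to X: X is returned, Y becomes garbage
    System.out.println(first == x && second == x);               // true
  }
}
{code}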
* _With the {{open addressing}} model, when you overwrite the cache slot with
new names, there could be INodes which are already referring to this name and
are cut from the cache. Though their references are still valid, want to
understand why the preference given to new names compared to the old one._
The preference is given to new names simply because it's the lesser evil. We
already discussed this with [~yzhangal] in the past. First, obviously, when a
cache entry is overwritten, the old INodes just continue to refer to their old
names, i.e. no information is lost. Second, all our solution details stem from
the fact that we don't know in advance how many names we are going to have, and
how many of them will be duplicates. Thus we want a fixed-size cache that is
guaranteed not to waste much memory if there is little duplication, but will
provide a benefit and save a lot of memory if there is considerable duplication.
Now, suppose we have a cache of size 3, and names come in as follows: 'A', 'B',
'C', 'D', 'D', 'D', ... The cache would be full after the first 3 names. If
after that we don't overwrite one of the entries to accommodate 'D', we will not
realize any savings from deduplicating all the subsequent 'D's. To be fair, if
this cache receives something like 'A', 'B', 'C', 'D', 'E', 'F', 'A', 'B', 'C',
'D', 'E', 'F' - then it just gets overwritten all the time and provides no
benefit. But in practice (and I have already implemented a similar cache in
several other projects), I've never observed such a pathology. With a
reasonably sized cache and real-life data, it always works.
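To illustrate with the hypothetical ByteArrayCacheSketch above (a one-slot cache
forces every name into the same slot, so the overwrite-on-collision behavior is
easy to see):
{code:java}
public class OverwriteDemo {
  public static void main(String[] args) {
    ByteArrayCacheSketch c = new ByteArrayCacheSketch(1);  // single slot: every name collides
    byte[] d1 = {'D'};
    byte[] d2 = {'D'};
    c.put(new byte[] {'A'});              // stored in the only slot
    c.put(new byte[] {'B'});              // overwrites 'A'
    c.put(new byte[] {'C'});              // overwrites 'B'
    c.put(d1);                            // overwrites 'C'
    System.out.println(c.put(d2) == d1);  // true: all subsequent 'D's are deduplicated
  }
}
{code}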
* _I don't see any cache invalidation even when the INodes are removed. This
takes up memory. Though not huge, design wise its not clean to leave the cache
with stale values and incur cache lookup penalty in the future put()_
This cache by default takes just 16MB, which is 0.1% of 16GB, and 16GB is on the
smaller side of the NN heap size spectrum. So any losses due to stale cache
entries are pretty negligible. Furthermore, the above-mentioned overwriting of
cache entries as new data comes in also helps to keep the cache reasonably
"fresh".
* _{{#getSize()}} since there is no cache invalidation, and since this open
addressing model, the size returned is not right._
As the javadoc for this method explains, this method may return a slightly
incorrect result because of races, and is intended to be used only for
debugging/monitoring. I can rename it to getApproximateSize() or remove it
completely. What would you suggest?
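For context, in the hypothetical ByteArrayCacheSketch above such a method would
boil down to an unsynchronized scan over the slots, which is why a concurrent
put() can make the count slightly stale (again a sketch, not the patch code):
{code:java}
/** Approximate number of occupied slots; may be slightly off if put() runs concurrently. */
public int getApproximateSize() {
  int count = 0;
  for (byte[] slot : cache) {   // no locking: a racing put() may not be visible yet
    if (slot != null) {
      count++;
    }
  }
  return count;
}
{code}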
* _line 149: {{cacheSizeFor}} is this roundUp or roundDown to the nearest 2
power. Please add the link to {{HashMap#tableSizeFor()}} in the comment to show
where the code is inspired from._
Good observation. I've tested it, and it always rounds up. Updated the javadoc
per your suggestion.
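For reference, the rounding is the standard bit-twiddling trick from
java.util.HashMap#tableSizeFor. A rough standalone sketch (without HashMap's
MAXIMUM_CAPACITY clamp, which the real code would still want):
{code:java}
public class RoundUpDemo {
  /** Rounds the requested size up to the nearest power of two (sketch; no upper-bound clamp). */
  static int cacheSizeFor(int requested) {
    int n = requested - 1;
    n |= n >>> 1;
    n |= n >>> 2;
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;
    return (n < 0) ? 1 : n + 1;
  }

  public static void main(String[] args) {
    System.out.println(cacheSizeFor(1000));  // 1024 (rounded up)
    System.out.println(cacheSizeFor(1024));  // 1024 (already a power of two)
  }
}
{code}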
> Reimplement NameCache in NameNode: Intern duplicate byte[] arrays (mainly
> those denoting file/directory names) to save memory
> -----------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-12051
> URL: https://issues.apache.org/jira/browse/HDFS-12051
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Misha Dmitriev
> Assignee: Misha Dmitriev
> Priority: Major
> Attachments: HDFS-12051.01.patch, HDFS-12051.02.patch,
> HDFS-12051.03.patch, HDFS-12051.04.patch, HDFS-12051.05.patch,
> HDFS-12051.06.patch, HDFS-12051.07.patch
>
>
> When snapshot diff operation is performed in a NameNode that manages several
> million HDFS files/directories, NN needs a lot of memory. Analyzing one heap
> dump with jxray (www.jxray.com), we observed that duplicate byte[] arrays
> result in 6.5% memory overhead, and most of these arrays are referenced by
> {{org.apache.hadoop.hdfs.server.namenode.INodeFileAttributes$SnapshotCopy.name}}
> and {{org.apache.hadoop.hdfs.server.namenode.INodeFile.name}}:
> {code:java}
> 19. DUPLICATE PRIMITIVE ARRAYS
> Types of duplicate objects:
> Ovhd Num objs Num unique objs Class name
> 3,220,272K (6.5%) 104749528 25760871 byte[]
> ....
> 1,841,485K (3.7%), 53194037 dup arrays (13158094 unique)
> 3510556 of byte[17](112, 97, 114, 116, 45, 109, 45, 48, 48, 48, ...), 2228255
> of byte[8](48, 48, 48, 48, 48, 48, 95, 48), 357439 of byte[17](112, 97, 114,
> 116, 45, 109, 45, 48, 48, 48, ...), 237395 of byte[8](48, 48, 48, 48, 48, 49,
> 95, 48), 227853 of byte[17](112, 97, 114, 116, 45, 109, 45, 48, 48, 48, ...),
> 179193 of byte[17](112, 97, 114, 116, 45, 109, 45, 48, 48, 48, ...), 169487
> of byte[8](48, 48, 48, 48, 48, 50, 95, 48), 145055 of byte[17](112, 97, 114,
> 116, 45, 109, 45, 48, 48, 48, ...), 128134 of byte[8](48, 48, 48, 48, 48, 51,
> 95, 48), 108265 of byte[17](112, 97, 114, 116, 45, 109, 45, 48, 48, 48, ...)
> ... and 45902395 more arrays, of which 13158084 are unique
> <--
> org.apache.hadoop.hdfs.server.namenode.INodeFileAttributes$SnapshotCopy.name
> <-- org.apache.hadoop.hdfs.server.namenode.snapshot.FileDiff.snapshotINode
> <-- {j.u.ArrayList} <--
> org.apache.hadoop.hdfs.server.namenode.snapshot.FileDiffList.diffs <--
> org.apache.hadoop.hdfs.server.namenode.snapshot.FileWithSnapshotFeature.diffs
> <-- org.apache.hadoop.hdfs.server.namenode.INode$Feature[] <--
> org.apache.hadoop.hdfs.server.namenode.INodeFile.features <--
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.bc <-- ... (1
> elements) ... <--
> org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap$1.entries <--
> org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.blocks <--
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.blocksMap <--
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockReportProcessingThread.this$0
> <-- j.l.Thread[] <-- j.l.ThreadGroup.threads <-- j.l.Thread.group <-- Java
> Static: org.apache.hadoop.fs.FileSystem$Statistics.STATS_DATA_CLEANER
> 409,830K (0.8%), 13482787 dup arrays (13260241 unique)
> 430 of byte[32](116, 97, 115, 107, 95, 49, 52, 57, 55, 48, ...), 353 of
> byte[32](116, 97, 115, 107, 95, 49, 52, 57, 55, 48, ...), 352 of
> byte[32](116, 97, 115, 107, 95, 49, 52, 57, 55, 48, ...), 350 of
> byte[32](116, 97, 115, 107, 95, 49, 52, 57, 55, 48, ...), 342 of
> byte[32](116, 97, 115, 107, 95, 49, 52, 57, 55, 48, ...), 341 of
> byte[32](116, 97, 115, 107, 95, 49, 52, 57, 55, 48, ...), 341 of
> byte[32](116, 97, 115, 107, 95, 49, 52, 57, 55, 48, ...), 340 of
> byte[32](116, 97, 115, 107, 95, 49, 52, 57, 55, 48, ...), 337 of
> byte[32](116, 97, 115, 107, 95, 49, 52, 57, 55, 48, ...), 334 of
> byte[32](116, 97, 115, 107, 95, 49, 52, 57, 55, 48, ...)
> ... and 13479257 more arrays, of which 13260231 are unique
> <-- org.apache.hadoop.hdfs.server.namenode.INodeFile.name <--
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.bc <--
> org.apache.hadoop.util.LightWeightGSet$LinkedElement[] <--
> org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap$1.entries <--
> org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.blocks <--
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.blocksMap <--
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockReportProcessingThread.this$0
> <-- j.l.Thread[] <--
> org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap$1.entries <--
> org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.blocks <--
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.blocksMap <--
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockReportProcessingThread.this$0
> <-- j.l.Thread[] <-- j.l.ThreadGroup.threads <-- j.l.Thread.group <-- Java
> Static: org.apache.hadoop.fs.FileSystem$Statistics.STATS_DATA_CLEANER
> ....
> {code}
> There are several other places in NameNode code which also produce duplicate
> {{byte[]}} arrays.
> To eliminate this duplication and reclaim memory, we will need to write a
> small class similar to StringInterner, but designed specifically for byte[]
> arrays.