[
https://issues.apache.org/jira/browse/HDFS-12051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340176#comment-16340176
]
Misha Dmitriev commented on HDFS-12051:
---------------------------------------
Thank you for the review, [~manojg]. See my responses inline below.
{{NameCache.java}}
* _line 97: {{cache = new byte[cacheSize][];}} Since this will take up a
contiguous space, we need to restrict the cache size to much lesser size than
your current MAX size of 1 << 30. Your thoughts?_
As you can see from line 78, the cache size is always set from the
configuration, which provides a reasonable default that is much, much smaller
than 1<<30. It's up to the customer to increase this value if they need to. If
they have a huge heap, like 120GB (I've already heard of users approaching
this!), then a cache size of 1<<30 will result in an 8GB contiguous array. With
a heap that big there is nothing wrong with that, if they really need it. And
anyway, if they decide to change this number, I think it's reasonable to expect
them to understand what they are doing.
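Just to make the 8GB figure concrete, here is a quick back-of-the-envelope
sketch (assuming 8-byte object references, i.e. a 64-bit JVM with compressed
oops disabled, which is the case for heaps above ~32GB; the class and variable
names below are made up for illustration):
{code:java}
public class CacheArraySizeEstimate {
  public static void main(String[] args) {
    int maxSlots = 1 << 30;    // hypothetical maximum cache size
    long bytesPerSlot = 8;     // one object reference per slot, 8 bytes without compressed oops
    long arrayBytes = (long) maxSlots * bytesPerSlot;
    // Prints "8 GB": the top-level byte[cacheSize][] array alone would take 8GB
    System.out.println(arrayBytes / (1024L * 1024 * 1024) + " GB");
  }
}
{code}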
_{{#cache}} is now following the {{open addressing}} model. Any reasons why you
moved to this model compared to your initial design?_
My own design for this cache has always been open addressing. The reason is
that this is the most economical model in terms of memory: it uses just one
pointer per cache entry, which is 8 bytes at most. If you use a cache with
collision chains, like in java.util.HashMap, then each entry requires a pointer
plus a separate object (HashMap$Entry). This separate object takes at least 32
bytes, so you end up with at least 40 bytes per entry - five times more!
Now, for a real HashMap, which needs to hold a potentially very large number of
objects, and needs to hold all of them, the collision-chain design may be
justified in some cases. But for our specialized fixed-size cache, which strives
to minimize its own memory overhead, the open addressing design is more
appropriate.
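For reference, the whole idea fits in a few lines. This is only an illustrative
sketch (the class name, hashing and other details are made up, not the exact
patch code), but it shows why open addressing needs just one reference per
entry:
{code:java}
import java.util.Arrays;

/** Illustrative sketch of a fixed-size, open-addressing byte[] intern cache. */
public class ByteArrayCacheSketch {
  // A single flat array of references: ~8 bytes per slot, no per-entry objects.
  private final byte[][] cache;

  public ByteArrayCacheSketch(int size) {
    cache = new byte[size][];   // size is assumed to be a power of two
  }

  /** Returns the cached array equal to {@code name}, or stores {@code name} and returns it. */
  public byte[] put(byte[] name) {
    int slot = (Arrays.hashCode(name) & 0x7fffffff) & (cache.length - 1);
    byte[] cached = cache[slot];
    if (cached != null && Arrays.equals(cached, name)) {
      return cached;            // duplicate: hand back the existing copy
    }
    cache[slot] = name;         // empty slot or collision: (over)write the slot with the new name
    return name;
  }
}
{code}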
* _{{#put()}}_
** _line 118: the first time cache fill .. shouldn't it be a new byte array
name constructed from the passed in name? Why use the same caller passed in
name?_
The goal of this cache is to _avoid_ object duplication as much as possible. If
the caller gives us a name X for which we don't have an existing copy, we just
remember X and return it. If on the next invocation they give us Y and it turns
out to have the same contents as X, we return X again, and Y is effectively
dropped and will soon be GCed.
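Concretely, using the hypothetical ByteArrayCacheSketch from the previous
snippet (the file name string is made up):
{code:java}
import java.nio.charset.StandardCharsets;

public class InternDemo {
  public static void main(String[] args) {
    ByteArrayCacheSketch cache = new ByteArrayCacheSketch(1 << 10);
    byte[] x = "part-m-00042".getBytes(StandardCharsets.UTF_8);
    byte[] y = "part-m-00042".getBytes(StandardCharsets.UTF_8);  // same contents, different object

    byte[] first = cache.put(x);    // X not seen before: cached and returned as-is
    byte[] second = cache.put(y);   // equal contents to X: X is returned, Y becomes garbage
    System.out.println(first == x && second == x);               // true
  }
}
{code}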
* _With the {{open addressing}} model, when you overwrite the cache slot with
new names, there could be INodes which are already referring to this name and
are cut from the cache. Though their references are still valid, want to
understand why the preference given to new names compared to the old one._
The preference is given to new names simply because it's the lesser evil. We
already discussed this with [~yzhangal] in the past. First, obviously, when a
cache entry is overwritten, the old INodes just continue to refer to their old
names, i.e. no information is lost. Second, all our solution details stem from
the fact that we don't know in advance how many names we are going to have, and
how many of them will be duplicates. Thus we want a fixed-size cache that is
guaranteed not to waste much memory if there is little duplication, but will
provide a benefit and save a lot of memory if there is considerable duplication.
Now, suppose we have a cache of size 3, and names come in as follows: 'A', 'B',
'C', 'D', 'D', 'D', ... The cache would be full after the first 3 names. If
after that we don't overwrite one of the entries to accommodate 'D', we will not
realize any savings from deduplicating all the subsequent 'D's. To be fair, if
this cache receives something like 'A', 'B', 'C', 'D', 'E', 'F', 'A', 'B', 'C',
'D', 'E', 'F' - then it just gets overwritten all the time and provides no
benefit. But in practice (and I have already implemented a similar cache in
several other projects), I've never observed such a pathology. With a
reasonably sized cache and real-life data, it always works.
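To illustrate with the hypothetical ByteArrayCacheSketch above (a one-slot cache
forces every name into the same slot, so the overwrite-on-collision behavior is
easy to see):
{code:java}
public class OverwriteDemo {
  public static void main(String[] args) {
    ByteArrayCacheSketch c = new ByteArrayCacheSketch(1);  // single slot: every name collides
    byte[] d1 = {'D'};
    byte[] d2 = {'D'};
    c.put(new byte[] {'A'});              // stored in the only slot
    c.put(new byte[] {'B'});              // overwrites 'A'
    c.put(new byte[] {'C'});              // overwrites 'B'
    c.put(d1);                            // overwrites 'C'
    System.out.println(c.put(d2) == d1);  // true: all subsequent 'D's are deduplicated
  }
}
{code}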
* _I don't see any cache invalidation even when the INodes are removed. This
takes up memory. Though not huge, design wise its not clean to leave the cache
with stale values and incur cache lookup penalty in the future put()_
This cache by default takes just 16MB, which is 0.1% of 16GB, and 16GB is on the
smaller side of the NN heap size spectrum. So any losses due to stale cache
entries are pretty negligible. Furthermore, the above-mentioned overwriting of
cache entries as new data comes in also helps to keep the cache reasonably
"fresh".
* _{{#getSize()}} since there is no cache invalidation, and since this open
addressing model, the size returned is not right._
As the javadoc for this method explains, this method may return a slightly
incorrect result because of races, and is intended to be used only for
debugging/monitoring. I can rename it to getApproximateSize() or remove it
completely. What would you suggest?
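For context, in the hypothetical ByteArrayCacheSketch above such a method would
boil down to an unsynchronized scan over the slots, which is why a concurrent
put() can make the count slightly stale (again a sketch, not the patch code):
{code:java}
/** Approximate number of occupied slots; may be slightly off if put() runs concurrently. */
public int getApproximateSize() {
  int count = 0;
  for (byte[] slot : cache) {   // no locking: a racing put() may not be visible yet
    if (slot != null) {
      count++;
    }
  }
  return count;
}
{code}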
* _line 149: {{cacheSizeFor}} is this roundUp or roundDown to the nearest 2
power. Please add the link to {{HashMap#tableSizeFor()}} in the comment to show
where the code is inspired from._
Good observation. I've tested it, and it always rounds up. Updated the javadoc
per your suggestion.
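For reference, the rounding is the standard bit-twiddling trick from
java.util.HashMap#tableSizeFor. A rough standalone sketch (without HashMap's
MAXIMUM_CAPACITY clamp, which the real code would still want):
{code:java}
public class RoundUpDemo {
  /** Rounds the requested size up to the nearest power of two (sketch; no upper-bound clamp). */
  static int cacheSizeFor(int requested) {
    int n = requested - 1;
    n |= n >>> 1;
    n |= n >>> 2;
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;
    return (n < 0) ? 1 : n + 1;
  }

  public static void main(String[] args) {
    System.out.println(cacheSizeFor(1000));  // 1024 (rounded up)
    System.out.println(cacheSizeFor(1024));  // 1024 (already a power of two)
  }
}
{code}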
> Reimplement NameCache in NameNode: Intern duplicate byte[] arrays (mainly
> those denoting file/directory names) to save memory
> -----------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-12051
> URL: https://issues.apache.org/jira/browse/HDFS-12051
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Misha Dmitriev
> Assignee: Misha Dmitriev
> Priority: Major
> Attachments: HDFS-12051.01.patch, HDFS-12051.02.patch,
> HDFS-12051.03.patch, HDFS-12051.04.patch, HDFS-12051.05.patch,
> HDFS-12051.06.patch, HDFS-12051.07.patch
>
>
> When snapshot diff operation is performed in a NameNode that manages several
> million HDFS files/directories, NN needs a lot of memory. Analyzing one heap
> dump with jxray (www.jxray.com), we observed that duplicate byte[] arrays
> result in 6.5% memory overhead, and most of these arrays are referenced by
> {{org.apache.hadoop.hdfs.server.namenode.INodeFileAttributes$SnapshotCopy.name}}
> and {{org.apache.hadoop.hdfs.server.namenode.INodeFile.name}}:
> {code:java}
> 19. DUPLICATE PRIMITIVE ARRAYS
> Types of duplicate objects:
> Ovhd Num objs Num unique objs Class name
> 3,220,272K (6.5%) 104749528 25760871 byte[]
> ....
> 1,841,485K (3.7%), 53194037 dup arrays (13158094 unique)
> 3510556 of byte[17](112, 97, 114, 116, 45, 109, 45, 48, 48, 48, ...), 2228255
> of byte[8](48, 48, 48, 48, 48, 48, 95, 48), 357439 of byte[17](112, 97, 114,
> 116, 45, 109, 45, 48, 48, 48, ...), 237395 of byte[8](48, 48, 48, 48, 48, 49,
> 95, 48), 227853 of byte[17](112, 97, 114, 116, 45, 109, 45, 48, 48, 48, ...),
> 179193 of byte[17](112, 97, 114, 116, 45, 109, 45, 48, 48, 48, ...), 169487
> of byte[8](48, 48, 48, 48, 48, 50, 95, 48), 145055 of byte[17](112, 97, 114,
> 116, 45, 109, 45, 48, 48, 48, ...), 128134 of byte[8](48, 48, 48, 48, 48, 51,
> 95, 48), 108265 of byte[17](112, 97, 114, 116, 45, 109, 45, 48, 48, 48, ...)
> ... and 45902395 more arrays, of which 13158084 are unique
> <--
> org.apache.hadoop.hdfs.server.namenode.INodeFileAttributes$SnapshotCopy.name
> <-- org.apache.hadoop.hdfs.server.namenode.snapshot.FileDiff.snapshotINode
> <-- {j.u.ArrayList} <--
> org.apache.hadoop.hdfs.server.namenode.snapshot.FileDiffList.diffs <--
> org.apache.hadoop.hdfs.server.namenode.snapshot.FileWithSnapshotFeature.diffs
> <-- org.apache.hadoop.hdfs.server.namenode.INode$Feature[] <--
> org.apache.hadoop.hdfs.server.namenode.INodeFile.features <--
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.bc <-- ... (1
> elements) ... <--
> org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap$1.entries <--
> org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.blocks <--
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.blocksMap <--
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockReportProcessingThread.this$0
> <-- j.l.Thread[] <-- j.l.ThreadGroup.threads <-- j.l.Thread.group <-- Java
> Static: org.apache.hadoop.fs.FileSystem$Statistics.STATS_DATA_CLEANER
> 409,830K (0.8%), 13482787 dup arrays (13260241 unique)
> 430 of byte[32](116, 97, 115, 107, 95, 49, 52, 57, 55, 48, ...), 353 of
> byte[32](116, 97, 115, 107, 95, 49, 52, 57, 55, 48, ...), 352 of
> byte[32](116, 97, 115, 107, 95, 49, 52, 57, 55, 48, ...), 350 of
> byte[32](116, 97, 115, 107, 95, 49, 52, 57, 55, 48, ...), 342 of
> byte[32](116, 97, 115, 107, 95, 49, 52, 57, 55, 48, ...), 341 of
> byte[32](116, 97, 115, 107, 95, 49, 52, 57, 55, 48, ...), 341 of
> byte[32](116, 97, 115, 107, 95, 49, 52, 57, 55, 48, ...), 340 of
> byte[32](116, 97, 115, 107, 95, 49, 52, 57, 55, 48, ...), 337 of
> byte[32](116, 97, 115, 107, 95, 49, 52, 57, 55, 48, ...), 334 of
> byte[32](116, 97, 115, 107, 95, 49, 52, 57, 55, 48, ...)
> ... and 13479257 more arrays, of which 13260231 are unique
> <-- org.apache.hadoop.hdfs.server.namenode.INodeFile.name <--
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.bc <--
> org.apache.hadoop.util.LightWeightGSet$LinkedElement[] <--
> org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap$1.entries <--
> org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.blocks <--
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.blocksMap <--
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockReportProcessingThread.this$0
> <-- j.l.Thread[] <--
> org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap$1.entries <--
> org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.blocks <--
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.blocksMap <--
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockReportProcessingThread.this$0
> <-- j.l.Thread[] <-- j.l.ThreadGroup.threads <-- j.l.Thread.group <-- Java
> Static: org.apache.hadoop.fs.FileSystem$Statistics.STATS_DATA_CLEANER
> ....
> {code}
> There are several other places in NameNode code which also produce duplicate
> {{byte[]}} arrays.
> To eliminate this duplication and reclaim memory, we will need to write a
> small class similar to StringInterner, but designed specifically for byte[]
> arrays.