[ 
https://issues.apache.org/jira/browse/HDFS-12051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16314358#comment-16314358
 ] 

Misha Dmitriev commented on HDFS-12051:
---------------------------------------

>> ... it's nice to finally see someone other than Yongjun comment on this 
>> patch so many months after I first submitted it.
> It is probably because of the summary and description of this JIRA.

I believe the summary/description still match the code. I just added the calls 
to NameCache.intern() in a couple more files recently. In many applications a 
similar addition, like calls to String.intern(), have to be made in even more 
places to get rid of all duplicate objects.

>> ... I don't think it makes this a major change. ...
> No, this is not true. INode and FSImage/Edit are core components of Namenode.

I agree. What I meant is that this change doesn't affect the functionality and 
its impact is easy to understand. It's not like some complex change to the core 
logic of Namenode.

>> ... Please keep in mind that it doesn't change any code functionality - it 
>> just improves performance by reducing data duplication.
> How about running time? Does this patch also improves it?

The extra code on itself doesn't noticeably affect the running time. However, 
reducing the number of objects and heap size reduces GC pauses. So the net 
effect on running time should be an improvement.

> Intern INOdeFileAttributes$SnapshotCopy.name byte[] arrays to save memory
> -------------------------------------------------------------------------
>
>                 Key: HDFS-12051
>                 URL: https://issues.apache.org/jira/browse/HDFS-12051
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev
>         Attachments: HDFS-12051.01.patch, HDFS-12051.02.patch, 
> HDFS-12051.03.patch, HDFS-12051.04.patch, HDFS-12051.05.patch, 
> HDFS-12051.06.patch
>
>
> When snapshot diff operation is performed in a NameNode that manages several 
> million HDFS files/directories, NN needs a lot of memory. Analyzing one heap 
> dump with jxray (www.jxray.com), we observed that duplicate byte[] arrays 
> result in 6.5% memory overhead, and most of these arrays are referenced by 
> {{org.apache.hadoop.hdfs.server.namenode.INodeFileAttributes$SnapshotCopy.name}}
>  and {{org.apache.hadoop.hdfs.server.namenode.INodeFile.name}}:
> {code}
> 19. DUPLICATE PRIMITIVE ARRAYS
> Types of duplicate objects:
>      Ovhd         Num objs  Num unique objs   Class name
> 3,220,272K (6.5%)   104749528      25760871         byte[]
> ....
>   1,841,485K (3.7%), 53194037 dup arrays (13158094 unique)
> 3510556 of byte[17](112, 97, 114, 116, 45, 109, 45, 48, 48, 48, ...), 2228255 
> of byte[8](48, 48, 48, 48, 48, 48, 95, 48), 357439 of byte[17](112, 97, 114, 
> 116, 45, 109, 45, 48, 48, 48, ...), 237395 of byte[8](48, 48, 48, 48, 48, 49, 
> 95, 48), 227853 of byte[17](112, 97, 114, 116, 45, 109, 45, 48, 48, 48, ...), 
> 179193 of byte[17](112, 97, 114, 116, 45, 109, 45, 48, 48, 48, ...), 169487 
> of byte[8](48, 48, 48, 48, 48, 50, 95, 48), 145055 of byte[17](112, 97, 114, 
> 116, 45, 109, 45, 48, 48, 48, ...), 128134 of byte[8](48, 48, 48, 48, 48, 51, 
> 95, 48), 108265 of byte[17](112, 97, 114, 116, 45, 109, 45, 48, 48, 48, ...)
> ... and 45902395 more arrays, of which 13158084 are unique
>      <-- 
> org.apache.hadoop.hdfs.server.namenode.INodeFileAttributes$SnapshotCopy.name 
> <-- org.apache.hadoop.hdfs.server.namenode.snapshot.FileDiff.snapshotINode 
> <--  {j.u.ArrayList} <-- 
> org.apache.hadoop.hdfs.server.namenode.snapshot.FileDiffList.diffs <-- 
> org.apache.hadoop.hdfs.server.namenode.snapshot.FileWithSnapshotFeature.diffs 
> <-- org.apache.hadoop.hdfs.server.namenode.INode$Feature[] <-- 
> org.apache.hadoop.hdfs.server.namenode.INodeFile.features <-- 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.bc <-- ... (1 
> elements) ... <-- 
> org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap$1.entries <-- 
> org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.blocks <-- 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.blocksMap <-- 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockReportProcessingThread.this$0
>  <-- j.l.Thread[] <-- j.l.ThreadGroup.threads <-- j.l.Thread.group <-- Java 
> Static: org.apache.hadoop.fs.FileSystem$Statistics.STATS_DATA_CLEANER
>   409,830K (0.8%), 13482787 dup arrays (13260241 unique)
> 430 of byte[32](116, 97, 115, 107, 95, 49, 52, 57, 55, 48, ...), 353 of 
> byte[32](116, 97, 115, 107, 95, 49, 52, 57, 55, 48, ...), 352 of 
> byte[32](116, 97, 115, 107, 95, 49, 52, 57, 55, 48, ...), 350 of 
> byte[32](116, 97, 115, 107, 95, 49, 52, 57, 55, 48, ...), 342 of 
> byte[32](116, 97, 115, 107, 95, 49, 52, 57, 55, 48, ...), 341 of 
> byte[32](116, 97, 115, 107, 95, 49, 52, 57, 55, 48, ...), 341 of 
> byte[32](116, 97, 115, 107, 95, 49, 52, 57, 55, 48, ...), 340 of 
> byte[32](116, 97, 115, 107, 95, 49, 52, 57, 55, 48, ...), 337 of 
> byte[32](116, 97, 115, 107, 95, 49, 52, 57, 55, 48, ...), 334 of 
> byte[32](116, 97, 115, 107, 95, 49, 52, 57, 55, 48, ...)
> ... and 13479257 more arrays, of which 13260231 are unique
>      <-- org.apache.hadoop.hdfs.server.namenode.INodeFile.name <-- 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockInfo.bc <-- 
> org.apache.hadoop.util.LightWeightGSet$LinkedElement[] <-- 
> org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap$1.entries <-- 
> org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.blocks <-- 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.blocksMap <-- 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockReportProcessingThread.this$0
>  <-- j.l.Thread[] <-- 
> org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap$1.entries <-- 
> org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.blocks <-- 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.blocksMap <-- 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$BlockReportProcessingThread.this$0
>  <-- j.l.Thread[] <-- j.l.ThreadGroup.threads <-- j.l.Thread.group <-- Java 
> Static: org.apache.hadoop.fs.FileSystem$Statistics.STATS_DATA_CLEANER
> ....
> {code}
> To eliminate this duplication and reclaim memory, we will need to write a 
> small class similar to StringInterner, but designed specifically for byte[] 
> arrays.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to