[ 
https://issues.apache.org/jira/browse/HDFS-11383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025591#comment-16025591
 ] 

Misha Dmitriev commented on HDFS-11383:
---------------------------------------

Here is an excerpt from another jxray report, obtained from a real-life 
production run of the Impala Catalog Server. It turns out that it wastes 36% of 
its memory on duplicate strings, and at least 20% of that waste comes from 
org.apache.hadoop.fs.BlockLocation data fields. More duplicates come from other 
HDFS classes: org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage and 
org.apache.hadoop.hdfs.protocol.ExtendedBlock.

The second patch that I've just attached adds string interning for the 
ExtendedBlock.poolId data field; a rough sketch of the pattern follows the 
report excerpt below.

{code}
6. DUPLICATE STRINGS

Total strings: 29,734,223  Unique strings: 3,011,444  Duplicate values: 309,566 
 Overhead: 2,388,231K (36.6%)

...

===================================================

7. REFERENCE CHAINS FOR DUPLICATE STRINGS

  391,384K (6.0%), 3340329 dup strings (517 unique), 3340329 dup backing arrays:
     <-- String[] <-- org.apache.hadoop.fs.HdfsBlockLocation.storageIds <--  
{j.u.ArrayList} <-- 
org.apache.impala.catalog.HdfsTable$FileBlocksInfo.locations <--  
{j.u.HashMap}.values <-- Java Local (j.u.HashMap) 
[@631ce3948,@6890e01c8,@6a55b0778,@6ac1310f0] ... and 2 more GC roots (6 
thread(s))
  365,345K (5.6%), 3340329 dup strings (28 unique), 3340329 dup backing arrays:
     <-- org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage.datanodeUuid 
<-- org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[] <-- 
org.apache.hadoop.hdfs.protocol.LocatedBlock.locs <-- 
org.apache.hadoop.fs.HdfsBlockLocation.block <--  {j.u.ArrayList} <-- 
org.apache.impala.catalog.HdfsTable$FileBlocksInfo.locations <--  
{j.u.HashMap}.values <-- Java Local (j.u.HashMap) 
[@631ce3948,@6890e01c8,@6a55b0778,@6ac1310f0] ... and 2 more GC roots (6 
thread(s))
  328,625K (5.0%), 3340328 dup strings (28 unique), 3340328 dup backing arrays:
     <-- String[] <-- org.apache.hadoop.fs.HdfsBlockLocation.hosts <--  
{j.u.ArrayList} <-- 
org.apache.impala.catalog.HdfsTable$FileBlocksInfo.locations <--  
{j.u.HashMap}.values <-- Java Local (j.u.HashMap) 
[@631ce3948,@6890e01c8,@6a55b0778,@6ac1310f0] ... and 2 more GC roots (6 
thread(s))
  313,153K (4.8%), 3340329 dup strings (28 unique), 3340329 dup backing arrays:
     <-- String[] <-- org.apache.hadoop.fs.HdfsBlockLocation.topologyPaths <--  
{j.u.ArrayList} <-- 
org.apache.impala.catalog.HdfsTable$FileBlocksInfo.locations <--  
{j.u.HashMap}.values <-- Java Local (j.u.HashMap) 
[@631ce3948,@6890e01c8,@6a55b0778,@6ac1310f0] ... and 2 more GC roots (6 
thread(s))
  260,961K (4.0%), 3340329 dup strings (28 unique), 3340329 dup backing arrays:
     <-- String[] <-- org.apache.hadoop.fs.HdfsBlockLocation.names <--  
{j.u.ArrayList} <-- 
org.apache.impala.catalog.HdfsTable$FileBlocksInfo.locations <--  
{j.u.HashMap}.values <-- Java Local (j.u.HashMap) 
[@631ce3948,@6890e01c8,@6a55b0778,@6ac1310f0] ... and 2 more GC roots (6 
thread(s))
  208,769K (3.2%), 3340329 dup strings (28 unique), 3340329 dup backing arrays:
     <-- org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage.ipAddr <-- 
org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[] <-- 
org.apache.hadoop.hdfs.protocol.LocatedBlock.locs <-- 
org.apache.hadoop.fs.HdfsBlockLocation.block <--  {j.u.ArrayList} <-- 
org.apache.impala.catalog.HdfsTable$FileBlocksInfo.locations <--  
{j.u.HashMap}.values <-- Java Local (j.u.HashMap) 
[@631ce3948,@6890e01c8,@6a55b0778,@6ac1310f0] ... and 2 more GC roots (6 
thread(s))
  182,674K (2.8%), 3340329 dup strings (2 unique), 3340329 dup backing arrays:
     <-- org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage.location <-- 
org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[] <-- 
org.apache.hadoop.hdfs.protocol.LocatedBlock.locs <-- 
org.apache.hadoop.fs.HdfsBlockLocation.block <--  {j.u.ArrayList} <-- 
org.apache.impala.catalog.HdfsTable$FileBlocksInfo.locations <--  
{j.u.HashMap}.values <-- Java Local (j.u.HashMap) 
[@631ce3948,@6890e01c8,@6a55b0778,@6ac1310f0] ... and 2 more GC roots (6 
thread(s))
  130,481K (2.0%), 1113443 dup strings (1 unique), 1113443 dup backing arrays:
     <-- org.apache.hadoop.hdfs.protocol.ExtendedBlock.poolId <-- 
org.apache.hadoop.hdfs.protocol.LocatedBlock.b <-- 
org.apache.hadoop.fs.HdfsBlockLocation.block <--  {j.u.ArrayList} <-- 
org.apache.impala.catalog.HdfsTable$FileBlocksInfo.locations <--  
{j.u.HashMap}.values <-- Java Local (j.u.HashMap) 
[@631ce3948,@6890e01c8,@6a55b0778,@6ac1310f0] ... and 2 more GC roots (6 
thread(s))
  59,974K (0.9%), 640705 dup strings (1000 unique), 640705 dup backing arrays:
     <--  {j.u.HashMap}.keys <--  {j.u.HashMap}.values <-- 
org.apache.impala.catalog.HdfsTable.perPartitionFileDescMap_ <-- Java Local 
(org.apache.impala.catalog.HdfsTable) 
[@631ce38b0,@6890e0130,@6a55b06e0,@6ac131058] ... and 2 more GC roots (6 
thread(s))
  24,437K (0.4%), 252362 dup strings (14230 unique), 252362 dup backing arrays:
     <--  {j.u.HashMap}.keys <--  {j.u.HashMap}.values <-- 
org.apache.impala.catalog.HdfsTable.perPartitionFileDescMap_ <--  
{java.util.concurrent.ConcurrentHashMap}.values <-- 
org.apache.impala.catalog.CatalogObjectCache.metadataCache_ <-- 
org.apache.impala.catalog.Db.tableCache_ <-- Java Local 
(org.apache.impala.catalog.Db) [@6015b2e90,@601c41930] (2 thread(s))
{code}
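
For illustration, the interning pattern the second patch applies to 
ExtendedBlock.poolId looks roughly like the sketch below. The class shape here 
is simplified and hypothetical; the actual change is in HDFS-11383.02.patch.

{code}
// Simplified, hypothetical shape; illustration only. See HDFS-11383.02.patch
// for the real change to org.apache.hadoop.hdfs.protocol.ExtendedBlock.
public class ExtendedBlockSketch {
  private final String poolId;

  public ExtendedBlockSketch(String poolId) {
    // String.intern() returns the canonical copy from the JVM string table,
    // so the 1,113,443 duplicate pool id strings reported above collapse
    // into a single shared instance.
    this.poolId = (poolId != null) ? poolId.intern() : null;
  }

  public String getPoolId() {
    return poolId;
  }
}
{code}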

> String duplication in org.apache.hadoop.fs.BlockLocation
> --------------------------------------------------------
>
>                 Key: HDFS-11383
>                 URL: https://issues.apache.org/jira/browse/HDFS-11383
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev
>         Attachments: HDFS-11383.01.patch, HDFS-11383.02.patch, hs2-crash-2.txt
>
>
> I am working on Hive performance, investigating the problem of high memory 
> pressure when (a) a table consists of a high number (thousands) of partitions 
> and (b) multiple queries run against it concurrently. It turns out that a lot 
> of memory is wasted due to data duplication. One source of duplicate strings 
> is class org.apache.hadoop.fs.BlockLocation. Its fields such as storageIds, 
> topologyPaths, hosts, names, may collectively use up to 6% of memory in my 
> benchmark, causing (together with other problematic classes) a huge memory 
> spike. Of these 6% of memory taken by BlockLocation strings, more than 5% are 
> wasted due to duplication.
> I think we need to add calls to String.intern() in the BlockLocation 
> constructor, like:
> {code}
> this.hosts = internStringsInArray(hosts);
> ...
> private String[] internStringsInArray(String[] sar) {
>   for (int i = 0; i < sar.length; i++) {
>     sar[i] = sar[i].intern();
>   }
>   return sar;
> }
> {code}
> String.intern() performs very well starting from JDK 7. I've found some 
> articles explaining the progress that was made by the HotSpot JVM developers 
> in this area, verified that with benchmarks myself, and finally added quite a 
> bit of interning to one of the Cloudera products without any issues.


