[ 
https://issues.apache.org/jira/browse/HDFS-11383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033839#comment-16033839
 ] 

Hudson commented on HDFS-11383:
-------------------------------

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11815 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/11815/])
HDFS-11383. Intern strings in BlockLocation and ExtendedBlock. (wang: rev 
7101477e4726a70ab0eab57c2d4480a04446a8dc)
* (edit) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/StringInterner.java
* (edit) hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/protocol/ExtendedBlock.java
* (edit) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/BlockLocation.java


> Intern strings in BlockLocation and ExtendedBlock
> -------------------------------------------------
>
>                 Key: HDFS-11383
>                 URL: https://issues.apache.org/jira/browse/HDFS-11383
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev
>             Fix For: 2.9.0, 3.0.0-alpha4
>
>         Attachments: HDFS-11383.01.patch, HDFS-11383.02.patch, 
> HDFS-11383.03.patch, HDFS-11383.04.patch, hs2-crash-2.txt
>
>
> I am working on Hive performance, investigating the problem of high memory 
> pressure when (a) a table consists of a high number (thousands) of partitions 
> and (b) multiple queries run against it concurrently. It turns out that a lot 
> of memory is wasted due to data duplication. One source of duplicate strings 
> is the class org.apache.hadoop.fs.BlockLocation. Its fields, such as storageIds, 
> topologyPaths, hosts, and names, may collectively use up to 6% of memory in my 
> benchmark, causing (together with other problematic classes) a huge memory 
> spike. Of the 6% of memory taken by BlockLocation strings, more than 5% is 
> wasted due to duplication.
> I think we need to add calls to String.intern() in the BlockLocation 
> constructor, like:
> {code}
> this.hosts = internStringsInArray(hosts);
> ...
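> // Interns each element in place and returns the same array,
> // so the result can be assigned as in the line above.
> private static String[] internStringsInArray(String[] sar) {
>   for (int i = 0; i < sar.length; i++) {
>     sar[i] = sar[i].intern();
>   }
>   return sar;
> }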
> {code}
> String.intern() performs very well starting from JDK 7. I've found some 
> articles explaining the progress that was made by the HotSpot JVM developers 
> in this area, verified that with benchmarks myself, and finally added quite a 
> bit of interning to one of the Cloudera products without any issues.
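
The effect described in the issue can be seen in a small standalone sketch. This is not the committed patch (which edits StringInterner.java, BlockLocation.java and ExtendedBlock.java and may look different); the class name InternSketch, the null handling, and the host names are assumptions made for illustration only.

```java
final class InternSketch {
  // Hypothetical helper, not the committed Hadoop code: interns each non-null
  // element in place and returns the same array so it can be assigned directly.
  static String[] internStringsInArray(String[] sar) {
    if (sar == null) {
      return null;
    }
    for (int i = 0; i < sar.length; i++) {
      if (sar[i] != null) {
        sar[i] = sar[i].intern();
      }
    }
    return sar;
  }

  public static void main(String[] args) {
    // new String(...) forces distinct objects with equal contents, imitating
    // host names that were deserialized separately for each block.
    String[] hostsA = { new String("host1.example.com"), new String("host2.example.com") };
    String[] hostsB = { new String("host1.example.com"), null };

    System.out.println(hostsA[0] == hostsB[0]); // false: two separate copies

    internStringsInArray(hostsA);
    internStringsInArray(hostsB);

    System.out.println(hostsA[0] == hostsB[0]); // true: one shared instance
  }
}
```

After interning, equal host names in different arrays refer to one String object, so the duplicate copies become eligible for garbage collection. The null checks are an addition for safety in this sketch; the snippet in the issue description assumes non-null input.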



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
