[ https://issues.apache.org/jira/browse/HDFS-11383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Manoj Govindassamy reassigned HDFS-11383:
-----------------------------------------
Assignee: Manoj Govindassamy
> String duplication in org.apache.hadoop.fs.BlockLocation
> --------------------------------------------------------
>
> Key: HDFS-11383
> URL: https://issues.apache.org/jira/browse/HDFS-11383
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Misha Dmitriev
> Assignee: Manoj Govindassamy
>
> I am working on Hive performance, investigating the problem of high memory
> pressure when (a) a table consists of a large number (thousands) of partitions
> and (b) multiple queries run against it concurrently. It turns out that a lot
> of memory is wasted due to data duplication. One source of duplicate strings
> is the class org.apache.hadoop.fs.BlockLocation. Its fields, such as storageIds,
> topologyPaths, hosts, and names, may collectively use up to 6% of memory in my
> benchmark, causing (together with other problematic classes) a huge memory
> spike. Of this 6% of memory taken by BlockLocation strings, more than 5% is
> wasted due to duplication.
> I think we need to add calls to String.intern() in the BlockLocation
> constructor, like:
> {code}
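> this.hosts = internStringsInArray(hosts);
> ...
> // Interns each element in place and returns the same array, so the
> // result can be assigned directly to a field.
> private static String[] internStringsInArray(String[] sar) {
>   for (int i = 0; i < sar.length; i++) {
>     sar[i] = sar[i].intern();
>   }
>   return sar;
> }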
> {code}
> String.intern() performs very well starting from JDK 7. I've found some
> articles explaining the progress that was made by the HotSpot JVM developers
> in this area, verified that with benchmarks myself, and finally added quite a
> bit of interning to one of the Cloudera products without any issues.
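
To illustrate the effect described above, here is a minimal standalone sketch. It is not the BlockLocation patch: the class name InternSketch and the host names are made up for the example, and the null handling and defensive copy are assumptions rather than anything specified in the issue. It only shows that two arrays built from equal but distinct String objects end up sharing the same instances after interning.

```java
class InternSketch {

    // Returns a new array whose elements are the interned versions of the
    // input strings. Null arrays and null elements are passed through
    // unchanged (an assumption made for this sketch, not taken from the issue).
    static String[] internStringsInArray(String[] sar) {
        if (sar == null) {
            return null;
        }
        String[] out = new String[sar.length];
        for (int i = 0; i < sar.length; i++) {
            out[i] = (sar[i] == null) ? null : sar[i].intern();
        }
        return out;
    }

    public static void main(String[] args) {
        // new String(...) forces distinct heap objects with equal contents,
        // mimicking host names deserialized separately for each block.
        String[] a = { new String("host1.example.com"), new String("host2.example.com") };
        String[] b = { new String("host1.example.com"), new String("host2.example.com") };

        // Before interning: equal contents, different objects.
        System.out.println(a[0].equals(b[0]) + " " + (a[0] == b[0]));

        String[] ia = internStringsInArray(a);
        String[] ib = internStringsInArray(b);

        // After interning: both arrays reference the same pooled instance.
        System.out.println(ia[0] == ib[0]);
    }
}
```

Once the original arrays are no longer referenced, only one copy of each distinct string needs to stay alive, which is where the memory saving would come from.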
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)