[
https://issues.apache.org/jira/browse/HIVE-17237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16116963#comment-16116963
]
Misha Dmitriev commented on HIVE-17237:
---------------------------------------
This is to save memory and improve performance. String.intern() has always been
the "official" solution to the string duplication problem. However, until JDK 7
it did not scale well, which pushed people to write their own interners based
on WeakHashMap or ConcurrentHashMap. But, as we know, these data structures are
not at all economical in terms of memory: there is an overhead of 32 bytes or
more per interned string. Starting with JDK 7, Sun/Oracle finally paid
attention and made several improvements to String.intern() that greatly
improved its performance. The internal hashtable used by String.intern() is
also preallocated and much more economical in terms of memory. So since JDK 7,
custom string interners have become counterproductive.
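
To illustrate the idea, here is a minimal sketch of interning both keys and
values of a parameters map with plain String.intern(). The class name
InternParams and the helper internMap are hypothetical names made up for this
example; this is not the code from HIVE-17237.01.patch, and the sample keys
and values are only illustrative.

```java
import java.util.HashMap;
import java.util.Map;

public class InternParams {
    // Hypothetical helper (an assumption for illustration, not the patch code):
    // returns a copy of the map whose non-null keys and values are interned.
    static Map<String, String> internMap(Map<String, String> in) {
        if (in == null) {
            return null;
        }
        Map<String, String> out = new HashMap<>(in.size());
        for (Map.Entry<String, String> e : in.entrySet()) {
            String k = e.getKey();
            String v = e.getValue();
            out.put(k == null ? null : k.intern(), v == null ? null : v.intern());
        }
        return out;
    }

    public static void main(String[] args) {
        // new String(...) forces distinct objects with equal contents,
        // mimicking what deserialization produces.
        Map<String, String> a = new HashMap<>();
        a.put(new String("numFiles"), new String("-1"));
        Map<String, String> b = new HashMap<>();
        b.put(new String("numFiles"), new String("-1"));

        Map<String, String> ia = internMap(a);
        Map<String, String> ib = internMap(b);

        // After interning, both maps share a single "-1" instance.
        System.out.println(ia.get("numFiles") == ib.get("numFiles")); // prints "true"
    }
}
```

The same call to intern() would have to be applied wherever a single key-value
pair is added later and after deserialization, otherwise the duplicates come
back.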
> HMS wastes 26.4% of memory due to dup strings in
> metastore.api.Partition.parameters
> -----------------------------------------------------------------------------------
>
> Key: HIVE-17237
> URL: https://issues.apache.org/jira/browse/HIVE-17237
> Project: Hive
> Issue Type: Improvement
> Components: HiveServer2
> Reporter: Misha Dmitriev
> Assignee: Misha Dmitriev
> Attachments: HIVE-17237.01.patch
>
>
> I've analyzed a heap dump from a production Hive installation using jxray
> (www.jxray.com). It turns out that there are a lot of duplicate strings in
> memory, which waste 26.4% of the heap. Most of them come from HashMaps
> referenced by org.apache.hadoop.hive.metastore.api.Partition.parameters.
> Below is the relevant section of the jxray report.
> Looking at Partition.java, I see that somebody has already added code to
> intern keys and values in the parameters table when it is first set up.
> However, key-value pairs added later are not interned, which probably
> explains all these duplicate strings. Also, when a Partition instance is
> deserialized, its parameters are currently not interned at all.
> {code}
> 6. DUPLICATE STRINGS
> Total strings: 3,273,557 Unique strings: 460,390 Duplicate values: 110,232
> Overhead: 3,220,458K (26.4%)
> ....
> ===================================================
> 7. REFERENCE CHAINS FOR DUPLICATE STRINGS
> 2,326,150K (19.1%), 597058 dup strings (36386 unique), 597058 dup backing
> arrays:
> 39949 of "-1", 39088 of "true", 28959 of "8", 20987 of "1", 18437 of "10",
> 9583 of "9", 5908 of "269664", 5691 of "174528", 4598 of "133980", 4598 of
> "BgUGBQgFCAYFCgYIBgUEBgQHBgUGCwYGBwYHBgkKBwYGBggIBwUHBgYGCgUJCQUG ...[length
> 3560]"
> ... and 419200 more strings, of which 36376 are unique
> Also contains one-char strings: 217 of "6", 147 of "7", 91 of "4", 28 of "5",
> 28 of "2", 21 of "0"
> <-- {j.u.HashMap}.values <--
> org.apache.hadoop.hive.metastore.api.Partition.parameters <--
> {j.u.ArrayList} <--
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_result.success
> <-- Java Local
> (org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_result)
> [@6e33618d8,@6eedb9a80,@6eedbad68,@6eedbc788] ... and 3 more GC roots
> 463,060K (3.8%), 119644 dup strings (34075 unique), 119644 dup backing
> arrays:
> 7914 of "true", 7912 of "-1", 6578 of "8", 5606 of "1", 2302 of "10", 1626 of
> "174528", 1223 of "9", 970 of "171680", 837 of "269664", 657 of "133980"
> ... and 84009 more strings, of which 34065 are unique
> Also contains one-char strings: 42 of "7", 31 of "6", 20 of "4", 8 of "5", 5
> of "2", 3 of "0"
> <-- {j.u.HashMap}.values <--
> org.apache.hadoop.hive.metastore.api.Partition.parameters <--
> {j.u.TreeMap}.values <-- Java Local (j.u.TreeMap) [@6f084afa0,@73aac9e68]
> 233,384K (1.9%), 64601 dup strings (27295 unique), 64601 dup backing arrays:
> 4472 of "true", 4173 of "-1", 3798 of "1", 3591 of "8", 813 of "174528", 684
> of "10" ... and 44568 more strings, of which 27285 are unique
> Also contains one-char strings: 305 of "7", 301 of "0", 277 of "4", 146 of
> "6", 29 of "2", 23 of "5", 19 of "9", 2 of "3"
> <-- {j.u.HashMap}.values <--
> org.apache.hadoop.hive.metastore.api.Partition.parameters <--
> {j.u.ArrayList} <-- Java Local (j.u.ArrayList)
> [@4f4cfbd10,@536122408,@726616778]
> ...
> 52,916K (0.4%), 597058 dup strings (16 unique), 597058 dup backing arrays:
> <-- {j.u.HashMap}.keys <--
> org.apache.hadoop.hive.metastore.api.Partition.parameters <--
> {j.u.ArrayList} <--
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_result.success
> <-- Java Local
> (org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_result)
> [@6e33618d8,@6eedb9a80,@6eedbad68,@6eedbc788] ... and 3 more GC roots
> {code}