[ https://issues.apache.org/jira/browse/HDFS-11383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025591#comment-16025591 ]
Misha Dmitriev commented on HDFS-11383:
---------------------------------------

Here is an excerpt from another jxray report, obtained from a real-life production run of Impala Catalog Server. It turns out that it wastes 36% of its memory on duplicate strings, and of those at least 20% come from org.apache.hadoop.fs.BlockLocation data fields. More duplicates come from other HDFS classes: org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage and org.apache.hadoop.hdfs.protocol.ExtendedBlock. The second patch that I've just added also interns the ExtendedBlock.poolId data field (a minimal sketch of that kind of change follows the report excerpt below).

{code}
6. DUPLICATE STRINGS

Total strings: 29,734,223  Unique strings: 3,011,444  Duplicate values: 309,566  Overhead: 2,388,231K (36.6%)
...
===================================================

7. REFERENCE CHAINS FOR DUPLICATE STRINGS

391,384K (6.0%), 3340329 dup strings (517 unique), 3340329 dup backing arrays:
 <-- String[] <-- org.apache.hadoop.fs.HdfsBlockLocation.storageIds <-- {j.u.ArrayList}
 <-- org.apache.impala.catalog.HdfsTable$FileBlocksInfo.locations <-- {j.u.HashMap}.values
 <-- Java Local (j.u.HashMap) [@631ce3948,@6890e01c8,@6a55b0778,@6ac1310f0]
 ... and 2 more GC roots (6 thread(s))

365,345K (5.6%), 3340329 dup strings (28 unique), 3340329 dup backing arrays:
 <-- org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage.datanodeUuid
 <-- org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[]
 <-- org.apache.hadoop.hdfs.protocol.LocatedBlock.locs <-- org.apache.hadoop.fs.HdfsBlockLocation.block
 <-- {j.u.ArrayList} <-- org.apache.impala.catalog.HdfsTable$FileBlocksInfo.locations
 <-- {j.u.HashMap}.values <-- Java Local (j.u.HashMap) [@631ce3948,@6890e01c8,@6a55b0778,@6ac1310f0]
 ... and 2 more GC roots (6 thread(s))

328,625K (5.0%), 3340328 dup strings (28 unique), 3340328 dup backing arrays:
 <-- String[] <-- org.apache.hadoop.fs.HdfsBlockLocation.hosts <-- {j.u.ArrayList}
 <-- org.apache.impala.catalog.HdfsTable$FileBlocksInfo.locations <-- {j.u.HashMap}.values
 <-- Java Local (j.u.HashMap) [@631ce3948,@6890e01c8,@6a55b0778,@6ac1310f0]
 ... and 2 more GC roots (6 thread(s))

313,153K (4.8%), 3340329 dup strings (28 unique), 3340329 dup backing arrays:
 <-- String[] <-- org.apache.hadoop.fs.HdfsBlockLocation.topologyPaths <-- {j.u.ArrayList}
 <-- org.apache.impala.catalog.HdfsTable$FileBlocksInfo.locations <-- {j.u.HashMap}.values
 <-- Java Local (j.u.HashMap) [@631ce3948,@6890e01c8,@6a55b0778,@6ac1310f0]
 ... and 2 more GC roots (6 thread(s))

260,961K (4.0%), 3340329 dup strings (28 unique), 3340329 dup backing arrays:
 <-- String[] <-- org.apache.hadoop.fs.HdfsBlockLocation.names <-- {j.u.ArrayList}
 <-- org.apache.impala.catalog.HdfsTable$FileBlocksInfo.locations <-- {j.u.HashMap}.values
 <-- Java Local (j.u.HashMap) [@631ce3948,@6890e01c8,@6a55b0778,@6ac1310f0]
 ... and 2 more GC roots (6 thread(s))

208,769K (3.2%), 3340329 dup strings (28 unique), 3340329 dup backing arrays:
 <-- org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage.ipAddr
 <-- org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[]
 <-- org.apache.hadoop.hdfs.protocol.LocatedBlock.locs <-- org.apache.hadoop.fs.HdfsBlockLocation.block
 <-- {j.u.ArrayList} <-- org.apache.impala.catalog.HdfsTable$FileBlocksInfo.locations
 <-- {j.u.HashMap}.values <-- Java Local (j.u.HashMap) [@631ce3948,@6890e01c8,@6a55b0778,@6ac1310f0]
 ... and 2 more GC roots (6 thread(s))
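To illustrate what such a change looks like, here is a minimal self-contained sketch. The class and constructor below are hypothetical stand-ins, not the actual ExtendedBlock code and not the contents of HDFS-11383.02.patch:

{code}
// Hypothetical stand-in for ExtendedBlock; it only shows where String.intern()
// would be applied to the poolId value.
public class PoolIdInterningSketch {
  private final String poolId;

  public PoolIdInterningSketch(String poolId) {
    // intern() returns the canonical copy from the JVM string pool, so the
    // millions of blocks that belong to the same block pool share a single
    // String instance instead of keeping separate backing char[] arrays.
    this.poolId = (poolId == null) ? null : poolId.intern();
  }

  public String getPoolId() {
    return poolId;
  }
}
{code}

The same null-safe intern-on-write pattern applies to the BlockLocation string arrays (storageIds, hosts, topologyPaths, names) that dominate the reference chains above.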
182,674K (2.8%), 3340329 dup strings (2 unique), 3340329 dup backing arrays:
 <-- org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage.location
 <-- org.apache.hadoop.hdfs.protocol.DatanodeInfoWithStorage[]
 <-- org.apache.hadoop.hdfs.protocol.LocatedBlock.locs <-- org.apache.hadoop.fs.HdfsBlockLocation.block
 <-- {j.u.ArrayList} <-- org.apache.impala.catalog.HdfsTable$FileBlocksInfo.locations
 <-- {j.u.HashMap}.values <-- Java Local (j.u.HashMap) [@631ce3948,@6890e01c8,@6a55b0778,@6ac1310f0]
 ... and 2 more GC roots (6 thread(s))

130,481K (2.0%), 1113443 dup strings (1 unique), 1113443 dup backing arrays:
 <-- org.apache.hadoop.hdfs.protocol.ExtendedBlock.poolId <-- org.apache.hadoop.hdfs.protocol.LocatedBlock.b
 <-- org.apache.hadoop.fs.HdfsBlockLocation.block <-- {j.u.ArrayList}
 <-- org.apache.impala.catalog.HdfsTable$FileBlocksInfo.locations <-- {j.u.HashMap}.values
 <-- Java Local (j.u.HashMap) [@631ce3948,@6890e01c8,@6a55b0778,@6ac1310f0]
 ... and 2 more GC roots (6 thread(s))

59,974K (0.9%), 640705 dup strings (1000 unique), 640705 dup backing arrays:
 <-- {j.u.HashMap}.keys <-- {j.u.HashMap}.values
 <-- org.apache.impala.catalog.HdfsTable.perPartitionFileDescMap_
 <-- Java Local (org.apache.impala.catalog.HdfsTable) [@631ce38b0,@6890e0130,@6a55b06e0,@6ac131058]
 ... and 2 more GC roots (6 thread(s))

24,437K (0.4%), 252362 dup strings (14230 unique), 252362 dup backing arrays:
 <-- {j.u.HashMap}.keys <-- {j.u.HashMap}.values
 <-- org.apache.impala.catalog.HdfsTable.perPartitionFileDescMap_
 <-- {java.util.concurrent.ConcurrentHashMap}.values
 <-- org.apache.impala.catalog.CatalogObjectCache.metadataCache_
 <-- org.apache.impala.catalog.Db.tableCache_
 <-- Java Local (org.apache.impala.catalog.Db) [@6015b2e90,@601c41930] (2 thread(s))
{code}

> String duplication in org.apache.hadoop.fs.BlockLocation
> --------------------------------------------------------
>
>            Key: HDFS-11383
>            URL: https://issues.apache.org/jira/browse/HDFS-11383
>        Project: Hadoop HDFS
>     Issue Type: Improvement
>       Reporter: Misha Dmitriev
>       Assignee: Misha Dmitriev
>    Attachments: HDFS-11383.01.patch, HDFS-11383.02.patch, hs2-crash-2.txt
>
>
> I am working on Hive performance, investigating the problem of high memory pressure when (a) a table consists of a high number (thousands) of partitions and (b) multiple queries run against it concurrently. It turns out that a lot of memory is wasted due to data duplication. One source of duplicate strings is the class org.apache.hadoop.fs.BlockLocation. Its fields such as storageIds, topologyPaths, hosts, and names may collectively use up to 6% of memory in my benchmark, causing (together with other problematic classes) a huge memory spike. Of this 6% of memory taken by BlockLocation strings, more than 5% is wasted due to duplication.
> I think we need to add calls to String.intern() in the BlockLocation constructor, like:
> {code}
> this.hosts = internStringsInArray(hosts);
> ...
> private String[] internStringsInArray(String[] sar) {
>   for (int i = 0; i < sar.length; i++) {
>     sar[i] = sar[i].intern();
>   }
>   return sar;
> }
> {code}
> String.intern() performs very well starting from JDK 7. I've found some articles explaining the progress that the HotSpot JVM developers have made in this area, verified it with benchmarks myself, and finally added quite a bit of interning to one of the Cloudera products without any issues.
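For completeness, a quick standalone check (not part of any patch; the block-pool-id value below is made up) showing why interning removes the duplicate backing arrays that jxray reports:

{code}
public class InternDemo {
  public static void main(String[] args) {
    // Two distinct String objects with identical contents, as produced by
    // deserializing the same pool id or host name many times.
    String a = new String("BP-1234567890-10.0.0.1-1500000000000");
    String b = new String("BP-1234567890-10.0.0.1-1500000000000");

    System.out.println(a == b);                   // false: two objects, two backing arrays
    System.out.println(a.intern() == b.intern()); // true: both resolve to one pooled instance
  }
}
{code}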