[
https://issues.apache.org/jira/browse/SENTRY-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Misha Dmitriev updated SENTRY-1827:
-----------------------------------
Description:
We obtained a heap dump, taken from the JVM running Hive Metastore at the time
a Sentry HDFS sync operation was performed. I analyzed this dump with jxray
(www.jxray.com) and found that a significant percentage of memory is wasted
due to duplicate strings:
{code}
7. DUPLICATE STRINGS
Total strings: 29,986,017 Unique strings: 9,640,413 Duplicate values:
4,897,743 Overhead: 2,570,746K (9.4%)
{code}
Of these, more than a third come from Sentry:
{code}
917,331K (3.3%), 10517636 dup strings (498477 unique), 10517636 dup backing
arrays:
<-- org.apache.sentry.hdfs.service.thrift.TPathEntry.pathElement <--
{j.u.HashMap}.values <--
org.apache.sentry.hdfs.service.thrift.TPathsDump.nodeMap <--
org.apache.sentry.hdfs.service.thrift.TPathsUpdate.pathsDump <-- Java
Local@7fea0851c360 (org.apache.sentry.hdfs.service.thrift.TPathsUpdate)
{code}
The duplicate strings in memory have been eliminated by SENTRY-1811. However,
when these strings are serialized into the TPathsDump thrift message, they are
duplicated again. That is, if there are 3 different TPathEntry objects with the
same pathElement="foo", then (even though there is only one interned copy of
the "foo" string in memory) a separate copy of "foo" is written to the
serialized message for each of these 3 TPathEntries. This is one reason why
serialized TPathsDump messages may get very big, consume a lot of memory and
take a long time to send over the network.
To address this problem we may use some form of custom compression: instead of
writing multiple copies of a duplicate string, we write the full string once
and substitute later occurrences with a shorter "string id", as sketched below.
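A minimal sketch of such a dictionary-style encoding; StringIdWriter and its
wire format are hypothetical illustrations, not the actual Sentry or thrift
serialization code:
{code}
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical writer: each unique string is transmitted once and implicitly
// assigned an int id; later occurrences send only the 4-byte id.
public class StringIdWriter {
    private final Map<String, Integer> ids = new HashMap<>();

    public void write(DataOutputStream out, String s) throws IOException {
        Integer id = ids.get(s);
        if (id == null) {
            ids.put(s, ids.size());  // next id, assigned in encounter order
            out.writeInt(-1);        // marker: a new string definition follows
            out.writeUTF(s);         // the full string, written exactly once
        } else {
            out.writeInt(id);        // repeated string: only its short id
        }
    }
}
{code}
A matching reader would assign ids in the same encounter order, so no separate
dictionary needs to be transmitted. Under this scheme the 3 TPathEntries above
would cost one copy of "foo" plus two 4-byte ids.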
was:
We obtained a heap dump, taken from the JVM running Hive Metastore at the time
a Sentry HDFS sync operation was performed. I analyzed this dump with jxray
(www.jxray.com) and found that more than 19% of memory is wasted due to empty
or suboptimally-sized Java collections:
{code}
9. BAD COLLECTIONS
Total collections: 54,057,249 Bad collections: 31,569,606 Overhead:
5,292,821K (19.3%)
{code}
Most of these collections come from thrift classes used by the Sentry plugin;
see the listing below. The associated memory waste could be significantly
reduced or eliminated if these collections were allocated lazily, and with an
initial capacity smaller than the default capacity of 16 for HashMap/HashSet
(a sketch follows the listing).
{code}
1,869,023K (6.8%): j.u.HashSet: 3388670 of 1-elem 979,537K (3.6%), 5897806 of
empty 552,919K (2.0%), 1010321 of small 336,566K (1.2%)
<-- org.apache.sentry.hdfs.service.thrift.TPathEntry.children <--
{j.u.HashMap}.values <--
org.apache.sentry.hdfs.service.thrift.TPathsDump.nodeMap <--
org.apache.sentry.hdfs.service.thrift.TPathsUpdate.pathsDump <-- Java
Local@7fea0851c360 (org.apache.sentry.hdfs.service.thrift.TPathsUpdate)
1,190,050K (4.3%): j.u.HashMap: 3382765 of 1-elem 898,546K (3.3%), 1005341 of
small 291,503K (1.1%)
<-- org.apache.sentry.hdfs.HMSPaths$Entry.children <--
org.apache.sentry.hdfs.HMSPaths$Entry.{parent} <-- {j.u.HashSet} <--
{j.u.TreeMap}.values <-- org.apache.sentry.hdfs.HMSPaths.authzObjToPath <--
org.apache.sentry.hdfs.UpdateableAuthzPaths.paths <--
org.apache.sentry.hdfs.MetastorePlugin.authzPaths <-- Java Local@7fe4fe84e030
(org.apache.sentry.hdfs.MetastorePlugin)
969,442K (3.5%): j.u.TreeSet: 5907188 of 1-elem 969,148K (3.5%)
<-- org.apache.sentry.hdfs.service.thrift.TPathEntry.authzObjs <--
{j.u.HashMap}.values <--
org.apache.sentry.hdfs.service.thrift.TPathsDump.nodeMap <--
org.apache.sentry.hdfs.service.thrift.TPathsUpdate.pathsDump <-- Java
Local@7fea0851c360 (org.apache.sentry.hdfs.service.thrift.TPathsUpdate)
487,690K (1.8%): j.u.TreeSet: 4801877 of empty 487,690K (1.8%)
<-- org.apache.sentry.hdfs.HMSPaths$Entry.authzObjs <--
org.apache.sentry.hdfs.HMSPaths$Entry.{parent} <-- {j.u.HashSet} <--
{j.u.TreeMap}.values <-- org.apache.sentry.hdfs.HMSPaths.authzObjToPath <--
org.apache.sentry.hdfs.UpdateableAuthzPaths.paths <--
org.apache.sentry.hdfs.MetastorePlugin.authzPaths <-- Java Local@7fe4fe84e030
(org.apache.sentry.hdfs.MetastorePlugin)
415,064K (1.5%): j.u.HashMap: 5897806 of empty 414,689K (1.5%)
<-- org.apache.sentry.hdfs.HMSPaths$Entry.children <-- {j.u.HashSet} <--
{j.u.TreeMap}.values <-- org.apache.sentry.hdfs.HMSPaths.authzObjToPath <--
org.apache.sentry.hdfs.UpdateableAuthzPaths.paths <--
org.apache.sentry.hdfs.MetastorePlugin.authzPaths <-- Java Local@7fe4fe84e030
(org.apache.sentry.hdfs.MetastorePlugin)
{code}
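A minimal sketch of this lazy-allocation pattern; the Node class and its
members are hypothetical stand-ins for classes like TPathEntry, not the actual
generated code:
{code}
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical node: the children set stays null until the first child
// arrives, and is then sized well below HashSet's default capacity of 16.
public class Node {
    private Set<Node> children;  // allocated lazily

    public void addChild(Node child) {
        if (children == null) {
            children = new HashSet<>(2);  // per the stats above, most nodes
        }                                 // hold zero or one child
        children.add(child);
    }

    public Set<Node> getChildren() {
        // callers get a shared empty view instead of forcing an allocation
        return children == null ? Collections.<Node>emptySet() : children;
    }
}
{code}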
Additionally, a significant percentage of memory is wasted due to duplicate
strings:
{code}
7. DUPLICATE STRINGS
Total strings: 29,986,017 Unique strings: 9,640,413 Duplicate values:
4,897,743 Overhead: 2,570,746K (9.4%)
{code}
Of these, more than a third come from Sentry:
{code}
917,331K (3.3%), 10517636 dup strings (498477 unique), 10517636 dup backing
arrays:
<-- org.apache.sentry.hdfs.service.thrift.TPathEntry.pathElement <--
{j.u.HashMap}.values <--
org.apache.sentry.hdfs.service.thrift.TPathsDump.nodeMap <--
org.apache.sentry.hdfs.service.thrift.TPathsUpdate.pathsDump <-- Java
Local@7fea0851c360 (org.apache.sentry.hdfs.service.thrift.TPathsUpdate)
{code}
These duplicates can be eliminated by inserting String.intern() calls in the
appropriate places, as sketched below.
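A sketch of where such a call could go; the setter shown is a hypothetical
example, and the actual insertion points in the hand-written or generated
thrift code may differ:
{code}
// Hypothetical setter: interning makes all entries that share the same
// path element reference a single canonical String instance.
public void setPathElement(String pathElement) {
    this.pathElement = pathElement == null ? null : pathElement.intern();
}
{code}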
> Minimize TPathsDump thrift message used in HDFS sync
> ----------------------------------------------------
>
> Key: SENTRY-1827
> URL: https://issues.apache.org/jira/browse/SENTRY-1827
> Project: Sentry
> Issue Type: Improvement
> Affects Versions: 1.8.0
> Reporter: Misha Dmitriev
> Assignee: Misha Dmitriev
> Fix For: 1.8.0
>
>
> We obtained a heap dump, taken from the JVM running Hive Metastore at the
> time a Sentry HDFS sync operation was performed. I analyzed this dump with
> jxray (www.jxray.com) and found that a significant percentage of memory is
> wasted due to duplicate strings:
> {code}
> 7. DUPLICATE STRINGS
> Total strings: 29,986,017 Unique strings: 9,640,413 Duplicate values:
> 4,897,743 Overhead: 2,570,746K (9.4%)
> {code}
> Of these, more than a third come from Sentry:
> {code}
> 917,331K (3.3%), 10517636 dup strings (498477 unique), 10517636 dup backing
> arrays:
> <-- org.apache.sentry.hdfs.service.thrift.TPathEntry.pathElement <--
> {j.u.HashMap}.values <--
> org.apache.sentry.hdfs.service.thrift.TPathsDump.nodeMap <--
> org.apache.sentry.hdfs.service.thrift.TPathsUpdate.pathsDump <-- Java
> Local@7fea0851c360 (org.apache.sentry.hdfs.service.thrift.TPathsUpdate)
> {code}
> The duplicate strings in memory have been eliminated by SENTRY-1811. However,
> when these strings are serialized into the TPathsDump thrift message, they
> are duplicated again. That is, if there are 3 different TPathEntry objects
> with the same pathElement="foo", then (even though there is only one interned
> copy of the "foo" string in memory) a separate copy of "foo" is written to
> the serialized message for each of these 3 TPathEntries. This is one reason
> why serialized TPathsDump messages may get very big, consume a lot of memory
> and take a long time to send over the network.
> To address this problem we may use some form of custom compression: instead
> of writing multiple copies of a duplicate string, we write the full string
> once and substitute later occurrences with a shorter "string id".
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)