-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60843/
-----------------------------------------------------------
Review request for sentry, Alexander Kolbasov and kalyan kumar kalvagadda.
Repository: sentry
Description
-------
We obtained a heap dump from the JVM running Hive Metastore, taken while a
Sentry HDFS sync operation was in progress. I analyzed this dump with jxray
(www.jxray.com) and found that a significant share of memory (9.4%) is wasted
on duplicate strings:
{code}
7. DUPLICATE STRINGS
Total strings: 29,986,017 Unique strings: 9,640,413 Duplicate values:
4,897,743 Overhead: 2,570,746K (9.4%)
{code}
More than a third of that overhead comes from Sentry:
{code}
917,331K (3.3%), 10517636 dup strings (498477 unique), 10517636 dup backing
arrays:
<-- org.apache.sentry.hdfs.service.thrift.TPathEntry.pathElement <--
{j.u.HashMap}.values <--
org.apache.sentry.hdfs.service.thrift.TPathsDump.nodeMap <--
org.apache.sentry.hdfs.service.thrift.TPathsUpdate.pathsDump <-- Java
Local@7fea0851c360 (org.apache.sentry.hdfs.service.thrift.TPathsUpdate)
{code}
The duplicate strings in memory have been eliminated by SENTRY-1811. However,
when these strings are serialized into the TPathsDump thrift message, they are
duplicated again. That is, if there are 3 different TPathEntry objects with the
same pathElement="foo", then a separate copy of "foo" is written to the
serialized message for each of these 3 TPathEntries, even if memory holds only
one interned copy of the "foo" string. This is one reason why serialized
TPathsDump messages can get very big, consume a lot of memory, and take a long
time to send over the network.
To address this problem, we can use a form of custom compression: instead of
writing multiple copies of a duplicate string, we write each distinct string
once and substitute every occurrence with a shorter "string id".
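As a rough illustration of the idea (not the actual format used in this patch),
the sketch below replaces each path element with an integer index into a
dictionary of unique strings. The class StringDedupSketch, its Encoded holder,
and the encode/decode methods are hypothetical names made up for this example;
the real change lives in TPathsDump and the dumper classes listed under Diffs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; this is NOT the TPathsDump layout from the patch.
public class StringDedupSketch {

  /** A dictionary of unique strings plus one integer id per original element. */
  public static final class Encoded {
    public final List<String> dictionary;
    public final int[] ids;

    Encoded(List<String> dictionary, int[] ids) {
      this.dictionary = dictionary;
      this.ids = ids;
    }
  }

  /** Writes each distinct string once and refers to it by its dictionary index. */
  public static Encoded encode(List<String> pathElements) {
    Map<String, Integer> idByString = new HashMap<>();
    List<String> dictionary = new ArrayList<>();
    int[] ids = new int[pathElements.size()];
    for (int i = 0; i < ids.length; i++) {
      String s = pathElements.get(i);
      Integer id = idByString.get(s);
      if (id == null) {
        // First occurrence: assign the next free id and store the string once.
        id = dictionary.size();
        dictionary.add(s);
        idByString.put(s, id);
      }
      ids[i] = id;
    }
    return new Encoded(dictionary, ids);
  }

  /** Restores the original sequence; repeated elements share one String object. */
  public static List<String> decode(Encoded encoded) {
    List<String> out = new ArrayList<>(encoded.ids.length);
    for (int id : encoded.ids) {
      out.add(encoded.dictionary.get(id));
    }
    return out;
  }

  public static void main(String[] args) {
    Encoded e = encode(Arrays.asList("foo", "bar", "foo", "foo"));
    // Prints: [foo, bar] [0, 1, 0, 0]
    System.out.println(e.dictionary + " " + Arrays.toString(e.ids));
  }
}
```

With this kind of encoding, "foo" is serialized once no matter how many entries
use it. A side effect is that the receiver also gets deduplicated strings after
decoding, since every repeated element points at the same dictionary entry.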
Diffs
-----
sentry-hdfs/sentry-hdfs-common/src/gen/thrift/gen-javabean/org/apache/sentry/hdfs/service/thrift/TPathsDump.java 722ad76d9
sentry-hdfs/sentry-hdfs-common/src/main/java/org/apache/sentry/hdfs/AuthzPathsDumper.java 095095710
sentry-hdfs/sentry-hdfs-common/src/main/java/org/apache/sentry/hdfs/HMSPathsDumper.java 479188e51
sentry-hdfs/sentry-hdfs-common/src/main/java/org/apache/sentry/hdfs/Updateable.java e777e4b1a
sentry-hdfs/sentry-hdfs-common/src/main/java/org/apache/sentry/hdfs/UpdateableAuthzPaths.java 08a3b3e92
sentry-hdfs/sentry-hdfs-common/src/main/resources/sentry_hdfs_service.thrift b0a1f877b
sentry-hdfs/sentry-hdfs-common/src/test/java/org/apache/sentry/hdfs/TestHMSPathsFullDump.java 194ffb755
sentry-hdfs/sentry-hdfs-common/src/test/java/org/apache/sentry/hdfs/TestUpdateableAuthzPaths.java 9a726da27
sentry-hdfs/sentry-hdfs-namenode-plugin/src/main/java/org/apache/sentry/hdfs/UpdateableAuthzPermissions.java 89a3297d4
sentry-hdfs/sentry-hdfs-service/src/main/java/org/apache/sentry/hdfs/PathImageRetriever.java 2426b4079
Diff: https://reviews.apache.org/r/60843/diff/1/
Testing
-------
Thanks,
Arjun Mishra