-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/60843/#review180548
-----------------------------------------------------------


Ship it!




Ship It!

- Arjun Mishra


On July 13, 2017, 6:03 p.m., Arjun Mishra wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/60843/
> -----------------------------------------------------------
> 
> (Updated July 13, 2017, 6:03 p.m.)
> 
> 
> Review request for sentry, Alexander Kolbasov and kalyan kumar kalvagadda.
> 
> 
> Repository: sentry
> 
> 
> Description
> -------
> 
> We obtained a heap dump taken from the JVM running Hive Metastore at the time 
> when Sentry HDFS sync operation was performed. I've analyzed this dump with 
> jxray (www.jxray.com) and found that  a significant percentage of memory is 
> wasted due to duplicate strings:
> 
> {code}
> 7. DUPLICATE STRINGS
> 
> Total strings: 29,986,017  Unique strings: 9,640,413  Duplicate values: 
> 4,897,743  Overhead: 2,570,746K (9.4%)
> {code}
> 
> Of them, more than 1/3 come from sentry:
> 
> {code}
>   917,331K (3.3%), 10517636 dup strings (498477 unique), 10517636 dup backing 
> arrays:
>      <-- org.apache.sentry.hdfs.service.thrift.TPathEntry.pathElement <--  
> {j.u.HashMap}.values <-- 
> org.apache.sentry.hdfs.service.thrift.TPathsDump.nodeMap <-- 
> org.apache.sentry.hdfs.service.thrift.TPathsUpdate.pathsDump <-- Java 
> Local@7fea0851c360 (org.apache.sentry.hdfs.service.thrift.TPathsUpdate)
> {code}
> 
> The duplicate strings in memory have been eliminated by SENTRY-1811. However, 
> when these strings are serialized into the TPathsDump thrift message, they 
> are duplicated again. That is, if there are 3 different TPathEntry objects 
> with the same pathElement="foo", then (even if there is only one interned 
> copy of the "foo" string in memory), a separate copy of "foo" will be written 
> to the serialized message for each of these 3 TPathEntries. This is one 
> reason why TPathsDump serialized messages may get very big, consume a lot of 
> memory and take long time to send over the network.
> 
> To address this problem we may use some form of custom compression, where we 
> don't write multiple copies of duplicate strings, but rather substitute them 
> with some shorter "string ids".
> 
> 
> Diffs
> -----
> 
>   
> sentry-hdfs/sentry-hdfs-common/src/gen/thrift/gen-javabean/org/apache/sentry/hdfs/service/thrift/TPathsDump.java
>  722ad76d9 
>   
> sentry-hdfs/sentry-hdfs-common/src/main/java/org/apache/sentry/hdfs/AuthzPathsDumper.java
>  095095710 
>   
> sentry-hdfs/sentry-hdfs-common/src/main/java/org/apache/sentry/hdfs/HMSPathsDumper.java
>  479188e51 
>   
> sentry-hdfs/sentry-hdfs-common/src/main/java/org/apache/sentry/hdfs/Updateable.java
>  e777e4b1a 
>   
> sentry-hdfs/sentry-hdfs-common/src/main/java/org/apache/sentry/hdfs/UpdateableAuthzPaths.java
>  08a3b3e92 
>   
> sentry-hdfs/sentry-hdfs-common/src/main/resources/sentry_hdfs_service.thrift 
> b0a1f877b 
>   
> sentry-hdfs/sentry-hdfs-common/src/test/java/org/apache/sentry/hdfs/TestHMSPathsFullDump.java
>  194ffb755 
>   
> sentry-hdfs/sentry-hdfs-common/src/test/java/org/apache/sentry/hdfs/TestUpdateableAuthzPaths.java
>  9a726da27 
>   
> sentry-hdfs/sentry-hdfs-namenode-plugin/src/main/java/org/apache/sentry/hdfs/UpdateableAuthzPermissions.java
>  89a3297d4 
>   
> sentry-hdfs/sentry-hdfs-service/src/main/java/org/apache/sentry/hdfs/PathImageRetriever.java
>  2426b4079 
> 
> 
> Diff: https://reviews.apache.org/r/60843/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Arjun Mishra
> 
>

Reply via email to