[ https://issues.apache.org/jira/browse/SENTRY-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Misha Dmitriev updated SENTRY-1827:
-----------------------------------
    Affects Version/s:     (was: sentry-ha-redesign)

> Minimize TPathsDump thrift message used in HDFS sync
> ----------------------------------------------------
>
>                 Key: SENTRY-1827
>                 URL: https://issues.apache.org/jira/browse/SENTRY-1827
>             Project: Sentry
>          Issue Type: Improvement
>    Affects Versions: 1.8.0
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev
>             Fix For: 1.8.0
>
>
> We obtained a heap dump from the JVM running Hive Metastore, taken while a 
> Sentry HDFS sync operation was in progress. I analyzed this dump with 
> jxray (www.jxray.com) and found that more than 19% of memory is wasted on 
> empty or suboptimally sized Java collections:
> {code}
> 9. BAD COLLECTIONS
> Total collections: 54,057,249  Bad collections: 31,569,606  Overhead: 
> 5,292,821K (19.3%)
> {code}
> Most of these collections come from thrift classes used by the Sentry plugin 
> (see below). The associated memory waste could be significantly reduced or 
> eliminated if these collections were allocated lazily, and with an 
> initial capacity smaller than the default of 16 for HashMap/HashSet.
> {code}
>   1,869,023K (6.8%): j.u.HashSet: 3388670 of 1-elem 979,537K (3.6%), 5897806 
> of empty 552,919K (2.0%), 1010321 of small 336,566K (1.2%)
>      <-- org.apache.sentry.hdfs.service.thrift.TPathEntry.children <--  
> {j.u.HashMap}.values <-- 
> org.apache.sentry.hdfs.service.thrift.TPathsDump.nodeMap <-- 
> org.apache.sentry.hdfs.service.thrift.TPathsUpdate.pathsDump <-- Java 
> Local@7fea0851c360 (org.apache.sentry.hdfs.service.thrift.TPathsUpdate)
>   1,190,050K (4.3%): j.u.HashMap: 3382765 of 1-elem 898,546K (3.3%), 1005341 
> of small 291,503K (1.1%)
>      <-- org.apache.sentry.hdfs.HMSPaths$Entry.children <-- 
> org.apache.sentry.hdfs.HMSPaths$Entry.{parent} <--  {j.u.HashSet} <--  
> {j.u.TreeMap}.values <-- org.apache.sentry.hdfs.HMSPaths.authzObjToPath <-- 
> org.apache.sentry.hdfs.UpdateableAuthzPaths.paths <-- 
> org.apache.sentry.hdfs.MetastorePlugin.authzPaths <-- Java Local@7fe4fe84e030 
> (org.apache.sentry.hdfs.MetastorePlugin)
>   969,442K (3.5%): j.u.TreeSet: 5907188 of 1-elem 969,148K (3.5%)
>      <-- org.apache.sentry.hdfs.service.thrift.TPathEntry.authzObjs <--  
> {j.u.HashMap}.values <-- 
> org.apache.sentry.hdfs.service.thrift.TPathsDump.nodeMap <-- 
> org.apache.sentry.hdfs.service.thrift.TPathsUpdate.pathsDump <-- Java 
> Local@7fea0851c360 (org.apache.sentry.hdfs.service.thrift.TPathsUpdate)
>   487,690K (1.8%): j.u.TreeSet: 4801877 of empty 487,690K (1.8%)
>      <-- org.apache.sentry.hdfs.HMSPaths$Entry.authzObjs <-- 
> org.apache.sentry.hdfs.HMSPaths$Entry.{parent} <--  {j.u.HashSet} <--  
> {j.u.TreeMap}.values <-- org.apache.sentry.hdfs.HMSPaths.authzObjToPath <-- 
> org.apache.sentry.hdfs.UpdateableAuthzPaths.paths <-- 
> org.apache.sentry.hdfs.MetastorePlugin.authzPaths <-- Java Local@7fe4fe84e030 
> (org.apache.sentry.hdfs.MetastorePlugin)
>   415,064K (1.5%): j.u.HashMap: 5897806 of empty 414,689K (1.5%)
>      <-- org.apache.sentry.hdfs.HMSPaths$Entry.children <--  {j.u.HashSet} 
> <--  {j.u.TreeMap}.values <-- org.apache.sentry.hdfs.HMSPaths.authzObjToPath 
> <-- org.apache.sentry.hdfs.UpdateableAuthzPaths.paths <-- 
> org.apache.sentry.hdfs.MetastorePlugin.authzPaths <-- Java Local@7fe4fe84e030 
> (org.apache.sentry.hdfs.MetastorePlugin)
> {code}
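A minimal sketch of the lazy-allocation idea described above. LazyPathNode is a hypothetical stand-in for tree-node classes such as TPathEntry or HMSPaths$Entry; it is not the actual Sentry code and not necessarily how the fix was implemented. The set of children is left unallocated until the first child is added, and is then created with a small requested capacity instead of the default:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical class for illustration only; not part of Sentry.
final class LazyPathNode {
    private final String pathElement;

    // Stays null until the first child is added, so leaf nodes
    // never pay for an empty HashSet.
    private Set<Integer> children;

    LazyPathNode(String pathElement) {
        this.pathElement = pathElement;
    }

    void addChild(int childId) {
        if (children == null) {
            // Request a small initial capacity rather than the default 16;
            // the set still grows automatically if more children arrive.
            children = new HashSet<>(1);
        }
        children.add(childId);
    }

    // Callers get a shared immutable empty set instead of null.
    Set<Integer> getChildren() {
        return children == null ? Collections.<Integer>emptySet() : children;
    }

    // Exposed only so the sketch can show when allocation happens.
    boolean hasChildrenAllocated() {
        return children != null;
    }

    String getPathElement() {
        return pathElement;
    }
}
```

With this shape, the "empty" category in the report above would disappear for such nodes, and the 1-element sets would sit on a much smaller backing table.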
> Additionally, a significant percentage of memory is wasted on duplicate 
> strings:
> {code}
> 7. DUPLICATE STRINGS
> Total strings: 29,986,017  Unique strings: 9,640,413  Duplicate values: 
> 4,897,743  Overhead: 2,570,746K (9.4%)
> {code}
> More than a third of this overhead comes from Sentry:
> {code}
>   917,331K (3.3%), 10517636 dup strings (498477 unique), 10517636 dup backing 
> arrays:
>      <-- org.apache.sentry.hdfs.service.thrift.TPathEntry.pathElement <--  
> {j.u.HashMap}.values <-- 
> org.apache.sentry.hdfs.service.thrift.TPathsDump.nodeMap <-- 
> org.apache.sentry.hdfs.service.thrift.TPathsUpdate.pathsDump <-- Java 
> Local@7fea0851c360 (org.apache.sentry.hdfs.service.thrift.TPathsUpdate)
> {code}
> These can be eliminated by inserting String.intern() calls in the appropriate 
> places.
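
A minimal sketch of the String.intern() suggestion above. InternedPathEntry is a hypothetical class used only for illustration; where the calls would actually go in the Sentry or generated thrift code is not specified in this issue. Interning the path element when it is stored means that equal strings share one instance:

```java
// Hypothetical class for illustration only; not part of Sentry.
final class InternedPathEntry {
    private final String pathElement;

    InternedPathEntry(String pathElement) {
        // intern() returns the canonical instance for this value, so
        // entries with equal path elements end up sharing one String.
        this.pathElement = (pathElement == null) ? null : pathElement.intern();
    }

    String getPathElement() {
        return pathElement;
    }
}
```

For example, two entries built from separately allocated but equal strings return the same object from getPathElement(), so only one copy of the string and its backing array stays reachable through them.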



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
