[ 
https://issues.apache.org/jira/browse/SENTRY-1892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Misha Dmitriev updated SENTRY-1892:
-----------------------------------
    Status: Patch Available  (was: In Progress)

> Reduce memory consumption of HMSPath$Entry and TPathEntry
> ---------------------------------------------------------
>
>                 Key: SENTRY-1892
>                 URL: https://issues.apache.org/jira/browse/SENTRY-1892
>             Project: Sentry
>          Issue Type: Improvement
>          Components: Hdfs Plugin
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev
>         Attachments: SENTRY-1892.01.patch, SENTRY-1892.02.patch
>
>
> We recently used JXRay (www.jxray.com) to analyze some heap dumps from a
> NameNode running in a large HDFS installation with Sentry enabled. One dump
> is particularly interesting because it was taken while a full Sentry update
> was in progress. As a result, used heap was at its maximum: memory held both
> the old HMSPaths$Entry object tree and the data for the new one, in
> TPathEntry objects.
> The old and new Sentry-related data take a large portion of the heap, 7.9%
> and 12.9% respectively:
> {code}
>  ---- Object tree for GC root(s) Java Local@7f9c9a0b7808 (org.apache.sentry.hdfs.SentryAuthorizationInfo) ----
>   2,302,963K (7.9%) (1 of org.apache.sentry.hdfs.SentryAuthorizationInfo)
>      <-- Java Local@7f9c9a0b7808 (org.apache.sentry.hdfs.SentryAuthorizationInfo)
> ....
>  ---- Object tree for GC root(s) Java Local@7f9c2b9138c8 (org.apache.sentry.hdfs.service.thrift.TPathsDump) ----
>   3,760,229K (12.9%) (1 of org.apache.sentry.hdfs.service.thrift.TPathsDump)
>      <-- Java Local@7f9c2b9138c8 (org.apache.sentry.hdfs.service.thrift.TPathsDump)
> ...
> {code}
> Together that is over 20% of the heap. Furthermore, the second portion, the
> data in TPathsDump, is mostly temporary: it creates a large memory spike and
> many extra GC pauses, and in the worst case may cause a crash due to OOM. It
> is therefore very desirable to reduce the memory used by these data
> structures.
> Some of the data structures used here appear to be suboptimal in terms of
> memory. Here is a list of things that can be fixed:
> 1. TPathEntry.children and TPathEntry.authzObjs are both defined as sets in
> sentry_hdfs_service.thrift, so in the Java code they become HashSets.
> However, no real set operations (check for element, add element...) are used
> on them. Rather, they are used as simple lists from which the respective
> data structures in HMSPaths$Entry are initialized. HashSets are very
> uneconomical in terms of memory, because they reuse the HashMap code, and one
> HashMap$Entry object, taking 32-48 bytes, is created for each element.
> According to the class histogram in the dump, HashSets take 5.8% of the
> heap. Thus, replacing sets with lists in TPathEntry should reduce heap usage
> substantially.
> 2. JXRay analysis for suboptimal collections shows the following:
> {code}
> 9. BAD COLLECTIONS
> Total collections: 40,324,452  Bad collections: 26,076,002  Overhead: 3,361,873K (11.6%)
> Top bad collections:
>     Ovhd           Problem           Num objs      Type
> -------------------------------------------------------
> 922,908K (3.2%)     1-elem      5133339 (54%)     j.u.HashSet
> 646,707K (2.2%)     1-elem      3941834 (98%)     j.u.TreeSet
> 459,824K (1.6%)     1-elem      1731283 (10%)     j.u.HashMap
> 339,906K (1.2%)      empty      3625374 (38%)     j.u.HashSet
> 282,265K (1.0%)      empty      3985194 (25%)     j.u.HashMap
> 276,279K (1.0%)     1-elem      3926377 (55%)     j.u.ArrayList
> 163,534K (0.6%)      small        572788 (3%)     j.u.HashMap
> 138,729K (0.5%)      small        574613 (6%)     j.u.HashSet
> 116,041K (0.4%)      small      2472638 (35%)     j.u.ArrayList
> ===================================================
> 10. REFERENCE CHAINS FOR BAD COLLECTIONS
> Expensive data fields:
>   901,846K (3.1%): j.u.HashMap: 1727607 / 27% of 1-elem 458,895K (1.6%), 3984640 / 62% of empty 280,170K (1.0%), 570069 / 8% of small 162,780K (0.6%)
>      <-- org.apache.sentry.hdfs.HMSPaths$Entry.children
>   656,117K (2.3%): j.u.TreeSet: 3941248 / 98% of 1-elem 646,611K (2.2%)
>      <-- org.apache.sentry.hdfs.HMSPaths$Entry.authzObjs
> ...
> {code}
> That is, in the permanent Sentry data structures, 1-element
> HMSPaths$Entry.children tables and 1-element HMSPaths$Entry.authzObjs sets
> cause noticeable overhead. We can optimize these fields by declaring them as
> plain Object and storing a single element directly, allocating a collection
> only when a second element arrives. The trick looks like this:
> {code}
> // Before:
>   private List<Foo> fooList = new ArrayList<>();
>
>   void addFoo(Foo foo) {
>     fooList.add(foo);
>   }
> // After, with an optimization for sizes 0 and 1:
>   // null while empty, a Foo for one element, a List<Foo> for two or more
>   private Object fooObjOrList;
>
>   @SuppressWarnings("unchecked")
>   void addFoo(Foo foo) {
>     if (fooObjOrList == null) {
>       fooObjOrList = foo;                     // first element: store directly
>     } else if (fooObjOrList instanceof Foo) {
>       List<Foo> fooList = new ArrayList<>();  // second element: promote to a list
>       fooList.add((Foo) fooObjOrList);
>       fooList.add(foo);
>       fooObjOrList = fooList;
>     } else {
>       ((List<Foo>) fooObjOrList).add(foo);    // already a list
>     }
>   }
> {code}
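
For illustration only, here is a self-contained sketch of the same 0/1/many trick that also covers the read side, which the snippet above leaves out. The class name CompactFooHolder and its methods are hypothetical and are not taken from the Sentry code or from the attached patches. The sketch assumes that elements are never null and that the element type is never itself an ArrayList, since the stored object's type is what distinguishes the "one element" state from the "list" state.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical holder (not a Sentry class) that avoids a collection for 0 or 1 elements. */
final class CompactFooHolder<T> {
  // null: empty; a T: exactly one element; an ArrayList<T>: two or more elements.
  // Assumption: T is never an ArrayList, otherwise the instanceof checks are ambiguous.
  private Object elemOrList;

  void add(T elem) {
    if (elem == null) {
      throw new NullPointerException("elem");  // null is reserved for the empty state
    }
    if (elemOrList == null) {
      elemOrList = elem;                       // first element: store it directly
    } else if (elemOrList instanceof ArrayList) {
      asList().add(elem);                      // already promoted to a list
    } else {
      List<T> list = new ArrayList<>(2);       // second element: promote to a list
      list.add(single());
      list.add(elem);
      elemOrList = list;
    }
  }

  int size() {
    if (elemOrList == null) {
      return 0;
    }
    return (elemOrList instanceof ArrayList) ? asList().size() : 1;
  }

  /** Read-only view; the wrappers for the 0 and 1 cases are created only on demand. */
  List<T> view() {
    if (elemOrList == null) {
      return Collections.emptyList();
    }
    if (elemOrList instanceof ArrayList) {
      return Collections.unmodifiableList(asList());
    }
    return Collections.singletonList(single());
  }

  @SuppressWarnings("unchecked")
  private List<T> asList() {
    return (List<T>) elemOrList;
  }

  @SuppressWarnings("unchecked")
  private T single() {
    return (T) elemOrList;
  }
}
```

The design choice is to pay for a short-lived wrapper on reads instead of a permanent collection per entry. That trade only makes sense when, as in the dump above, most instances hold zero or one element.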


