[ 
https://issues.apache.org/jira/browse/SENTRY-1892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16137673#comment-16137673
 ] 

Misha Dmitriev commented on SENTRY-1892:
----------------------------------------

[~akolb] Yes, Sets are needed for HMSPaths$Entry.authzObjs: when 
adding or removing elements there, the code relies on set semantics to avoid 
adding duplicate elements. But when these sets are serialized and deserialized, 
there is no reason to keep set semantics for the copies of their contents in 
TPathEntry.authzObjs. That is, TPathEntry.authzObjs and TPathEntry.children are 
used only to hold the set contents temporarily, before serialization and after 
deserialization, and don't really need the set functionality.
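
A minimal sketch of that round-trip, with hypothetical field and method names 
(the real Thrift-generated accessors may differ):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Sketch: the serialized form (TPathEntry-like) only needs an ordered dump
// of the set contents, so a plain List is sufficient there.
public class AuthzObjsRoundTrip {
    public static void main(String[] args) {
        // In-memory form: a real Set, needed to reject duplicates.
        Set<String> authzObjs = new TreeSet<>();
        authzObjs.add("db1.tbl1");
        authzObjs.add("db1.tbl2");
        authzObjs.add("db1.tbl1");  // duplicate, silently ignored by the Set

        // Before serialization: copy into a plain List (hypothetical TPathEntry field).
        List<String> serializedForm = new ArrayList<>(authzObjs);

        // After deserialization: rebuild the Set. Duplicates cannot occur,
        // because the List was produced from a Set in the first place.
        Set<String> restored = new TreeSet<>(serializedForm);

        System.out.println(restored.equals(authzObjs));
    }
}
```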

Regarding your other question, about Collections.singleton(Object 
singleElement): I've checked its code, and it will indeed save some memory. 
If you have an ordinary TreeSet, HashSet, etc. with just one element X, there 
are at least two objects in memory in addition to X: the Set itself and a 
Set$Entry. In the case of HashSet, there is one more extra object (an array): 
HashSet -> Entry[] -> Entry -> X.

A call to singleton() creates an instance of the internal class 
Collections$SingletonSet { Object singleElement; ... }. So it creates just 
one extra object instead of 2-3, and some memory is saved.

However, this singleton set is immutable, so if we used it, we would still 
need to make ugly changes to the code: check whether our set is a singleton, 
replace it with a real TreeSet when more elements are about to be added, etc. 
That would make the code almost the same as it is now, yet still cost one 
extra object per single-element set, i.e. yield much smaller memory savings.
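
For illustration, the kind of check-and-replace logic that 
Collections.singleton() would force on every mutation, sketched with 
hypothetical names:

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

// Sketch of the mutation path if 1-element sets were stored as
// Collections.singleton() instances: those are immutable, so the set
// must be replaced with a real TreeSet before a second element is added.
public class SingletonSetGrowth {
    private Set<String> authzObjs;  // hypothetical field, null when empty

    void addAuthzObj(String obj) {
        if (authzObjs == null) {
            authzObjs = Collections.singleton(obj);  // one extra object only
        } else if (!(authzObjs instanceof TreeSet)) {
            // Immutable singleton: must be replaced before it can grow.
            Set<String> grown = new TreeSet<>(authzObjs);
            grown.add(obj);
            authzObjs = grown;
        } else {
            authzObjs.add(obj);
        }
    }

    public static void main(String[] args) {
        SingletonSetGrowth entry = new SingletonSetGrowth();
        entry.addAuthzObj("db1.tbl1");
        entry.addAuthzObj("db1.tbl2");
        System.out.println(entry.authzObjs.size());
    }
}
```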

> Reduce memory consumption of HMSPath$Entry and TPathEntry
> ---------------------------------------------------------
>
>                 Key: SENTRY-1892
>                 URL: https://issues.apache.org/jira/browse/SENTRY-1892
>             Project: Sentry
>          Issue Type: Improvement
>          Components: Hdfs Plugin
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev
>         Attachments: SENTRY-1892.01.patch
>
>
> We recently analyzed with jxray (www.jxray.com) some heap dumps from a 
> NameNode running in a big HDFS installation with Sentry enabled. One dump is 
> particularly interesting, because it was taken while a full Sentry update was 
> in progress. Because of that, used heap was at its maximum: both the old 
> HMSPaths$Entry tree of objects and the data for the new one, in TPathEntry 
> objects, were in memory.
> The old and new Sentry-related data take a pretty large portion of the heap, 
> 7.9% and 12.9% respectively:
> {code}
>  ---- Object tree for GC root(s) Java Local@7f9c9a0b7808 
> (org.apache.sentry.hdfs.SentryAuthorizationInfo) ----
>   2,302,963K (7.9%) (1 of org.apache.sentry.hdfs.SentryAuthorizationInfo)
>      <-- Java Local@7f9c9a0b7808 
> (org.apache.sentry.hdfs.SentryAuthorizationInfo)
> ....
>  ---- Object tree for GC root(s) Java Local@7f9c2b9138c8 
> (org.apache.sentry.hdfs.service.thrift.TPathsDump) ----
>   3,760,229K (12.9%) (1 of org.apache.sentry.hdfs.service.thrift.TPathsDump)
>      <-- Java Local@7f9c2b9138c8 
> (org.apache.sentry.hdfs.service.thrift.TPathsDump)
> ...
> {code}
> This is a very considerable portion of the heap. Furthermore, the second 
> portion - the data in TPathsDump - is mostly temporary, and creates a big 
> memory spike, many extra GC pauses, and in the worst case may cause a crash 
> due to OOM. Thus it's very desirable to reduce memory used by these data 
> structures.
> It appears that some of the data structures used here are suboptimal in terms 
> of memory. Here is the list of things that can be fixed:
> 1. TPathEntry.children and TPathEntry.authzObjs are both defined as sets in 
> sentry_hdfs_service.thrift. In the Java code, they become HashSets. However, 
> no real set operations (check for element, add element...) are performed on 
> them. Rather, they are used as simple lists, from which the respective data 
> structures in HMSPaths$Entry are initialized. HashSets are very uneconomical 
> in terms of memory, because they reuse HashMap code, and one HashMap$Entry 
> object, taking 32-48 bytes, is created for each element. According to the 
> class histogram in the dump, HashSets take 5.8% of the heap. Thus if we 
> replace sets with lists in TPathEntry, we can reduce heap usage substantially.
> 2. JXRay analysis for suboptimal collections shows the following:
> {code}
> 9. BAD COLLECTIONS
> Total collections: 40,324,452  Bad collections: 26,076,002  Overhead: 
> 3,361,873K (11.6%)
> Top bad collections:
>     Ovhd           Problem           Num objs      Type
> -------------------------------------------------------
> 922,908K (3.2%)     1-elem      5133339 (54%)     j.u.HashSet
> 646,707K (2.2%)     1-elem      3941834 (98%)     j.u.TreeSet
> 459,824K (1.6%)     1-elem      1731283 (10%)     j.u.HashMap
> 339,906K (1.2%)      empty      3625374 (38%)     j.u.HashSet
> 282,265K (1.0%)      empty      3985194 (25%)     j.u.HashMap
> 276,279K (1.0%)     1-elem      3926377 (55%)     j.u.ArrayList
> 163,534K (0.6%)      small        572788 (3%)     j.u.HashMap
> 138,729K (0.5%)      small        574613 (6%)     j.u.HashSet
> 116,041K (0.4%)      small      2472638 (35%)     j.u.ArrayList
> ===================================================
> 10. REFERENCE CHAINS FOR BAD COLLECTIONS
> Expensive data fields:
>   901,846K (3.1%): j.u.HashMap: 1727607 / 27% of 1-elem 458,895K (1.6%), 
> 3984640 / 62% of empty 280,170K (1.0%), 570069 / 8% of small 162,780K (0.6%)
>      <-- org.apache.sentry.hdfs.HMSPaths$Entry.children
>   656,117K (2.3%): j.u.TreeSet: 3941248 / 98% of 1-elem 646,611K (2.2%)
>      <-- org.apache.sentry.hdfs.HMSPaths$Entry.authzObjs
> ...
> {code}
> That is, in the permanent Sentry data structures, 1-element 
> HMSPaths$Entry.children tables and 1-element HMSPaths$Entry.authzObjs sets 
> cause noticeable overhead. We can optimize these data structures by 
> replacing each collection field with a plain Object and using a trick like:
> {code}
> // Before:
> private List<Foo> fooList = new ArrayList<>();
>
> void addFoo(Foo foo) {
>   fooList.add(foo);
> }
>
> // After, with an optimization for the 0- and 1-element cases:
> private Object fooObjOrList;  // null initially
>
> void addFoo(Foo foo) {
>   if (fooObjOrList == null) {
>     fooObjOrList = foo;             // 0 -> 1 element: store the bare object
>   } else if (fooObjOrList instanceof Foo) {
>     // 1 -> 2 elements: switch from the bare object to a real list
>     List<Foo> fooList = new ArrayList<>();
>     fooList.add((Foo) fooObjOrList);
>     fooList.add(foo);
>     fooObjOrList = fooList;
>   } else {
>     ((List<Foo>) fooObjOrList).add(foo);  // unchecked cast, safe by construction
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
