[
https://issues.apache.org/jira/browse/SENTRY-1892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexander Kolbasov updated SENTRY-1892:
---------------------------------------
Resolution: Fixed
Status: Resolved (was: Patch Available)
Transfering +2 from reviewboard.
[[email protected]] Thank you for your contribution!
> Reduce memory consumption of HMSPath$Entry and TPathEntry
> ---------------------------------------------------------
>
> Key: SENTRY-1892
> URL: https://issues.apache.org/jira/browse/SENTRY-1892
> Project: Sentry
> Issue Type: Improvement
> Components: Hdfs Plugin
> Reporter: Misha Dmitriev
> Assignee: Misha Dmitriev
> Attachments: SENTRY-1892.01.patch, SENTRY-1892.02.patch
>
>
> We recently analyzed with jxray (www.jxray.com) some heap dumps from NameNode
> running in a big HDFS installation with Sentry enabled. One dump is
> particularly interesting, because it was taken when a full Sentry update was
> in progress. Because of it, used heap was at its maximum: there were both the
> old HMSPath$Entry tree of objects in memory, and the data for the new one in
> TPathEntry objects.
> The old and new Sentry-related data take a pretty large portion of the heap,
> 7.9% and 12.9% respectively:
> {code}
> ---- Object tree for GC root(s) Java Local@7f9c9a0b7808
> (org.apache.sentry.hdfs.SentryAuthorizationInfo) ----
> 2,302,963K (7.9%) (1 of org.apache.sentry.hdfs.SentryAuthorizationInfo)
> <-- Java Local@7f9c9a0b7808
> (org.apache.sentry.hdfs.SentryAuthorizationInfo)
> ....
> ---- Object tree for GC root(s) Java Local@7f9c2b9138c8
> (org.apache.sentry.hdfs.service.thrift.TPathsDump) ----
> 3,760,229K (12.9%) (1 of org.apache.sentry.hdfs.service.thrift.TPathsDump)
> <-- Java Local@7f9c2b9138c8
> (org.apache.sentry.hdfs.service.thrift.TPathsDump)
> ...
> {code}
> This is a very considerable portion of the heap. Furthermore, the second
> portion - the data in TPathsDump - is mostly temporary, and creates a big
> memory spike, many extra GC pauses, and in the worst case may cause a crash
> due to OOM. Thus it's very desirable to reduce memory used by these data
> structures.
> It appears that some of the data structures used here are suboptimal in terms
> of memory. Here is the list of things that can be fixed:
> 1. TPathEntry.children and TPathEntry.authzObjs are both defined as sets in
> sentry_hdfs_service.thrift. In the Java code, they become HashSets. However,
> no real set operations (check for element, add element...) are used on them.
> Rather, they are used as simple lists, from which the respective data
> structures in HMSPaths$Entry are initialized. HashSets are very ineconomical
> in terms of memory, because they reuse HashMap code, and one HashMap$Entry
> object, taking 32-48 bytes, is created for each hash element. From the class
> histogram in the dump, HashSets are taking 5.8% of the heap. Thus if we
> replace sets with lists in TPathEntry, we can reduce heap substantially.
> 2. JXRay analysis for suboptimal collections shows the following:
> {code}
> 9. BAD COLLECTIONS
> Total collections: 40,324,452 Bad collections: 26,076,002 Overhead:
> 3,361,873K (11.6%)
> Top bad collections:
> Ovhd Problem Num objs Type
> -------------------------------------------------------
> 922,908K (3.2%) 1-elem 5133339 (54%) j.u.HashSet
> 646,707K (2.2%) 1-elem 3941834 (98%) j.u.TreeSet
> 459,824K (1.6%) 1-elem 1731283 (10%) j.u.HashMap
> 339,906K (1.2%) empty 3625374 (38%) j.u.HashSet
> 282,265K (1.0%) empty 3985194 (25%) j.u.HashMap
> 276,279K (1.0%) 1-elem 3926377 (55%) j.u.ArrayList
> 163,534K (0.6%) small 572788 (3%) j.u.HashMap
> 138,729K (0.5%) small 574613 (6%) j.u.HashSet
> 116,041K (0.4%) small 2472638 (35%) j.u.ArrayList
> ===================================================
> 10. REFERENCE CHAINS FOR BAD COLLECTIONS
> Expensive data fields:
> 901,846K (3.1%): j.u.HashMap: 1727607 / 27% of 1-elem 458,895K (1.6%),
> 3984640 / 62% of empty 280,170K (1.0%), 570069 / 8% of small 162,780K (0.6%)
> <-- org.apache.sentry.hdfs.HMSPaths$Entry.children
> 656,117K (2.3%): j.u.TreeSet: 3941248 / 98% of 1-elem 646,611K (2.2%)
> <-- org.apache.sentry.hdfs.HMSPaths$Entry.authzObjs
> ...
> {code}
> That is, in the permanent Sentry data structures, 1-element
> HMSPaths$Entry.children tables and 1-element HMSPaths$Entry.authzObjs sets
> cause a noticeable overhead. We can optimize these data structures by
> replacing them with Objects and doing a trick like:
> {code}
> // Before:
> private List<Foo> fooList = new ArrayList<>();
>
> void addFoo(Foo foo) {
> fooList.add(foo);
> }
> // After, with an optimization for 0- and 1-size
> private Object fooObjOrList; // null initially
> void addFoo(Foo foo) {
> if (fooObjOrList == null) {
> fooObjOrList = foo;
> } else {
> if (fooObjOrList instanceof Foo) {
> List<Foo> fooList = new ArrayList<>();
> fooList.add((Foo) fooObjOrList);
> fooList.add(foo);
> fooObjOrList = fooList;
> } else {
> ((List) fooObjOrList).add(foo);
> }
> }
> }
> {code}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)