Misha Dmitriev created SENTRY-1892:
--------------------------------------
Summary: Reduce memory consumption of HMSPath$Entry and TPathEntry
Key: SENTRY-1892
URL: https://issues.apache.org/jira/browse/SENTRY-1892
Project: Sentry
Issue Type: Improvement
Components: Hdfs Plugin
Reporter: Misha Dmitriev
Assignee: Misha Dmitriev
We recently analyzed with jxray (www.jxray.com) some heap dumps from NameNode
running in a big HDFS installation with Sentry enabled. One dump is
particularly interesting, because it was taken when a full Sentry update was in
progress. Because of it, used heap was at its maximum: there were both the old
HMSPath$Entry tree of objects in memory, and the data for the new one in
TPathEntry objects.
The old and new Sentry-related data take a pretty large portion of the heap,
7.9% and 12.9% respectively:
{code}
---- Object tree for GC root(s) Java Local@7f9c9a0b7808
(org.apache.sentry.hdfs.SentryAuthorizationInfo) ----
2,302,963K (7.9%) (1 of org.apache.sentry.hdfs.SentryAuthorizationInfo)
<-- Java Local@7f9c9a0b7808
(org.apache.sentry.hdfs.SentryAuthorizationInfo)
....
---- Object tree for GC root(s) Java Local@7f9c2b9138c8
(org.apache.sentry.hdfs.service.thrift.TPathsDump) ----
3,760,229K (12.9%) (1 of org.apache.sentry.hdfs.service.thrift.TPathsDump)
<-- Java Local@7f9c2b9138c8
(org.apache.sentry.hdfs.service.thrift.TPathsDump)
...
{code}
This is a very considerable portion of the heap. Furthermore, the second
portion - the data in TPathsDump - is mostly temporary, and creates a big
memory spike, many extra GC pauses, and in the worst case may cause a crash due
to OOM. Thus it's very desirable to reduce memory used by these data structures.
It appears that some of the data structures used here are suboptimal in terms
of memory. Here is the list of things that can be fixed:
1. TPathEntry.children and TPathEntry.authzObjs are both defined as sets in
sentry_hdfs_service.thrift. In the Java code, they become HashSets. However, no
real set operations (check for element, add element...) are used on them.
Rather, they are used as simple lists, from which the respective data
structures in HMSPaths$Entry are initialized. HashSets are very ineconomical in
terms of memory, because they reuse HashMap code, and one HashMap$Entry object,
taking 32-48 bytes, is created for each hash element. From the class histogram
in the dump, HashSets are taking 5.8% of the heap. Thus if we replace sets with
lists in TPathEntry, we can reduce heap substantially.
2. JXRay analysis for suboptimal collections shows the following:
{code}
9. BAD COLLECTIONS
Total collections: 40,324,452 Bad collections: 26,076,002 Overhead:
3,361,873K (11.6%)
Top bad collections:
Ovhd Problem Num objs Type
-------------------------------------------------------
922,908K (3.2%) 1-elem 5133339 (54%) j.u.HashSet
646,707K (2.2%) 1-elem 3941834 (98%) j.u.TreeSet
459,824K (1.6%) 1-elem 1731283 (10%) j.u.HashMap
339,906K (1.2%) empty 3625374 (38%) j.u.HashSet
282,265K (1.0%) empty 3985194 (25%) j.u.HashMap
276,279K (1.0%) 1-elem 3926377 (55%) j.u.ArrayList
163,534K (0.6%) small 572788 (3%) j.u.HashMap
138,729K (0.5%) small 574613 (6%) j.u.HashSet
116,041K (0.4%) small 2472638 (35%) j.u.ArrayList
===================================================
10. REFERENCE CHAINS FOR BAD COLLECTIONS
Expensive data fields:
901,846K (3.1%): j.u.HashMap: 1727607 / 27% of 1-elem 458,895K (1.6%),
3984640 / 62% of empty 280,170K (1.0%), 570069 / 8% of small 162,780K (0.6%)
<-- org.apache.sentry.hdfs.HMSPaths$Entry.children
656,117K (2.3%): j.u.TreeSet: 3941248 / 98% of 1-elem 646,611K (2.2%)
<-- org.apache.sentry.hdfs.HMSPaths$Entry.authzObjs
...
{code}
That is, in the permanent Sentry data structures, 1-element
HMSPaths$Entry.children tables and 1-element HMSPaths$Entry.authzObjs sets
cause a noticeable overhead. We can optimize these data structures by replacing
them with Objects and doing a trick like:
{code}
// Before:
private List<Foo> fooList = new ArrayList<>();
void addFoo(Foo foo) {
fooList.add(foo);
}
// After, with an optimization for 0- and 1-size
private Object fooObjOrList; // null initially
void addFoo(Foo foo) {
if (fooObjOrList == null) {
fooObjOrList = foo;
} else {
if (fooObjOrList instanceof Foo) {
List<Foo> fooList = new ArrayList<>();
fooList.add((Foo) fooObjOrList);
fooList.add(foo);
fooObjOrList = fooList;
} else {
((List) fooObjOrList).add(foo);
}
}
}
{code}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)