Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13069 )

Change subject: [authz] new SentryAuthzProvider's caching strategy
......................................................................


Patch Set 4:

(6 comments)

Another partial review, will do a fuller pass, but posting in case my comment 
on the commit message is helpful to other reviewers.

http://gerrit.cloudera.org:8080/#/c/13069/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13069/4//COMMIT_MSG@8
PS4, Line 8:
           : This patch updates the way how the privilege cache in
           : SentryAuthzProvider is populated.  Prior to this patch, only one 
entry
           : per sanitized Sentry's response was created.  With this patch,
           : a response may be split into two entries: one contains SERVER- and
           : DATABASE-scope privileges, and another contains TABLE- and 
COLUMN-scope
           : privileges.  Of course, it also changes the lookup process: now 
it's
           : necessary to search for two entries in the cache if looking up for
           : an entry with privileges for an authorizable of the TABLE scope.
           :
           : The new caching strategy leverages the fact that Sentry includes
           : information on privileges granted on authorizables of higher scopes
           : in the hierarchy, if any.  The new strategy is beneficial in cases
           : when a user has privileges granted on DATABASE.  In that case, once
           : there was a request to authorize an action on a table or a column
           : of that table, next request to authorize an action on the database
           : itself will hit the cache, avoiding an extra RPC sent to Sentry.
           : Another example that benefits from the new caching scheme are
           : scenarios like AuthorizeDropTable(tableA) followed by
           : AuthorizeCreateTable(tableA).
There are a bunch of details in here, but I think it'd be easier to follow, and 
in turn, review the rest of this patch, with roughly the following structure:
1. some context explaining what was wrong with the existing caching strategy
2. a concrete example depicting what was bad
3. what is the high-level approach that fixes this problem?
4. how the problem in #2 will work now.
5. other implications/examples, if any seem appropriate.

Here's my attempt based on my understanding of this patch (feel free to correct 
anything that's off):

The API that Kudu uses to fetch privileges from Sentry will return privileges 
corresponding to all ancestors and descendants of the requested authorizable in 
Sentry's authorizable scope hierarchy tree. For instance, if requesting 
database-level privileges on "s.d", the API will return privileges on "s", 
"s.d", privileges on all tables within "s.d", and privileges on all columns 
therein. This is unsettling because a single database may have many tables and 
columns, and we would currently end up caching all of them, even those that may 
not be tables and columns stored in Kudu.

Additionally, the existing cache-lookup policy is very simple and doesn't allow 
the usage of usable privileges that exist in the cache where it could. For 
instance, the existing policy might store the following in the cache:

 { /s/d/t => ["metadata" on /s, "create" on /s/d, "insert" on /s/d/t, "select" 
on /s/d/t/c] }

A cache lookup for privileges on /s/d would yield an expensive database-scoped 
request to Sentry, even though, if we were smarter about how we structured our 
privileges in cache, we could return the cached database-scoped privilege.

This patch addresses these problems by splitting the cached privileges into two 
entries: those for database-scope and above (keyed by database), and those for 
table-scope and below (keyed by table). For cache lookups for a given 
authorizable, this allows us to use existing privileges for higher scopes from 
other authorizables where possible, avoiding expensive calls to Sentry for 
server- or database-scoped privileges. Now, for the same set of privileges, the 
cache will store:

 { /s/d => ["metadata" on /s, "create" on /s/d],
   /s/d/t => ["insert" on /s/d/t, "select" on /s/d/t/c] }

With the above in cache, a lookup on /s/d would thus be able to leverage the 
existing entry. The trade-off here is that lookups on /s/d/t will now require 
two lookups, one on /s/d and one on /s/d/t, to return the complete lists of 
privileges in /s/d/t's hierarchy tree.

Why database and above, and table and below? Why not separate them further? 
Based on our current privilege model, we only expect to request privileges at 
the database scope (e.g. for CreateTable) or table scope (e.g. OpenTable), so 
this separation seems natural while limiting the number of cache lookups as 
much as possible.

As a further optimization on cache size, because there aren't constraints that 
the privileges in Sentry when requesting privileges at the database scope 
correspond to things stored in Kudu, we will now ignore privileges below the 
database scope when making such requests when caching.
(maybe this belongs in a separate patch)


http://gerrit.cloudera.org:8080/#/c/13069/2/src/kudu/master/sentry_privileges_fetcher.cc
File src/kudu/master/sentry_privileges_fetcher.cc:

http://gerrit.cloudera.org:8080/#/c/13069/2/src/kudu/master/sentry_privileges_fetcher.cc@541
PS2, Line 541: HECK(kScopeDabase.Implies(scope));
             :   // Not expecting requests for authorizables of scope narrower 
than TABLE.
             :   DCHECK(scope.Implies(kScopeTable));
             :   // Do not query Sentry for authz scopes narrower than 'TABLE'.
             :   const auto& key = aggregate_key.GetFlattened
> No, it does not -- those are queries to the cache.
Oof, yeah pretty obvious in retrospect. Rushed code review is bad code review.


http://gerrit.cloudera.org:8080/#/c/13069/2/src/kudu/master/sentry_privileges_fetcher.cc@610
PS2, Line 610: omes for an authorizable of the TABLE scope.
> OK, so far I used Implies(), but I'm thinking of adding comparison operatio
Yeah, I feel you. Maybe it's worth just being explicit and doing 
requested_scope == TABLE || requested_scope == COLUMN


http://gerrit.cloudera.org:8080/#/c/13069/4/src/kudu/master/sentry_privileges_fetcher.cc
File src/kudu/master/sentry_privileges_fetcher.cc:

http://gerrit.cloudera.org:8080/#/c/13069/4/src/kudu/master/sentry_privileges_fetcher.cc@534
PS4, Line 534: scope(requested_scope);
nit: would you mind reversing the names so that we use requested_scope 
throughout? That way it's explicitly obvious when reading below what scope 
we're referring to.


http://gerrit.cloudera.org:8080/#/c/13069/4/src/kudu/master/sentry_privileges_fetcher.cc@539
PS4, Line 539: // TODO(aserbin): uncomment when it's guaranteed we are not 
getting
             :   //                requests for authorizables of scope wider 
than DATABASE.
I think this is already true; we never request privileges at the server scope.


http://gerrit.cloudera.org:8080/#/c/13069/4/src/kudu/master/sentry_privileges_fetcher.cc@545
PS4, Line 545: kScopeTable.scope())
nit: I like these in that it means we don't have to do annoying things like 
write out SentryAuthorizableScope(SentryAuthorizableScope::TABLE) all the time, 
and I know it should be obvious, but when pulling out the enum like this, IMO 
this takes extra cycles to read. Could we keep SentryAuthorizableScope::TABLE 
here and L617 and L637?



--
To view, visit http://gerrit.cloudera.org:8080/13069
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id96181345e357a104e28314d8d8d88633dcf9608
Gerrit-Change-Number: 13069
Gerrit-PatchSet: 4
Gerrit-Owner: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Hao Hao <hao....@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Sat, 20 Apr 2019 07:30:54 +0000
Gerrit-HasComments: Yes

Reply via email to