Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/13069 )
Change subject: [authz] new SentryAuthzProvider's caching strategy ...................................................................... Patch Set 4: (6 comments) Another partial review, will do a fuller pass, but posting in case my comment on the commit message is helpful to other reviewers. http://gerrit.cloudera.org:8080/#/c/13069/4//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/13069/4//COMMIT_MSG@8 PS4, Line 8: : This patch updates the way how the privilege cache in : SentryAuthzProvider is populated. Prior to this patch, only one entry : per sanitized Sentry's response was created. With this patch, : a response may be split into two entries: one contains SERVER- and : DATABASE-scope privileges, and another contains TABLE- and COLUMN-scope : privileges. Of course, it also changes the lookup process: now it's : necessary to search for two entries in the cache if looking up for : an entry with privileges for an authorizable of the TABLE scope. : : The new caching strategy leverages the fact that Sentry includes : information on privileges granted on authorizables of higher scopes : in the hierarchy, if any. The new strategy is beneficial in cases : when a user has privileges granted on DATABASE. In that case, once : there was a request to authorize an action on a table or a column : of that table, next request to authorize an action on the database : itself will hit the cache, avoiding an extra RPC sent to Sentry. : Another example that benefits from the new caching scheme are : scenarios like AuthorizeDropTable(tableA) followed by : AuthorizeCreateTable(tableA). There are a bunch of details in here, but I think it'd be easier to follow, and in turn, review the rest of this patch, with roughly the following structure: 1. some context explaining what was wrong with the existing caching strategy 2. a concrete example depicting what was bad 3. what is the high-level approach that fixes this problem? 4. how the problem in #2 will work now. 5. other implications/examples, if any seem appropriate. Here's my attempt based on my understanding of this patch (feel free to correct anything that's off): The API that Kudu uses to fetch privileges from Sentry will return privileges corresponding to all ancestors and descendants of the requested authorizable in Sentry's authorizable scope hierarchy tree. For instance, if requesting database-level privileges on "s.d", the API will return privileges on "s", "s.d", privileges on all tables within "s.d", and privileges on all columns therein. This is unsettling because a single database may have many tables and columns, and we would currently end up caching all of them, even those that may not be tables and columns stored in Kudu. Additionally, the existing cache-lookup policy is very simple and doesn't allow the usage of usable privileges that exist in the cache where it could. For instance, the existing policy might store the following in the cache: { /s/d/t => ["metadata" on /s, "create" on /s/d, "insert" on /s/d/t, "select" on /s/d/t/c] } A cache lookup for privileges on /s/d would yield an expensive database-scoped request to Sentry, even though, if we were smarter about how we structured our privileges in cache, we could return the cached database-scoped privilege. This patch addresses these problems by splitting the cached privileges into two entries: those for database-scope and above (keyed by database), and those for table-scope and below (keyed by table). For cache lookups for a given authorizable, this allows us to use existing privileges for higher scopes from other authorizables where possible, avoiding expensive calls to Sentry for server- or database-scoped privileges. Now, for the same set of privileges, the cache will store: { /s/d => ["metadata" on /s, "create" on /s/d], /s/d/t => ["insert" on /s/d/t, "select" on /s/d/t/c] } With the above in cache, a lookup on /s/d would thus be able to leverage the existing entry. The trade-off here is that lookups on /s/d/t will now require two lookups, one on /s/d and one on /s/d/t, to return the complete lists of privileges in /s/d/t's hierarchy tree. Why database and above, and table and below? Why not separate them further? Based on our current privilege model, we only expect to request privileges at the database scope (e.g. for CreateTable) or table scope (e.g. OpenTable), so this separation seems natural while limiting the number of cache lookups as much as possible. As a further optimization on cache size, because there aren't constraints that the privileges in Sentry when requesting privileges at the database scope correspond to things stored in Kudu, we will now ignore privileges below the database scope when making such requests when caching. (maybe this belongs in a separate patch) http://gerrit.cloudera.org:8080/#/c/13069/2/src/kudu/master/sentry_privileges_fetcher.cc File src/kudu/master/sentry_privileges_fetcher.cc: http://gerrit.cloudera.org:8080/#/c/13069/2/src/kudu/master/sentry_privileges_fetcher.cc@541 PS2, Line 541: HECK(kScopeDabase.Implies(scope)); : // Not expecting requests for authorizables of scope narrower than TABLE. : DCHECK(scope.Implies(kScopeTable)); : // Do not query Sentry for authz scopes narrower than 'TABLE'. : const auto& key = aggregate_key.GetFlattened > No, it does not -- those are queries to the cache. Oof, yeah pretty obvious in retrospect. Rushed code review is bad code review. http://gerrit.cloudera.org:8080/#/c/13069/2/src/kudu/master/sentry_privileges_fetcher.cc@610 PS2, Line 610: omes for an authorizable of the TABLE scope. > OK, so far I used Implies(), but I'm thinking of adding comparison operatio Yeah, I feel you. Maybe it's worth just being explicit and doing requested_scope == TABLE || requested_scope == COLUMN http://gerrit.cloudera.org:8080/#/c/13069/4/src/kudu/master/sentry_privileges_fetcher.cc File src/kudu/master/sentry_privileges_fetcher.cc: http://gerrit.cloudera.org:8080/#/c/13069/4/src/kudu/master/sentry_privileges_fetcher.cc@534 PS4, Line 534: scope(requested_scope); nit: would you mind reversing the names so that we use requested_scope throughout? That way it's explicitly obvious when reading below what scope we're referring to. http://gerrit.cloudera.org:8080/#/c/13069/4/src/kudu/master/sentry_privileges_fetcher.cc@539 PS4, Line 539: // TODO(aserbin): uncomment when it's guaranteed we are not getting : // requests for authorizables of scope wider than DATABASE. I think this is already true; we never request privileges at the server scope. http://gerrit.cloudera.org:8080/#/c/13069/4/src/kudu/master/sentry_privileges_fetcher.cc@545 PS4, Line 545: kScopeTable.scope()) nit: I like these in that it means we don't have to do annoying things like write out SentryAuthorizableScope(SentryAuthorizableScope::TABLE) all the time, and I know it should be obvious, but when pulling out the enum like this, IMO this takes extra cycles to read. Could we keep SentryAuthorizableScope::TABLE here and L617 and L637? -- To view, visit http://gerrit.cloudera.org:8080/13069 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Id96181345e357a104e28314d8d8d88633dcf9608 Gerrit-Change-Number: 13069 Gerrit-PatchSet: 4 Gerrit-Owner: Alexey Serbin <aser...@cloudera.com> Gerrit-Reviewer: Alexey Serbin <aser...@cloudera.com> Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com> Gerrit-Reviewer: Hao Hao <hao....@cloudera.com> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Comment-Date: Sat, 20 Apr 2019 07:30:54 +0000 Gerrit-HasComments: Yes