clintropolis opened a new pull request #10812:
URL: https://github.com/apache/druid/pull/10812


   ### Description
   
   This PR primarily does two things:
   * refactors the (as far as I can tell) currently unused Druid view system, in the hope that it can someday be built into something useful, by splitting views into a separate Calcite schema and introducing a new `ResourceType.VIEW` authorization construct to allow defining access to views separately from datasources.
   * refactors `SqlLifecycle` to authorize datasources (and now views) up 
front, before preparing or planning a query, by analyzing the SQL expression 
directly rather than waiting until after the transformation from SQL to native 
Druid query is done.
   
   _Note that this PR does not introduce or propose a view management system at this time; that exercise is left for future work. What is going on here is just refactoring of some internal stuff that might someday be used for something cool._
   
   ### Views
   Instead of living in `DruidSchema`, a new `ViewSchema` has been introduced to hold all Druid views. This is technically an incompatible change, since `druid.some_view` is now `view.some_view`, but since this isn't currently a real feature, I think it should not be problematic.
   
   Prior to this PR, authorization was performed against the final set of datasources that would be touched by the native Druid query, and it was done after planning had completed (and done twice: once in `SqlLifecycle` and again in `QueryLifecycle`). However, this is not appropriate for views, as it imposes the overly strict requirement that to query a view you must also be authorized to read from all underlying datasources, which of course precludes scenarios like providing a restricted view onto a larger underlying table as a means of controlling access.
   
   To remedy this, a new `ResourceType.VIEW` construct has been introduced, and SQL query authorization is now done against the set of `Resource` objects of type `ResourceType.DATASOURCE` and `ResourceType.VIEW` that are used in the query. In this model, views are authorized more or less the same way as tables, just separately.
   
   Splitting the 'view' and 'druid' schemas also makes any theoretical view manager easier to implement, because it would only have to worry about name collisions with other views rather than with Druid datasources, and it requires no other changes to Druid, unlike a real solution cohabiting the same schema would.
   
   ### SqlLifecycle and DruidPlanner
   Since authorization required a refactor to support independent view authorization anyway, I took the opportunity to rework a bit of how `SqlLifecycle` functions, changing the order of the state transitions to move authorization earlier in the process. The new flow is:
   ```
   new -> initialized -> authorizing -> authorized -> (unauthorized | planned 
-> executing -> done)
   ```
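
   Reading that flow literally, one hypothetical way to encode it is an enum with an explicit transition table (a simplified sketch, not the actual `SqlLifecycle` code):
   ```java
   import java.util.EnumMap;
   import java.util.EnumSet;
   import java.util.Map;
   import java.util.Set;

   public class LifecycleStateSketch
   {
     // Simplified mirror of the states named in the flow above.
     enum State { NEW, INITIALIZED, AUTHORIZING, AUTHORIZED, UNAUTHORIZED, PLANNED, EXECUTING, DONE }

     // Legal transitions: authorization happens before any planning, and
     // UNAUTHORIZED is a terminal short-circuit.
     static final Map<State, Set<State>> TRANSITIONS = new EnumMap<>(State.class);
     static {
       TRANSITIONS.put(State.NEW, EnumSet.of(State.INITIALIZED));
       TRANSITIONS.put(State.INITIALIZED, EnumSet.of(State.AUTHORIZING));
       TRANSITIONS.put(State.AUTHORIZING, EnumSet.of(State.AUTHORIZED));
       TRANSITIONS.put(State.AUTHORIZED, EnumSet.of(State.UNAUTHORIZED, State.PLANNED));
       TRANSITIONS.put(State.PLANNED, EnumSet.of(State.EXECUTING));
       TRANSITIONS.put(State.EXECUTING, EnumSet.of(State.DONE));
       TRANSITIONS.put(State.UNAUTHORIZED, EnumSet.noneOf(State.class));
       TRANSITIONS.put(State.DONE, EnumSet.noneOf(State.class));
     }

     // A transition is permitted only if the table lists it.
     static boolean canTransition(State from, State to)
     {
       return TRANSITIONS.get(from).contains(to);
     }
   }
   ```
   The point of an explicit table like this is that illegal orderings, such as planning before authorization has happened, simply have no entry and can be rejected.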
   To collect the set of `Resource` objects to authorize, the `DruidPlanner` that `SqlLifecycle` wraps has a new method, `validateAndCollectResources`, which returns a `ResourceResult` containing the set of all views and datasources identified in the query. Mechanically, this set is constructed using a `SqlShuttle` that walks the SQL expression tree, examining each `SqlIdentifier` that the validator associates with a table to look up whether it is a 'druid' datasource or a 'view', and constructing the correct `Resource` accordingly.
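
   Very loosely, the collection step amounts to a recursive walk of this shape (the real code uses Calcite's `SqlShuttle` and the validator's metadata; the toy tree and names here are made up for illustration):
   ```java
   import java.util.HashSet;
   import java.util.List;
   import java.util.Set;

   public class ResourceCollectorSketch
   {
     // Toy stand-in for a SQL parse tree node: either a table reference
     // (tableName != null) or an operator over child nodes.
     static final class Node
     {
       final String tableName;
       final List<Node> children;

       Node(String tableName, List<Node> children)
       {
         this.tableName = tableName;
         this.children = children;
       }

       static Node table(String name) { return new Node(name, List.of()); }
       static Node op(Node... children) { return new Node(null, List.of(children)); }
     }

     // knownViews is a hypothetical stand-in for the validator's knowledge of
     // which names resolve to views; everything else is treated as a datasource.
     static Set<String> collect(Node root, Set<String> knownViews)
     {
       Set<String> resources = new HashSet<>();
       walk(root, knownViews, resources);
       return resources;
     }

     private static void walk(Node node, Set<String> knownViews, Set<String> out)
     {
       if (node.tableName != null) {
         String type = knownViews.contains(node.tableName) ? "VIEW" : "DATASOURCE";
         out.add(type + ":" + node.tableName);
       }
       for (Node child : node.children) {
         walk(child, knownViews, out);
       }
     }
   }
   ```
   A query joining a datasource to a view would thus yield one `DATASOURCE` resource and one `VIEW` resource, each authorized on its own terms.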
   
   The added `DruidPlannerResourceAnalyzeTest` has some tests that try to trip up the `SqlResourceCollectorShuttle`. It looks good so far in my testing, but I'm sure there are scenarios I'm not thinking of. If anyone is worried about this new approach to resolving the resources to authorize, I can introduce a config option to retain the previous check against the old set of datasource names after planning (though I would rather not do this if possible).
   
   Authorizing before planning should also help avoid wasting effort planning queries (which can be non-trivial, depending on complexity) that would later fail to execute due to an authorization failure.
   
   I think there are some further improvements that can be made to `SqlLifecycle` and `DruidPlanner`, but I have not made those changes in this PR to keep it from growing too large. For example, a lot of repeated parsing/validation work is done between the phases of the lifecycle that could probably be re-used.
   
   Finally, I tried to fill out some of the Javadoc in this area, since it was kind of lacking; hopefully it's better now.
   
   <hr>
   
   This PR has:
   - [x] been self-reviewed.
      - [x] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
   - [ ] added documentation for new or modified features or behaviors.
   - [x] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/licenses.yaml)
   - [ ] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   
   

