clintropolis opened a new pull request #10812:
URL: https://github.com/apache/druid/pull/10812
### Description
This PR primarily does two things:
* refactors the (afaict) currently unused Druid view system in the hopes
that someday it can be built into something useful, by splitting into a
separate calcite schema and introducing a new `ResourceType.VIEW` authorization
construct to allow defining access to views separately from datasources.
* refactors `SqlLifecycle` to authorize datasources (and now views) up
front, before preparing or planning a query, by analyzing the SQL expression
directly rather than waiting until after the transformation from SQL to native
Druid query is done.
_Note that this PR does not introduce or propose a view management system at
this time, that exercise is left for future work, what is going on here is just
refactoring some internal stuff that might someday be used for something cool._
### Views
Instead of living in `DruidSchema`, a new `ViewSchema` has been introduced
to hold all Druid views. This is technically an incompatible change, since
`druid.some_view` is now `view.some_view`, but since this isn't currently a
real feature I think this should not be problematic.
Prior to this PR, authorization was performed against the final set of
datasources which would be touched by the native Druid query, and done after
planning has been completed (and done twice, once in `SqlLifecycle` and also
again in `QueryLifecycle`). However, this is not appropriate for views, as it
has an overly strict requirement that to query a view, you must also be
authorized to read from all underlying datasources, which of course precludes
scenarios like providing a restricted view onto a larger underlying table as
a means to control access.
To remedy this, a new `ResourceType.VIEW` construct has been introduced, and
SQL query authorization is now done against the set of `Resource` of
`ResourceType.DATASOURCE` and `ResourceType.VIEW` which are utilized in the
query. In this model views are more or less authorized the same as tables, just
separately.
Splitting the 'view' and 'druid' schema also makes any theoretical view
manager implementation easier to implement because it would not have to worry
about name collisions with Druid datasources, just other views, and requires no
other changes to Druid like a real solution cohabitating the same schema would
require.
### SqlLifecycle and DruidPlanner
Since authorization required a refactor to support independent view
authorization anyway, I took the opportunity to rework a bit how `SqlLifecycle`
functions, changing the order of the state transitions to move authorization
earlier in the process, which is now modeled by this new flow:
```
new -> initialized -> authorizing -> authorized -> (unauthorized | planned
-> executing -> done)
```
To collect the set of `Resource` to authenticate, the `DruidPlanner` which
`SqlLifecycle` wraps, has a new method `validateAndCollectResources` which
returns a `ResourceResult` containing the set of all views and datasources that
were identified in the query. Mechanically this list is constructed using a
`SqlShuttle` which walks the SQL expression tree to examine `SqlIdentifier`
which the validator associates with a table identifier to lookup whether its a
'druid' or a 'view', and construct the correct `Resource` accordingly.
The added `DruidPlannerResourceAnalyzeTest` has some tests which are trying
to trip up the `SqlResourceCollectorShuttle`, it looks good so far in my
testing but I'm sure there are scenarios I'm not thinking of though. If anyone
is worried about this new approach to resolving the resources to authorize, I
can introduce a config option to retain the previous check against the old set
of datasource names after planning (though I would rather not do this if
possible).
Authorizing before planning should also help reduce some potential waste
spent planning queries, which can be non-trivial depending on complexity, and
then would later not be able to execute due to authorization failure.
I think there are some further improvements that can be made with
`SqlLifecycle` and `DruidPlanner`, but have not done these changes in this PR
to keep it from growing too large. For example, there is a lot of repeated
parsing/validation work done between the phases of the lifecycle, which could
probably be re-used.
Finally, I tried to fill out some of the javadoc in this area, since it was
kind of lacking, hopefully it's better.
<hr>
This PR has:
- [x] been self-reviewed.
- [x] using the [concurrency
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
(Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [x] added Javadocs for most classes and all non-trivial methods. Linked
related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in
[licenses.yaml](https://github.com/apache/druid/blob/master/licenses.yaml)
- [ ] added comments explaining the "why" and the intent of the code
wherever would not be obvious for an unfamiliar reader.
- [ ] added unit tests or modified existing tests to cover new code paths,
ensuring the threshold for [code
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]