paul-rogers opened a new issue, #13837:
URL: https://github.com/apache/druid/issues/13837

   MSQ provides a powerful way to ingest data into Druid using SQL, placing 
ingestion in the hands of more users. Prior to MSQ, only those users with 
sufficient knowledge to use a batch ingestion spec could ingest data. With the 
wider audience comes the need to add protections to prevent users from 
accessing data that doesn't make sense in a particular Druid deployment. This 
proposal outlines an extension to the Druid security model to address 
this issue. As it turns out, the Druid catalog feature has similar needs, which 
we also address.
   
   ## Input Source Security in Druid Today
   
   Let us start by reviewing the current security model. Druid's model has two 
parts:
   
   * A "resource action" (class `ResourceAction`) which is essentially a triple 
of (category, name, read/write).
   * An "authorizer mapper" (class `AuthorizerMapper`) which provides a 
yes/no answer to the question, "does the user have access to this set of 
resource actions?"
   
   In MSQ today, there is a single resource action: `(EXTERNAL, EXTERNAL, 
READ)` which was introduced along with the `extern` table function. That is, 
that single permission allows the user to read data from S3, from the local 
disk, from HTTP, etc.
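
   To make the current model concrete, here is a minimal Java sketch of the 
triple and the yes/no check. The names `Action`, `AccessCheck` and 
`CurrentModel`, and the shape of `ResourceAction` shown here, are illustrative 
stand-ins chosen for this example; they are not Druid's actual classes or 
signatures.

```java
import java.util.Set;

// Hypothetical stand-ins for illustration only; Druid's real classes differ.
enum Action { READ, WRITE }

// The (category, name, read/write) triple described above.
record ResourceAction(String category, String name, Action action) {}

// Answers yes/no: does the user hold every requested resource action?
interface AccessCheck {
    static boolean isAllowed(Set<ResourceAction> granted, Set<ResourceAction> required) {
        return granted.containsAll(required);
    }
}

// The single resource action checked today for any external input source.
interface CurrentModel {
    ResourceAction EXTERNAL_READ = new ResourceAction("EXTERNAL", "EXTERNAL", Action.READ);
}
```

   In this sketch, a user who holds `CurrentModel.EXTERNAL_READ` passes the 
check for every input source, which is exactly the gap discussed next.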
   
   The key gap in this model is the lack of fine-grained control over which input 
sources to allow. (A site may want to allow access to S3, but not the local 
file system on each ingest node.)
   
   ## Proposed Input Source Security Model
   
   The proposed model extends the current system by allowing additional values 
in the second (name) element of the resource action. Specifically, the name is 
the JSON type name of the input source, so that access to the HTTP input source 
would be checked as `(EXTERNAL, http, READ)`. The same applies to all other 
Druid input sources.
   
   MSQ currently provides multiple ways to access a given input source: via the 
`extern` function, or via the newer input-source-specific functions such as 
`http` or `localfiles`. By applying security at the level of the input source, 
the same security rules apply regardless of the function used to access the 
input source.
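
   The idea can be sketched as follows. This is a simplified illustration, not 
the proposed implementation: `InputSourceSecurity`, `forInputSource` and 
`canRead` are hypothetical names, and the `ResourceAction` record is a 
stand-in for Druid's class. The point is that the check depends only on the 
input source's JSON type name, so any table function that resolves to the same 
input source reaches the same check.

```java
import java.util.Set;

// Hypothetical stand-ins for illustration only; Druid's real classes differ.
enum Action { READ, WRITE }

record ResourceAction(String category, String name, Action action) {}

interface InputSourceSecurity {
    // Builds the proposed resource action from an input source's JSON type name.
    static ResourceAction forInputSource(String jsonTypeName) {
        return new ResourceAction("EXTERNAL", jsonTypeName, Action.READ);
    }

    // True only if the user was granted READ on that specific input source.
    static boolean canRead(Set<ResourceAction> granted, String jsonTypeName) {
        return granted.contains(forInputSource(jsonTypeName));
    }
}
```

   A site that grants only `(EXTERNAL, s3, READ)` would then allow S3 reads 
while rejecting reads from the local file system, whichever function the query 
uses.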
   
   ## Security Model for External Tables Defined in the Catalog
   
   The Druid catalog provides the ability to define an external table: a 
metadata entry that can be used in MSQ queries in lieu of spelling out the 
input source details in each query. The catalog already uses security rules 
that mimic Druid datasources. To access an external table the user needs 
permission on `(ext, <table name>, READ)`, where `ext` is a new Druid schema 
introduced to hold external table definitions.
   
   We propose to extend the security model to _also_ require permission on the 
underlying input source type. Thus, if `myS3` is an external table that 
accesses S3, then the user needs both `(ext, myS3, READ)` and `(EXTERNAL, s3, 
READ)` permissions.
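
   A rough sketch of the combined check, under the assumption that a catalog 
entry knows its table name and the JSON type name of its input source. The 
`ExternalTable` record and its `requiredPermissions` method are hypothetical 
and exist only for this example.

```java
import java.util.Set;

// Hypothetical stand-ins for illustration only; Druid's real classes differ.
enum Action { READ, WRITE }

record ResourceAction(String category, String name, Action action) {}

// Hypothetical catalog entry: an external table name plus its input source type.
record ExternalTable(String name, String inputSourceType) {
    // Both permissions are required: one on the table, one on the input source.
    Set<ResourceAction> requiredPermissions() {
        return Set.of(
            new ResourceAction("ext", name, Action.READ),
            new ResourceAction("EXTERNAL", inputSourceType, Action.READ));
    }
}
```

   With this shape, a user who holds only `(ext, myS3, READ)` would be 
rejected, because the grant set does not contain both required actions.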
   
   ## Security Model for Custom MSQ Table Functions
   
   A great strength of Druid is the ability to add new input sources and "user 
defined" SQL functions. These can be combined to provide new MSQ table 
functions for new input sources. The simplest such function is a wrapper. 
Suppose I want to ingest data from 
[IPFS](https://en.wikipedia.org/wiki/InterPlanetary_File_System). I would first 
write the input source, then provide an `ipfs` table function which would check 
the required `(EXTERNAL, ipfs, READ)` permissions. (Note that the "ipfs" in the 
resource action is the name of the input source, not the table function.)
   
   In another scenario, I might write a specialized table function to access an 
application's staging area. That staging area is on S3, but the user need not 
know that. (We might later change it to GCP.) In this case, the function would 
create a "virtual" input source, say, "abc-staging" and permissions would be 
granted on that name: `(EXTERNAL, abc-staging, READ)`. The extension could use 
S3 internally, but that's an implementation detail.
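
   The staging-area scenario might look like the sketch below. 
`StagingTableFunction` and its members are hypothetical names invented for 
this example; the only point is that the permission is keyed on the virtual 
name, not on whatever storage backs it.

```java
// Hypothetical stand-ins for illustration only; Druid's real classes differ.
enum Action { READ, WRITE }

record ResourceAction(String category, String name, Action action) {}

// Hypothetical table function for the staging area. The backing input source
// type is an implementation detail that never appears in the permission.
record StagingTableFunction(String backingInputSourceType) {
    static final String VIRTUAL_SOURCE_NAME = "abc-staging";

    ResourceAction requiredPermission() {
        return new ResourceAction("EXTERNAL", VIRTUAL_SOURCE_NAME, Action.READ);
    }
}
```

   Swapping the backing store leaves the required permission unchanged, so 
existing grants keep working.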
   
   ## Code Revisions
   
   Because we are adding a security change, we must ensure _all_ of Druid 
honors the new rules, not just MSQ. An outline of the changes:
   
   * Modify each existing SQL table function to use `(EXTERNAL, <input source>, 
READ)` instead of `(EXTERNAL, EXTERNAL, READ)`. Note: this could be a breaking 
change for existing users.
   * Extend the catalog code to add the input source security check in addition 
to the external table check, as noted above.
   * Extend native batch ingest specs that use input sources to apply the new 
security checks.
   * Determine how to apply the rules to native batch ingest specs that use 
firehose factories.
   * Determine how to apply the rules to Hadoop ingest specs that use Hadoop FS 
paths.
   * Determine how to apply the rules to the various flavors of realtime ingest.
   
   ## Documentation and Release Notes
   
   Documentation must explain the new permissions and security models. (Start 
with the text above.)
   
   Release notes should announce when this model is available. The note must 
clearly state that if customers are granting permissions on the existing MSQ 
`(EXTERNAL, EXTERNAL, READ)` resource action, the rules should instead grant 
access on `(EXTERNAL, *, READ)` to avoid ingestion failures.
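
   The reason for the release note can be illustrated with a small sketch. 
The `Grant` record and its literal `*` matching are assumptions made for this 
example; how a wildcard is actually written and matched depends on the 
authorizer in use.

```java
// Hypothetical stand-ins for illustration only; Druid's real classes differ.
enum Action { READ, WRITE }

record ResourceAction(String category, String name, Action action) {}

// Hypothetical grant whose name is either an exact name or the wildcard "*".
record Grant(String category, String namePattern, Action action) {
    boolean covers(ResourceAction required) {
        return category.equals(required.category())
            && action == required.action()
            && (namePattern.equals("*") || namePattern.equals(required.name()));
    }
}
```

   Under the new model, an old grant on `(EXTERNAL, EXTERNAL, READ)` would not 
cover a check for `(EXTERNAL, s3, READ)`, while a wildcard grant would.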

