paul-rogers opened a new issue, #13837: URL: https://github.com/apache/druid/issues/13837
MSQ provides a powerful way to ingest data into Druid using SQL, placing ingestion in the hands of more users. Prior to MSQ, only users who knew enough to write a batch ingestion spec could ingest data. With the wider audience comes the need for protections that prevent users from accessing data that doesn't make sense in a particular Druid deployment. This proposal outlines an extension to the Druid security model to address this issue. As it turns out, the Druid catalog feature has similar needs, which we also address.

## Input Source Security in Druid Today

Let us start by reviewing the current security model. Druid's model has two parts:

* A "resource action" (class `ResourceAction`), which is essentially a triple of (category, name, read/write).
* An "authenticator mapper" (class `AuthenticatorMapper`), which provides a yes/no answer to the question, "Does the user have access to this set of resource actions?"

In MSQ today, there is a single resource action, `(EXTERNAL, EXTERNAL, READ)`, which was introduced along with the `extern` table function. That single permission allows the user to read data from S3, from the local disk, from HTTP, etc. The key gap in this model is the lack of fine-grained control over which input sources to allow. (A site may want to allow access to S3, but not to the local file system on each ingest node.)

## Proposed Input Source Security Model

The proposed model extends the current system by allowing additional values in the second (name) element of the resource action. Specifically, the name becomes the JSON type name of the input source, so that access to the HTTP input source would be checked as `(EXTERNAL, http, READ)`, and similarly for all other Druid input sources.

MSQ currently provides multiple ways to access a given input source: via the `extern` function, or via the newer input-source-specific functions such as `http` or `localfiles`.
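
The per-input-source check described here can be illustrated with a minimal Python sketch. It does not use Druid's actual classes: `ResourceAction` is a simplified stand-in, the helper names are hypothetical, and the mapping from table function name to input source type (including `local` as the type name behind `localfiles`) is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceAction:
    """Simplified stand-in for the (category, name, read/write) triple."""
    category: str
    name: str
    action: str


def required_action(input_source_type: str) -> ResourceAction:
    # Proposed model: the name element is the input source's JSON type name.
    return ResourceAction("EXTERNAL", input_source_type, "READ")


# Hypothetical mapping from table function name to input source type.
# "local" as the type behind `localfiles` is an assumption.
FUNCTION_TO_INPUT_SOURCE = {"http": "http", "localfiles": "local"}


def action_for_extern(input_source_spec: dict) -> ResourceAction:
    # `extern` receives the input source as JSON; the "type" field decides.
    return required_action(input_source_spec["type"])


def action_for_function(function_name: str) -> ResourceAction:
    # Input-source-specific functions resolve to the same resource action.
    return required_action(FUNCTION_TO_INPUT_SOURCE[function_name])


def is_authorized(granted: set, needed: ResourceAction) -> bool:
    return needed in granted


# A site that allows S3 and HTTP, but not the local file system.
granted = {required_action("s3"), required_action("http")}

via_extern = action_for_extern(
    {"type": "http", "uris": ["https://example.com/data.csv"]}
)
via_function = action_for_function("http")
local_files = action_for_function("localfiles")
```

In this sketch, `via_extern` and `via_function` are the same resource action, so one grant covers both routes, while `local_files` is not in `granted` and would be rejected.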
By applying security at the level of the input source, the same security rules apply regardless of the function used to access the input source.

## Security Model for External Tables Defined in the Catalog

The Druid catalog provides the ability to define an external table: a metadata entry that can be used in MSQ queries in lieu of spelling out the input source details in each query. The catalog already uses security rules that mimic those for Druid datasources. To access an external table, the user needs permission on `(ext, <table name>, READ)`, where `ext` is a new Druid schema introduced to hold external table definitions.

We propose to extend the security model to _also_ require permission on the underlying input source type. Thus, if `myS3` is an external table that accesses S3, then the user needs both the `(ext, myS3, READ)` and `(EXTERNAL, s3, READ)` permissions.

## Security Model for Custom MSQ Table Functions

A great strength of Druid is the ability to add new input sources and "user defined" SQL functions. These can be combined to provide new MSQ table functions for new input sources.

The simplest such function is a wrapper. Suppose I want to ingest data from [IPFS](https://en.wikipedia.org/wiki/InterPlanetary_File_System). I would first write the input source, then provide an `ipfs` table function which would check the required `(EXTERNAL, ipfs, READ)` permission. (Note that the "ipfs" in the resource action is the name of the input source, not the name of the table function.)

In another scenario, I might write a specialized table function to access an application's staging area. That staging area is on S3, but the user need not know that. (We might later change it to GCP.) In this case, the function would create a "virtual" input source, say, "abc-staging", and permissions would be granted on that name: `(EXTERNAL, abc-staging, READ)`. The extension could use S3 internally, but that's an implementation detail.
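
To make the two cases above concrete, here is a small Python sketch of which resource actions each would require. It models resource actions as plain tuples and uses hypothetical helper names; it is not Druid code, only an illustration of the rules as proposed.

```python
EXTERNAL = "EXTERNAL"
EXT_SCHEMA = "ext"
READ = "READ"


def actions_for_catalog_table(table_name: str, input_source_type: str) -> set:
    # Catalog external table: permission on the table itself AND on the
    # underlying input source type.
    return {
        (EXT_SCHEMA, table_name, READ),
        (EXTERNAL, input_source_type, READ),
    }


def actions_for_virtual_source(virtual_name: str) -> set:
    # Custom table function with a "virtual" input source: only the virtual
    # name is checked; the storage behind it stays an implementation detail.
    return {(EXTERNAL, virtual_name, READ)}


def is_authorized(granted: set, needed: set) -> bool:
    # Every needed action must be granted.
    return needed <= granted


needed_for_my_s3 = actions_for_catalog_table("myS3", "s3")
needed_for_staging = actions_for_virtual_source("abc-staging")

table_only = {(EXT_SCHEMA, "myS3", READ)}
table_and_source = table_only | {(EXTERNAL, "s3", READ)}
staging_only = {(EXTERNAL, "abc-staging", READ)}
```

Under these assumptions, a user holding only `table_only` cannot read `myS3`, a user holding `table_and_source` can, and a user holding `staging_only` can use the staging function without any grant on `s3`.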

## Code Revisions

Because we are adding a security change, we must ensure that _all_ of Druid honors the new rules, not just MSQ. An outline of the changes:

* Modify each existing SQL table function to use `(EXTERNAL, <input source>, READ)` instead of `(EXTERNAL, EXTERNAL, READ)`. Note: this could be a breaking change for existing users.
* Extend the catalog code to add the input source security check in addition to the external table check, as noted above.
* Extend native batch ingest specs that use input sources to apply the new security checks.
* Determine how to apply the rules to native batch ingest specs that use firehose factories.
* Determine how to apply the rules to Hadoop ingest specs that use Hadoop FS paths.
* Determine how to apply the rules to the various flavors of realtime ingest.

## Documentation and Release Notes

Documentation must explain the new permissions and security models. (Start with the text above.)

Release notes should announce when this model is available. The note must clearly state that customers who currently grant permissions on the existing MSQ `(EXTERNAL, EXTERNAL, READ)` resource action should instead grant access on `(EXTERNAL, *, READ)` to avoid ingestion failures.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
