Re: [DISCUSS] AIP-1 and Airflow multi-tenancy

Daniel Imberman Tue, 13 Apr 2021 11:46:14 -0700

Hi Ian,

Firstly, welcome to the Airflow community :). I'm glad to hear you've had a positive experience so far. It's great to hear that you want to contribute back, and I think that multi-tenancy/DAG isolation is a pretty fantastic project for the community as a whole (a lot of these are things we want but are limited by the hours in a day).

1. I've personally been kicking around some ideas lately about an "airflow register" command that would write the DAG into the metadata DB in a way that could be "gettable" by the workers via the API. This work is very early. I'd love to get some help on it. Perhaps we can set up a zoom chat to discuss drafting an AIP?

2. Limiting worker access to the DB is not only good security practice; it also opens up the door to a lot of valuable features. This feature would be especially close to my heart as it would make the KubernetesExecutor significantly more efficient. It should be possible to set up a system where the workers only ever speak to an API server and never need to touch the DB.

3. This is not something I personally have insight into, but I think it sounds like a good idea.
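
To make points 1 and 2 a little more concrete, here is a minimal sketch of the shape such a system might take. Nothing here is an existing Airflow API: `FakeApiServer`, `Worker`, `register_dag`, and the state strings are all hypothetical names used for illustration. The only idea it tries to show is that the worker holds an API client and never a database handle.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class FakeApiServer:
    """Hypothetical stand-in for an API server.

    In this sketch it is the only component that holds metadata state;
    the dicts below play the role of the metadata DB.
    """

    _dags: Dict[str, dict] = field(default_factory=dict)
    _task_states: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def register_dag(self, dag_id: str, serialized: dict) -> None:
        # Roughly what an "airflow register" command might do: store a
        # serialized DAG so workers can fetch it later.
        self._dags[dag_id] = serialized

    def get_dag(self, dag_id: str) -> dict:
        return self._dags[dag_id]

    def set_task_state(self, dag_id: str, task_id: str, state: str) -> None:
        self._task_states[(dag_id, task_id)] = state

    def get_task_state(self, dag_id: str, task_id: str) -> str:
        return self._task_states[(dag_id, task_id)]


class Worker:
    """Hypothetical worker that only ever talks to the API server."""

    def __init__(self, api: FakeApiServer) -> None:
        self._api = api  # no DB connection is passed in anywhere

    def run_task(self, dag_id: str, task_id: str, work: Callable[[], None]) -> str:
        dag = self._api.get_dag(dag_id)
        if task_id not in dag["tasks"]:
            raise KeyError(f"unknown task {task_id!r} in DAG {dag_id!r}")
        self._api.set_task_state(dag_id, task_id, "running")
        try:
            work()
        except Exception:
            # Report the failure through the API, then let the error propagate.
            self._api.set_task_state(dag_id, task_id, "failed")
            raise
        self._api.set_task_state(dag_id, task_id, "success")
        return "success"
```

A real design would replace the in-memory dicts with HTTP calls and add authentication scoped to each task, which is where most of the AIP discussion would need to go.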

Finally, addressing your question about a Cloudera provider. If anything, it would probably give the provider _more_ legitimacy if you hosted it under the Cloudera GitHub org (we very purposely created the provider packages with this workflow in mind). There are multiple places where we can work to surface this provider so it is easy to find and use.

Astronomer has a pretty good sample provider here [https://github.com/astronomer/airflow-provider-sample]. One example of it running in the wild is the Great Expectations provider here [https://github.com/great-expectations/airflow-provider-great-expectations]. I'd also be glad to get you in contact with people who have built providers in the past to help you with that process.

Looking forward to seeing some of these things come to fruition!

Daniel


On Tue, Apr 13, 2021 at 9:43 AM, Ian Buss <ianjb...@gmail.com> wrote:
Hi all,

First a quick introduction: I'm an engineer with Cloudera working on our Data Engineering product (CDE). Airflow is working great for us so far. We've been looking into how we can enhance the multi-tenancy story of Apache Airflow as we currently deploy it. We have the following areas which we'd like (with community consensus) to work on and contribute back to Apache Airflow to enhance the isolation between tenants in a single Airflow deployment.

1. Isolating code execution and parsing of DAG files. At the moment, DAG files are parsed in a few locations in Airflow, including the scheduler and in tasks. There is already the concept of DAG serialization (and we're using that for the web component) but we'd be interested to see if we can sandbox the execution of arbitrary user code to a locked-down process/container without full access to the metadata DB and connection secrets etc. The idea would be to parse and serialize the DAG in this isolated container and pass back a serialized representation for persistence in the DB. Has anyone explored this idea?

2. Limiting task access to the metadata DB. It would be great if we could remove the requirement for tasks to have full access to the metadata DB and to report task status in a different (but still scalable) way. We'd need to tackle access or injection of connection, variable and xcom data as well for each task naturally.

3. Finer-grained access controls on connection secrets. Right now, although there are nice at-rest encryption options with Fernet or Vault, IIUC any DAG can access any connection (and thus any secret). Since the "run as" user is largely defined within the DAG and its tasks, this is challenging for a multi-tenant environment (see caveat below).

Caveat: It's definitely noted that to some extent we should assume that an Airflow deployment is a "trusted" environment and that best practices such as git+PR workflows are the gold standard and that any malicious code and dependencies should be identified through this process. Also that there is a clear admin role for connection management etc.

We have some ideas informally sketched out as to how to address the above but would be keen to hear the community opinion on this and to see if anyone is keen to collaborate on designs and implementation, or to hear if anything is already in the works. In particular I noticed that the very first improvement proposal (AIP-1) addresses much of the above :). However, it seems fairly dormant at the moment.

One other question: we have a provider (operators and hooks) for interacting with Cloudera components that we'd like to contribute to the project. The provider FAQs indicate that new provider contributions are still welcome in the project in 2.x, is that accurate?
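
To illustrate the shape of point 1 above (parse in an isolated process, pass back only a serialized representation), here is a rough sketch using only the Python standard library. It is not how Airflow parses DAGs and it is not a real sandbox: a child process with a filtered environment only stands in for the locked-down container. The function name `parse_in_subprocess` and the convention that a DAG file defines a plain dict named `DAG` are assumptions made for the example.

```python
import json
import os
import subprocess
import sys
import textwrap

# Code run by the child process. It executes the user's file and prints a
# JSON document as its final line of output. Assumption for this sketch:
# the file defines a JSON-serializable dict called DAG.
_CHILD_CODE = textwrap.dedent(
    """
    import json, runpy, sys
    namespace = runpy.run_path(sys.argv[1])
    print(json.dumps(namespace["DAG"]))
    """
)


def parse_in_subprocess(path: str, timeout: float = 30.0) -> dict:
    """Run a DAG file in a separate interpreter and return its serialized form."""
    # Drop AIRFLOW-prefixed variables so connection settings passed via the
    # environment are not visible to the user code. A real design would need
    # much stronger isolation (container, separate user, no network).
    env = {k: v for k, v in os.environ.items() if not k.startswith("AIRFLOW")}
    result = subprocess.run(
        [sys.executable, "-I", "-c", _CHILD_CODE, path],
        capture_output=True,
        text=True,
        timeout=timeout,
        env=env,
        check=True,
    )
    # Only the last line is trusted as the payload; earlier lines may be
    # output printed by the user's own code.
    return json.loads(result.stdout.splitlines()[-1])
```

The parent process never imports the user's module, so anything the file does at import time happens in the child; only the JSON payload crosses back for persistence.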

Thanks in advance!
Ian
