Re: [DISCUSS] AIP-1 and Airflow multi-tenancy

Xinbin Huang Tue, 13 Apr 2021 18:27:01 -0700

Hi Daniel & Ian,

I am also interested in the idea of a serialization representation that can
be executed by workers directly. Can you also add me to the call?


Thanks
Bin

On Tue, Apr 13, 2021 at 2:49 PM Ian Buss <[email protected]> wrote:

> Daniel,
>
> Thanks for your warm welcome and quick response and the advice on
> providers! Will certainly check out the examples you sent.
>
> 1. An "airflow register" command definitely sounds promising, would love
> to collaborate on an AIP there so let's set something up.
> 2. We use KubernetesExecutor exclusively as well. We've noticed
> significant additional load on the metadata DB as we scale up task pods so
> I've also thought about an API-based approach. Such an API could also open
> up the possibility of per-task security tokens which are injected by the
> scheduler, which should improve the security of such a system. Food for
> thought at least. I will start putting some of these thoughts down on paper
> in a sharable format.
>
> Ian
>
> On Tue, Apr 13, 2021 at 7:46 PM Daniel Imberman <[email protected]>
> wrote:
>
>> Hi Ian,
>>
>>
>> Firstly, welcome to the Airflow community :). I'm glad to hear you've had
>> a positive experience so far. It's great to hear that you want to
>> contribute back, and I think that multi-tenancy/DAG isolation is a pretty
>> fantastic project for the community as a whole (a lot of these are
>> things we want but are limited by hours in a day).
>>
>>
>> 1. I've personally been kicking around some ideas lately about an
>> "airflow register" command that would write the DAG into the metadata DB in
>> a way that could be "gettable" by the workers via the API. This work is
>> very early. I'd love to get some help on it. Perhaps we can set up a Zoom
>> chat to discuss drafting an AIP?
>>
>>
>> 2. Limiting worker access to the DB is not only good security practice;
>> it also opens up the door to a lot of valuable features. This feature would
>> be especially close to my heart as it would make the KubernetesExecutor
>> significantly more efficient. It should be possible to set up a system
>> where the workers only ever speak to an API server and never need to touch
>> the DB.
>>
>>
>> 3. This is not something I personally have insight into, but I think it
>> sounds like a good idea.
>>
>>
>> Finally, addressing your question about a Cloudera provider. If anything,
>> it would probably give the provider _more_ legitimacy if you hosted it
>> under the Cloudera GitHub org (we very purposely created the provider
>> packages with this workflow in mind). There are multiple places where we
>> can work to surface this provider so it is easy to find and use.
>>
>>
>> Astronomer has a pretty good sample provider here
>> <https://github.com/astronomer/airflow-provider-sample>. One example of
>> it running in the wild is the Great Expectations provider here
>> <https://github.com/great-expectations/airflow-provider-great-expectations>.
>> I'd also be glad to get you in contact with people who have built providers
>> in the past to help you with that process.
>>
>>
>> Looking forward to seeing some of these things come to fruition!
>>
>>
>> Daniel
>>
>> On Tue, Apr 13, 2021 at 9:43 AM, Ian Buss <[email protected]> wrote:
>>
>> Hi all,
>>
>> First a quick introduction: I'm an engineer with Cloudera working on our
>> Data Engineering product (CDE). Airflow is working great for us so far.
>> We've been looking into how we can enhance the multi-tenancy story of
>> Apache Airflow as we currently deploy it. We have the following areas which
>> we'd like (with community consensus) to work on and contribute back to
>> Apache Airflow to enhance the isolation between tenants in a single Airflow
>> deployment.
>>
>> 1. Isolating code execution and parsing of DAG files. At the moment, DAG
>> files are parsed in a few locations in Airflow, including the scheduler and
>> in tasks. There is already the concept of DAG serialization (and we're
>> using that for the web component) but we'd be interested to see if we can
>> sandbox the execution of arbitrary user code to a locked down
>> process/container without full access to the metadata DB and connection
>> secrets etc. The idea would be to parse and serialize the DAG in this
>> isolated container and pass back a serialized representation for
>> persistence in the DB. Has anyone explored this idea?
>>
>> 2. Limiting task access to the metadata DB. It would be great if we could
>> remove the requirement for tasks to have full access to the metadata DB and
>> to report task status in a different (but still scalable) way. Naturally,
>> we'd also need to handle access to, or injection of, connection, variable
>> and XCom data for each task.
>>
>> 3. Finer-grained access controls on connection secrets. Right now,
>> although there are nice at-rest encryption options with Fernet or Vault,
>> IIUC any DAG can access any connection (and thus any secret). Since the
>> "run as" user is largely defined within the DAG and its tasks, this is
>> challenging for a multi-tenant environment (see caveat below).
>>
>> Caveat: It's definitely noted that to some extent we should assume that
>> an Airflow deployment is a "trusted" environment and that best practices
>> such as git+PR workflows are the gold standard and that any malicious code
>> and dependencies should be identified through this process, and also that
>> there is a clear admin role for connection management etc.
>>
>> We have some ideas informally sketched out as to how to address the above
>> but would be keen to hear the community opinion on this and to see if
>> anyone is keen to collaborate on designs and implementation, or to hear if
>> anything is already in the works. In particular I noticed that the very
>> first improvement proposal (AIP-1) addresses much of the above :). However,
>> it seems fairly dormant at the moment.
>>
>> One other question: we have a provider (operators and hooks) for
>> interacting with Cloudera components that we'd like to contribute to the
>> project. The provider FAQs indicate that new provider contributions are
>> still welcome in the project in 2.x, is that accurate?
>>
>> Thanks in advance!
>>
>> Ian
>>
>>
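
As a rough illustration of item 1 in Ian's original message (parse the DAG
file in a locked-down process and pass back only a serialized
representation), here is a minimal Python sketch. It is not Airflow's DAG
serialization code: the function name parse_in_sandbox, the convention that
the DAG file defines a plain dict named DAG, and the JSON-over-stdout
hand-off are all assumptions made for illustration. The only Airflow-specific
detail used is that configuration and connection overrides can be supplied
through environment variables whose names start with AIRFLOW, so the child
process is started without them.

```python
import json
import os
import subprocess
import sys
import tempfile

# Code run inside the child interpreter. It executes the user's file and
# emits only a JSON document on stdout; anything the user code prints is
# redirected to stderr so it cannot corrupt the hand-off.
_CHILD = r"""
import contextlib, json, runpy, sys
with contextlib.redirect_stdout(sys.stderr):
    namespace = runpy.run_path(sys.argv[1])
# Assumption: the file exposes a JSON-serializable dict called DAG.
json.dump(namespace["DAG"], sys.stdout)
"""


def parse_in_sandbox(dag_file: str, timeout: float = 30.0) -> dict:
    """Parse dag_file in a separate process and return its serialized form.

    Hypothetical helper: the parent never imports or executes the user code.
    """
    # Drop AIRFLOW* variables so DB URIs and connection secrets passed via
    # the environment are not visible to the user code.
    env = {k: v for k, v in os.environ.items() if not k.startswith("AIRFLOW")}
    result = subprocess.run(
        [sys.executable, "-I", "-c", _CHILD, dag_file],  # -I: isolated mode
        env=env,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    # Small demo with a throwaway "DAG file".
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write('print("noise from user code")\n')
        f.write('DAG = {"dag_id": "example", "tasks": ["extract", "load"]}\n')
    try:
        print(parse_in_sandbox(f.name))
    finally:
        os.unlink(f.name)
```

A separate process with a filtered environment is not a real sandbox on its
own; the container-level isolation Ian describes would still be needed to
restrict network and filesystem access. The point of the sketch is only the
shape of the hand-off: user code runs elsewhere, and the trusted side receives
data rather than executable objects.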
