Re: [DISCUSS] AIP-1 and Airflow multi-tenancy

Andrew Godwin Wed, 14 Apr 2021 09:47:16 -0700

I'd quite like to be involved as well since this is something I'm very
interested in getting in - 8am PST on the 21st works fine, as well.


Andrew

On Wed, Apr 14, 2021 at 10:35 AM Daniel Imberman <daniel.imber...@gmail.com>
wrote:

> How about Wednesday, April 21 at 8:00AM PST?
>
> On Wed, Apr 14, 2021 at 9:33 AM, Xinbin Huang <bin.huan...@gmail.com>
> wrote:
>
> I am available any day.
>
> On Wed, Apr 14, 2021, 9:32 AM Daniel Imberman <daniel.imber...@gmail.com>
> wrote:
>
>> Hi everyone!
>>
>> Would people be available around 8AM/9AM PST some point next week? I’m in
>> PST and Ian is UTC+1 so would be great to find a timezone that accommodates
>> everyone.
>>
>> Daniel
>> On Wed, Apr 14, 2021 at 6:26 AM, Ryan Hatter <ryannhat...@gmail.com>
>> wrote:
>>
>> I’d also like to be added please :)
>>
>> On Apr 13, 2021, at 21:27, Xinbin Huang <bin.huan...@gmail.com> wrote:
>>
>>
>> Hi Daniel & Ian,
>>
>> I am also interested in the idea of a serialization representation that
>> can be executed by workers directly. Can you also add me to the call?
>>
>> Thanks
>> Bin
>>
>> On Tue, Apr 13, 2021 at 2:49 PM Ian Buss <ianjb...@gmail.com> wrote:
>>
>>> Daniel,
>>>
>>> Thanks for your warm welcome and quick response and the advice on
>>> providers! Will certainly check out the examples you sent.
>>>
>>> 1. An "airflow register" command definitely sounds promising, would love
>>> to collaborate on an AIP there so let's set something up.
>>> 2. We use KubernetesExecutor exclusively as well. We've noticed
>>> significant additional load on the metadata DB as we scale up task pods so
>>> I've also thought about an API-based approach. Such an API could also open
>>> up the possibility of per-task security tokens which are injected by the
>>> scheduler, which should improve the security of such a system. Food for
>>> thought at least. I will start putting some of these thoughts down on paper
>>> in a sharable format.
>>>
>>> Ian
>>>
>>> On Tue, Apr 13, 2021 at 7:46 PM Daniel Imberman <
>>> daniel.imber...@gmail.com> wrote:
>>>
>>>> Hi Ian,
>>>>
>>>>
>>>> Firstly, welcome to the Airflow community :). I'm glad to hear you've
>>>> had a positive experience so far. It's great to hear that you want to
>>>> contribute back, and I think that multi-tenancy/DAG isolation is a pretty
>>>> fantastic project for the community as a whole (a lot of these are
>>>> things we want but are limited by hours in a day).
>>>>
>>>>
>>>> 1. I've personally been kicking around some ideas lately about an
>>>> "airflow register" command that would write the DAG into the metadata DB in
>>>> a way that could be "gettable" by the workers via the API. This work is
>>>> very early. I'd love to get some help on it. Perhaps we can set up a zoom
>>>> chat to discuss drafting an AIP?
>>>>
>>>>
>>>> 2. Limiting worker access to the DB is not only good security practice;
>>>> it also opens up the door to a lot of valuable features. This feature would
>>>> be especially close to my heart as it would make the KubernetesExecutor
>>>> significantly more efficient. It should be possible to set up a system
>>>> where the workers only ever speak to an API server and never need to touch
>>>> the DB.
>>>>
>>>>
>>>> 3. This is not something I personally have insight into, but I think it
>>>> sounds like a good idea.
>>>>
>>>>
>>>> Finally, addressing your question about a Cloudera provider. If
>>>> anything, it would probably give the provider _more_ legitimacy if you
>>>> hosted it under the Cloudera GitHub org (we very purposely created the
>>>> provider packages with this workflow in mind). There are multiple places
>>>> where we can work to surface this provider so it is easy to find and use.
>>>>
>>>>
>>>> Astronomer has a pretty good sample provider here
>>>> <https://github.com/astronomer/airflow-provider-sample>. One example
>>>> of it running in the wild is the Great Expectations provider here
>>>> <https://github.com/great-expectations/airflow-provider-great-expectations>.
>>>> I'd also be glad to get you in contact with people who have built providers
>>>> in the past to help you with that process.
>>>>
>>>>
>>>> Looking forward to seeing some of these things come to fruition!
>>>>
>>>>
>>>> Daniel
>>>>
>>>> On Tue, Apr 13, 2021 at 9:43 AM, Ian Buss <ianjb...@gmail.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> First a quick introduction: I'm an engineer with Cloudera working on
>>>> our Data Engineering product (CDE). Airflow is working great for us so far.
>>>> We've been looking into how we can enhance the multi-tenancy story of
>>>> Apache Airflow as we currently deploy it. We have the following areas which
>>>> we'd like (with community consensus) to work on and contribute back to
>>>> Apache Airflow to enhance the isolation between tenants in a single Airflow
>>>> deployment.
>>>>
>>>> 1. Isolating code execution and parsing of DAG files. At the moment,
>>>> DAG files are parsed in a few locations in Airflow, including the scheduler
>>>> and in tasks. There is already the concept of DAG serialization (and we're
>>>> using that for the web component) but we'd be interested to see if we can
>>>> sandbox the execution of arbitrary user code to a locked down
>>>> process/container without full access to the metadata DB and connection
>>>> secrets etc. The idea would be to parse and serialize the DAG in this
>>>> isolated container and pass back a serialized representation for
>>>> persistence in the DB. Has anyone explored this idea?
>>>>
>>>> 2. Limiting task access to the metadata DB. It would be great if we
>>>> could remove the requirement for tasks to have full access to the metadata
>>>> DB and to report task status in a different (but still scalable) way. We'd
>>>> also need to tackle access to, or injection of, connection, variable and
>>>> XCom data for each task.
>>>>
>>>> 3. Finer-grained access controls on connection secrets. Right now,
>>>> although there are nice at-rest encryption options with Fernet or Vault,
>>>> IIUC any DAG can access any connection (and thus any secret). Since the
>>>> "run as" user is largely defined within the DAG and its tasks, this is
>>>> challenging for a multi-tenant environment (see caveat below).
>>>>
>>>> Caveat: It's definitely noted that to some extent we should assume that
>>>> an Airflow deployment is a "trusted" environment and that best practices
>>>> such as git+PR workflows are the gold standard and that any malicious code
>>>> and dependencies should be identified through this process. Also that there
>>>> is a clear admin role for connection management etc.
>>>>
>>>> We have some ideas informally sketched out as to how to address the
>>>> above but would be keen to hear the community opinion on this and to see if
>>>> anyone is keen to collaborate on designs and implementation, or to hear if
>>>> anything is already in the works. In particular I noticed that the very
>>>> first improvement proposal (AIP-1) addresses much of the above :). However,
>>>> it seems fairly dormant at the moment.
>>>>
>>>> One other question: we have a provider (operators and hooks) for
>>>> interacting with Cloudera components that we'd like to contribute to the
>>>> project. The provider FAQs indicate that new provider contributions are
>>>> still welcome in the project in 2.x, is that accurate?
>>>>
>>>> Thanks in advance!
>>>>
>>>> Ian
>>>>
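
For readers following the thread, the isolated-parsing idea (item 1 in Ian's
original message) could be sketched roughly as follows. This is only an
illustration under stated assumptions, not how Airflow parses or serializes
DAGs: the user file is assumed to define a plain dict named DAG_SPEC (a
hypothetical stand-in for a real DAG object), and the helper names
parse_in_subprocess and demo are made up for this example. A child process
with a scrubbed environment is also not a real sandbox; a locked-down
container, as proposed in the thread, would be needed for actual isolation.

```python
import json
import os
import subprocess
import sys
import tempfile

# Code run by the child process: execute the user file, then emit plain JSON only.
# Anything the user code prints is redirected to stderr so stdout stays parseable.
CHILD_CODE = """
import contextlib, json, runpy, sys
with contextlib.redirect_stdout(sys.stderr):
    namespace = runpy.run_path(sys.argv[1])
spec = namespace["DAG_SPEC"]  # hypothetical convention, not an Airflow API
json.dump({"dag_id": str(spec["dag_id"]), "tasks": sorted(map(str, spec["tasks"]))}, sys.stdout)
"""

# Only these variables are forwarded; connection secrets in the parent
# environment are therefore not visible to the user code.
ENV_ALLOWLIST = ("PATH", "LD_LIBRARY_PATH", "SYSTEMROOT")


def parse_in_subprocess(dag_file: str, timeout: float = 30.0) -> dict:
    """Run untrusted DAG code in a separate interpreter and return its serialized form."""
    child_env = {k: os.environ[k] for k in ENV_ALLOWLIST if k in os.environ}
    result = subprocess.run(
        [sys.executable, "-I", "-c", CHILD_CODE, dag_file],  # -I: isolated mode
        env=child_env,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return json.loads(result.stdout)


def demo() -> dict:
    """Write a sample user file that tries to read secrets, then parse it."""
    os.environ["AIRFLOW_CONN_DEMO"] = "postgres://user:secret@host/db"  # must not leak
    user_code = (
        "import os\n"
        "print('noise on stdout')\n"
        "leaked = [k for k in os.environ if k.startswith('AIRFLOW_')]\n"
        "DAG_SPEC = {'dag_id': 'demo', 'tasks': ['load', 'extract'] + leaked}\n"
    )
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "user_dag.py")
        with open(path, "w") as fh:
            fh.write(user_code)
        return parse_in_subprocess(path)
```

The parent only ever sees the JSON document, which is what would be persisted
in the metadata DB; the user code itself never runs in the parent process.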
>>>>
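
The per-task security token idea Ian mentions (tokens injected by the
scheduler so that task pods talk to an API rather than to the metadata DB)
could look something like the sketch below. It is a minimal illustration
using only the Python standard library, not an Airflow API: the names SECRET,
issue_token and verify_token are hypothetical, and a real design would need
key management and rotation and would probably use a standard token format.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

# Hypothetical secret known to the scheduler and the API server, never to task pods.
SECRET = b"scheduler-only-secret"


def _sign(payload: bytes) -> bytes:
    """Return the hex HMAC-SHA256 signature of the payload as bytes."""
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest().encode()


def issue_token(dag_id: str, task_id: str, run_id: str,
                ttl_seconds: int = 3600, now: Optional[float] = None) -> str:
    """Scheduler side: mint a token scoped to one task instance."""
    issued_at = time.time() if now is None else now
    claims = {"dag_id": dag_id, "task_id": task_id, "run_id": run_id,
              "exp": issued_at + ttl_seconds}
    payload = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    return payload.decode() + "." + _sign(payload).decode()


def verify_token(token: str, dag_id: str, task_id: str, run_id: str,
                 now: Optional[float] = None) -> bool:
    """API side: accept the token only for the task instance it was issued for."""
    try:
        payload_b64, signature = token.split(".")
    except ValueError:
        return False  # not exactly one separator, so not one of our tokens
    payload = payload_b64.encode()
    # Constant-time comparison; the payload is decoded only after the signature checks out.
    if not hmac.compare_digest(_sign(payload), signature.encode()):
        return False
    claims = json.loads(base64.urlsafe_b64decode(payload))
    current = time.time() if now is None else now
    return (claims["exp"] > current
            and claims["dag_id"] == dag_id
            and claims["task_id"] == task_id
            and claims["run_id"] == run_id)
```

With something along these lines, a task pod would present its token when
reporting status or requesting a connection, and the API server could refuse
requests for any task instance other than the one the token names.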
