Hi Ian,
Firstly, welcome to the Airflow community :). I'm glad to hear you've had a
positive experience so far. It's great to hear that you want to contribute
back, and I think that multi-tenancy/DAG isolation is a pretty fantastic
project for the community as a whole (a lot of these are things we
want but are limited by hours in a day).
1. I've personally been kicking around some ideas lately about an "airflow
register" command that would write the DAG into the metadata DB in a way
that could be "gettable" by the workers via the API. This work is very
early. I'd love to get some help on it. Perhaps we can set up a Zoom chat
to discuss drafting an AIP?
2. Limiting worker access to the DB is not only good security practice; it
also opens up the door to a lot of valuable features. This feature would be
especially close to my heart as it would make the KubernetesExecutor
significantly more efficient. It should be possible to set up a system
where the workers only ever speak to an API server and never need to touch
the DB.
3. This is not something I personally have insight into, but I think it
sounds like a good idea.
Finally, to address your question about a Cloudera provider: if anything,
it would probably give the provider _more_ legitimacy if you hosted it
under the Cloudera GitHub org (we very purposely created the provider
packages with this workflow in mind). There are multiple places where we
can work to surface this provider so it is easy to find and use.
Astronomer has a pretty good sample provider here
[https://github.com/astronomer/airflow-provider-sample]. One example of it
running in the wild is the Great Expectations provider here
[https://github.com/great-expectations/airflow-provider-great-expectations].
I'd also be glad to get you in contact with people who have built
providers in the past to help you with that process.
Looking forward to seeing some of these things come to fruition!
Daniel
On Tue, Apr 13, 2021 at 9:43 AM, Ian Buss <ianjb...@gmail.com> wrote:
Hi all,
First a quick introduction: I'm an engineer with Cloudera working on our
Data Engineering product (CDE). Airflow is working great for us so far.
We've been looking into how we can enhance the multi-tenancy story of
Apache Airflow as we currently deploy it. We have the following areas which
we'd like (with community consensus) to work on and contribute back to
Apache Airflow to enhance the isolation between tenants in a single Airflow
deployment.
1. Isolating code execution and parsing of DAG files. At the moment, DAG
files are parsed in a few locations in Airflow, including the scheduler and
in tasks. There is already the concept of DAG serialization (and we're
using that for the web component) but we'd be interested to see if we can
sandbox the execution of arbitrary user code in a locked-down
process/container without full access to the metadata DB and connection
secrets etc. The idea would be to parse and serialize the DAG in this
isolated container and pass back a serialized representation for
persistence in the DB. Has anyone explored this idea?
2. Limiting task access to the metadata DB. It would be great if we could
remove the requirement for tasks to have full access to the metadata DB and
to report task status in a different (but still scalable) way. Naturally,
we'd also need to tackle access to, or injection of, connection, variable,
and XCom data for each task.
3. Finer-grained access controls on connection secrets. Right now, although
there are nice at-rest encryption options with Fernet or Vault, IIUC any
DAG can access any connection (and thus any secret). Since the "run as"
user is largely defined within the DAG and its tasks, this is challenging
for a multi-tenant environment (see caveat below).
Caveat: We recognize that, to some extent, an Airflow deployment should be
assumed to be a "trusted" environment, that best practices such as git+PR
workflows are the gold standard, and that any malicious code and
dependencies should be identified through this process. We also recognize
that there is a clear admin role for connection management etc.
We have some ideas informally sketched out as to how to address the above
but would be keen to hear the community opinion on this and to see if
anyone is keen to collaborate on designs and implementation, or to hear if
anything is already in the works. In particular I noticed that the very
first improvement proposal (AIP-1) addresses much of the above :). However,
it seems fairly dormant at the moment.
One other question: we have a provider (operators and hooks) for
interacting with Cloudera components that we'd like to contribute to the
project. The provider FAQs indicate that new provider contributions are
still welcome in the project in 2.x. Is that accurate?
Thanks in advance!
Ian