Hi Daniel & Ian,

I am also interested in the idea of a serialized representation that can be executed by workers directly. Can you also add me to the call?
Thanks
Bin

On Tue, Apr 13, 2021 at 2:49 PM Ian Buss <ianjb...@gmail.com> wrote:

> Daniel,
>
> Thanks for your warm welcome and quick response and the advice on
> providers! Will certainly check out the examples you sent.
>
> 1. An "airflow register" command definitely sounds promising, would love
> to collaborate on an AIP there so let's set something up.
> 2. We use KubernetesExecutor exclusively as well. We've noticed
> significant additional load on the metadata DB as we scale up task pods so
> I've also thought about an API-based approach. Such an API could also open
> up the possibility of per-task security tokens which are injected by the
> scheduler, which should improve the security of such a system. Food for
> thought at least. I will start putting some of these thoughts down on paper
> in a sharable format.
>
> Ian
>
> On Tue, Apr 13, 2021 at 7:46 PM Daniel Imberman <daniel.imber...@gmail.com>
> wrote:
>
>> Hi Ian,
>>
>> Firstly, welcome to the Airflow community :). I'm glad to hear you've had
>> a positive experience so far. It's great to hear that you want to
>> contribute back, and I think that multi-tenancy/DAG isolation is a pretty
>> fantastic project for the community as a whole (a lot of these are
>> things we want but are limited by hours in a day).
>>
>> 1. I've personally been kicking around some ideas lately about an
>> "airflow register" command that would write the DAG into the metadata DB in
>> a way that could be "gettable" by the workers via the API. This work is
>> very early. I'd love to get some help on it. Perhaps we can set up a zoom
>> chat to discuss drafting an AIP?
>>
>> 2. Limiting worker access to the DB is not only good security practice;
>> it also opens up the door to a lot of valuable features. This feature would
>> be especially close to my heart as it would make the KubernetesExecutor
>> significantly more efficient.
>> It should be possible to set up a system
>> where the workers only ever speak to an API server and never need to touch
>> the DB.
>>
>> 3. This is not something I personally have insight into, but I think it
>> sounds like a good idea.
>>
>> Finally, addressing your question about a Cloudera provider. If anything,
>> it would probably give the provider _more_ legitimacy if you hosted it
>> under the Cloudera GitHub org (we very purposely created the provider
>> packages with this workflow in mind). There are multiple places where we
>> can work to surface this provider so it is easy to find and use.
>>
>> Astronomer has a pretty good sample provider here
>> <https://github.com/astronomer/airflow-provider-sample>. One example of
>> it running in the wild is the Great Expectations provider here
>> <https://github.com/great-expectations/airflow-provider-great-expectations>.
>> I'd also be glad to get you in contact with people who have built providers
>> in the past to help you with that process.
>>
>> Looking forward to seeing some of these things come to fruition!
>>
>> Daniel
>>
>> On Tue, Apr 13, 2021 at 9:43 AM, Ian Buss <ianjb...@gmail.com> wrote:
>>
>> Hi all,
>>
>> First a quick introduction: I'm an engineer with Cloudera working on our
>> Data Engineering product (CDE). Airflow is working great for us so far.
>> We've been looking into how we can enhance the multi-tenancy story of
>> Apache Airflow as we currently deploy it. We have the following areas which
>> we'd like (with community consensus) to work on and contribute back to
>> Apache Airflow to enhance the isolation between tenants in a single Airflow
>> deployment.
>>
>> 1. Isolating code execution and parsing of DAG files. At the moment, DAG
>> files are parsed in a few locations in Airflow, including the scheduler and
>> in tasks.
>> There is already the concept of DAG serialization (and we're
>> using that for the web component) but we'd be interested to see if we can
>> sandbox the execution of arbitrary user code to a locked down
>> process/container without full access to the metadata DB and connection
>> secrets etc. The idea would be to parse and serialize the DAG in this
>> isolated container and pass back a serialized representation for
>> persistence in the DB. Has anyone explored this idea?
>>
>> 2. Limiting task access to the metadata DB. It would be great if we could
>> remove the requirement for tasks to have full access to the metadata DB and
>> to report task status in a different (but still scalable) way. We'd need to
>> tackle access or injection of connection, variable and xcom data as well
>> for each task naturally.
>>
>> 3. Finer-grained access controls on connection secrets. Right now,
>> although there are nice at-rest encryption options with Fernet or Vault,
>> IIUC any DAG can access any connection (and thus any secret). Since the
>> "run as" user is largely defined within the DAG and its tasks, this is
>> challenging for a multi-tenant environment (see caveat below)
>>
>> Caveat: It's definitely noted that to some extent we should assume that
>> an Airflow deployment is a "trusted" environment and that best practices
>> such as git+PR workflows are the gold standard and that any malicious code
>> and dependencies should be identified through this process. Also that there
>> is a clear admin role for connection management etc.
>>
>> We have some ideas informally sketched out as to how to address the above
>> but would be keen to hear the community opinion on this and to see if
>> anyone is keen to collaborate on designs and implementation, or to hear if
>> anything is already in the works. In particular I noticed that the very
>> first improvement proposal (AIP-1) addresses much of the above :). However,
>> it seems fairly dormant at the moment.
>>
>> One other question: we have a provider (operators and hooks) for
>> interacting with Cloudera components that we'd like to contribute to the
>> project. The provider FAQs indicate that new provider contributions are
>> still welcome in the project in 2.x, is that accurate?
>>
>> Thanks in advance!
>>
>> Ian
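
For readers following the thread, point 1 of Ian's original message (parse the DAG file in an isolated process and pass back only a serialized representation) could be sketched roughly as below. This is a toy illustration using only the Python standard library, not Airflow's actual serialization code: the convention that a DAG file defines a dict named `DAG`, the `parse_in_subprocess` helper, and the `CHILD_SCRIPT` name are all assumptions made up for this example, and a separate interpreter is only a stand-in for the locked-down container discussed above.

```python
import json
import os
import subprocess
import sys
import textwrap

# Hypothetical child-side script: it executes the user's file and prints a
# JSON description of it. A real system would run this step inside a
# locked-down container rather than a plain subprocess.
CHILD_SCRIPT = textwrap.dedent(
    """
    import json, runpy, sys
    namespace = runpy.run_path(sys.argv[1])
    dag = namespace["DAG"]  # assumed convention: the file defines a dict named DAG
    json.dump({"dag_id": dag["dag_id"], "tasks": sorted(dag["tasks"])}, sys.stdout)
    """
)


def parse_in_subprocess(dag_file: str, timeout: float = 30.0) -> dict:
    """Run the user's file in a separate interpreter and return plain JSON data."""
    # Illustrative only: drop AIRFLOW* variables so the child does not inherit
    # settings such as a metadata DB connection string from the parent.
    child_env = {k: v for k, v in os.environ.items() if not k.startswith("AIRFLOW")}
    result = subprocess.run(
        [sys.executable, "-I", "-c", CHILD_SCRIPT, dag_file],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise if the user code fails instead of returning bad data
        env=child_env,
    )
    # The parent never imports the user's module; it only parses the JSON text.
    return json.loads(result.stdout)
```

The parent process here only ever handles the JSON text, which is what would be persisted in the DB in the proposal. The sketch does not cover real isolation (filesystem, network, resource limits) or user code that writes to stdout; those would need a proper sandbox and a dedicated output channel.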