Hi all, I actually can’t do Wednesday next week as I’m moving house :) Any chance we could do Thursday or Friday at the same time?
Cheers
Ian

On 14 Apr 2021, 17:49 +0100, Kaxil Naik <kaxiln...@gmail.com>, wrote:
> Just a few comments here:
>
> Currently -- at least for the foreseeable future -- Airflow workers will need
> access to the DAG files, so workers cannot run using the serialized DAGs.
>
> Also, serialized DAGs do not even have all the info needed to run them.
> Currently the serialization happens in the parsing process in the scheduler,
> which can in future be separated out as a separate "parsing" component, but
> that won't solve the "isolation" problem you are trying to solve. The only
> current way it can be solved is pickling -- and we have strictly decided
> against using pickling for DAGs.
>
> The idea in Statement (2) & (3) would help solve the isolation problem in (1)
> and can be done with some work now.
>
> Happy to talk about it in more detail here or on a call; the time Daniel
> suggested works for me.
>
> Regards,
> Kaxil
>
> > On Wed, Apr 14, 2021 at 5:35 PM Daniel Imberman <daniel.imber...@gmail.com>
> > wrote:
> > > How about Wednesday, April 21 at 8:00AM PST?
> > >
> > > On Wed, Apr 14, 2021 at 9:33 AM, Xinbin Huang <bin.huan...@gmail.com>
> > > wrote:
> > > > I am available any day.
> > > >
> > > > > On Wed, Apr 14, 2021, 9:32 AM Daniel Imberman
> > > > > <daniel.imber...@gmail.com> wrote:
> > > > > > Hi everyone!
> > > > > >
> > > > > > Would people be available around 8AM/9AM PST some point next week?
> > > > > > I’m in PST and Ian is UTC+1 so would be great to find a time
> > > > > > that accommodates everyone.
> > > > > >
> > > > > > Daniel
> > > > > > On Wed, Apr 14, 2021 at 6:26 AM, Ryan Hatter
> > > > > > <ryannhat...@gmail.com> wrote:
> > > > > > > I’d also like to be added please :)
> > > > > > >
> > > > > > > > On Apr 13, 2021, at 21:27, Xinbin Huang <bin.huan...@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > Hi Daniel & Ian,
> > > > > > > >
> > > > > > > > I am also interested in the idea of a serialization
> > > > > > > > representation that can be executed by workers directly. Can
> > > > > > > > you also add me to the call?
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Bin
> > > > > > > >
> > > > > > > > > On Tue, Apr 13, 2021 at 2:49 PM Ian Buss <ianjb...@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > Daniel,
> > > > > > > > > >
> > > > > > > > > > Thanks for your warm welcome and quick response and the
> > > > > > > > > > advice on providers! Will certainly check out the examples
> > > > > > > > > > you sent.
> > > > > > > > > >
> > > > > > > > > > 1. An "airflow register" command definitely sounds
> > > > > > > > > > promising, would love to collaborate on an AIP there so
> > > > > > > > > > let's set something up.
> > > > > > > > > > 2. We use KubernetesExecutor exclusively as well. We've
> > > > > > > > > > noticed significant additional load on the metadata DB as
> > > > > > > > > > we scale up task pods so I've also thought about an
> > > > > > > > > > API-based approach. Such an API could also open up the
> > > > > > > > > > possibility of per-task security tokens which are injected
> > > > > > > > > > by the scheduler, which should improve the security of such
> > > > > > > > > > a system. Food for thought at least. I will start putting
> > > > > > > > > > some of these thoughts down on paper in a sharable format.
> > > > > > > > > >
> > > > > > > > > > Ian
> > > > > > > > > >
> > > > > > > > > > > On Tue, Apr 13, 2021 at 7:46 PM Daniel Imberman
> > > > > > > > > > > <daniel.imber...@gmail.com> wrote:
> > > > > > > > > > > > Hi Ian,
> > > > > > > > > > > >
> > > > > > > > > > > > Firstly, welcome to the Airflow community :). I'm glad
> > > > > > > > > > > > to hear you've had a positive experience so far. It's
> > > > > > > > > > > > great to hear that you want to contribute back, and I
> > > > > > > > > > > > think that multi-tenancy/DAG isolation is a pretty
> > > > > > > > > > > > fantastic project for the community as a whole (a lot
> > > > > > > > > > > > of these are things we want but are limited by
> > > > > > > > > > > > hours in a day).
> > > > > > > > > > > >
> > > > > > > > > > > > 1. I've personally been kicking around some ideas
> > > > > > > > > > > > lately about an "airflow register" command that would
> > > > > > > > > > > > write the DAG into the metadata DB in a way that could
> > > > > > > > > > > > be "gettable" by the workers via the API. This work is
> > > > > > > > > > > > very early. I'd love to get some help on it. Perhaps we
> > > > > > > > > > > > can set up a zoom chat to discuss drafting an AIP?
> > > > > > > > > > > >
> > > > > > > > > > > > 2. Limiting worker access to the DB is not only good
> > > > > > > > > > > > security practice; it also opens up the door to a lot
> > > > > > > > > > > > of valuable features. This feature would be especially
> > > > > > > > > > > > close to my heart as it would make the
> > > > > > > > > > > > KubernetesExecutor significantly more efficient. It
> > > > > > > > > > > > should be possible to set up a system where the workers
> > > > > > > > > > > > only ever speak to an API server and never need to
> > > > > > > > > > > > touch the DB.
> > > > > > > > > > > >
> > > > > > > > > > > > 3. This is not something I personally have insight
> > > > > > > > > > > > into, but I think it sounds like a good idea.
> > > > > > > > > > > >
> > > > > > > > > > > > Finally, addressing your question about a Cloudera
> > > > > > > > > > > > provider. If anything, it would probably give the
> > > > > > > > > > > > provider _more_ legitimacy if you hosted it under the
> > > > > > > > > > > > Cloudera GitHub org (we very purposely created the
> > > > > > > > > > > > provider packages with this workflow in mind). There
> > > > > > > > > > > > are multiple places where we can work to surface this
> > > > > > > > > > > > provider so it is easy to find and use.
> > > > > > > > > > > >
> > > > > > > > > > > > Astronomer has a pretty good sample provider here. One
> > > > > > > > > > > > example of it running in the wild is the Great
> > > > > > > > > > > > Expectations provider here. I'd also be glad to get you
> > > > > > > > > > > > in contact with people who have built providers in the
> > > > > > > > > > > > past to help you with that process.
> > > > > > > > > > > >
> > > > > > > > > > > > Looking forward to seeing some of these things come to
> > > > > > > > > > > > fruition!
> > > > > > > > > > > >
> > > > > > > > > > > > Daniel
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Apr 13, 2021 at 9:43 AM, Ian Buss
> > > > > > > > > > > > <ianjb...@gmail.com> wrote:
> > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > >
> > > > > > > > > > > > > First a quick introduction: I'm an engineer with
> > > > > > > > > > > > > Cloudera working on our Data Engineering product
> > > > > > > > > > > > > (CDE). Airflow is working great for us so far. We've
> > > > > > > > > > > > > been looking into how we can enhance the
> > > > > > > > > > > > > multi-tenancy story of Apache Airflow as we currently
> > > > > > > > > > > > > deploy it.
> > > > > > > > > > > > > We have the following areas which we'd
> > > > > > > > > > > > > like (with community consensus) to work on and
> > > > > > > > > > > > > contribute back to Apache Airflow to enhance the
> > > > > > > > > > > > > isolation between tenants in a single Airflow
> > > > > > > > > > > > > deployment.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. Isolating code execution and parsing of DAG files.
> > > > > > > > > > > > > At the moment, DAG files are parsed in a few
> > > > > > > > > > > > > locations in Airflow, including the scheduler and in
> > > > > > > > > > > > > tasks. There is already the concept of DAG
> > > > > > > > > > > > > serialization (and we're using that for the web
> > > > > > > > > > > > > component) but we'd be interested to see if we can
> > > > > > > > > > > > > sandbox the execution of arbitrary user code to a
> > > > > > > > > > > > > locked down process/container without full access to
> > > > > > > > > > > > > the metadata DB and connection secrets etc. The idea
> > > > > > > > > > > > > would be to parse and serialize the DAG in this
> > > > > > > > > > > > > isolated container and pass back a serialized
> > > > > > > > > > > > > representation for persistence in the DB. Has anyone
> > > > > > > > > > > > > explored this idea?
> > > > > > > > > > > > >
> > > > > > > > > > > > > 2. Limiting task access to the metadata DB. It would
> > > > > > > > > > > > > be great if we could remove the requirement for tasks
> > > > > > > > > > > > > to have full access to the metadata DB and to report
> > > > > > > > > > > > > task status in a different (but still scalable) way.
> > > > > > > > > > > > > We'd need to tackle access or injection of
> > > > > > > > > > > > > connection, variable and xcom data as well for each
> > > > > > > > > > > > > task naturally.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 3. Finer-grained access controls on connection
> > > > > > > > > > > > > secrets.
> > > > > > > > > > > > > Right now, although there are nice at-rest
> > > > > > > > > > > > > encryption options with Fernet or Vault, IIUC any DAG
> > > > > > > > > > > > > can access any connection (and thus any secret).
> > > > > > > > > > > > > Since the "run as" user is largely defined within the
> > > > > > > > > > > > > DAG and its tasks, this is challenging for a
> > > > > > > > > > > > > multi-tenant environment (see caveat below)
> > > > > > > > > > > > >
> > > > > > > > > > > > > Caveat: It's definitely noted that to some extent we
> > > > > > > > > > > > > should assume that an Airflow deployment is a
> > > > > > > > > > > > > "trusted" environment and that best practices such as
> > > > > > > > > > > > > git+PR workflows are the gold standard and that any
> > > > > > > > > > > > > malicious code and dependencies should be identified
> > > > > > > > > > > > > through this process. Also that there is a clear
> > > > > > > > > > > > > admin role for connection management etc.
> > > > > > > > > > > > >
> > > > > > > > > > > > > We have some ideas informally sketched out as to how
> > > > > > > > > > > > > to address the above but would be keen to hear the
> > > > > > > > > > > > > community opinion on this and to see if anyone is
> > > > > > > > > > > > > keen to collaborate on designs and implementation, or
> > > > > > > > > > > > > to hear if anything is already in the works. In
> > > > > > > > > > > > > particular I noticed that the very first improvement
> > > > > > > > > > > > > proposal (AIP-1) addresses much of the above :).
> > > > > > > > > > > > > However, it seems fairly dormant at the moment.
> > > > > > > > > > > > >
> > > > > > > > > > > > > One other question: we have a provider (operators and
> > > > > > > > > > > > > hooks) for interacting with Cloudera components that
> > > > > > > > > > > > > we'd like to contribute to the project.
> > > > > > > > > > > > > The provider
> > > > > > > > > > > > > FAQs indicate that new provider contributions are
> > > > > > > > > > > > > still welcome in the project in 2.x, is that accurate?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks in advance!
> > > > > > > > > > > > >
> > > > > > > > > > > > > Ian