I'll join as well (I believe the zoom link will work without an invite).

On Wed, Apr 21, 2021 at 10:48 AM Dimitris Stafylarakis <xan...@gmail.com> wrote:
> hi all,
>
> great to read about this, I'd like to join in! Can I just join using the
> zoom link tomorrow or do I need an invitation? (If I do need one, please
> invite me :))
>
> cheers
>
>
> On Wed, Apr 14, 2021 at 8:15 PM Daniel Imberman <daniel.imber...@gmail.com>
> wrote:
>
>> Thank you Ian,
>>
>> I’ve invited everyone on this thread to the meeting with that zoom link.
>> Anyone else who wants to join can add the calendar event here
>> <https://calendar.google.com/event?action=TEMPLATE&tmeid=Mm4zN2Q3MnFwNnBqbW9hMmNocXMyNzJpdHYgZGFuaWVsQGFzdHJvbm9tZXIuaW8&tmsrc=daniel%40astronomer.io>
>>
>> On Wed, Apr 14, 2021 at 11:05 AM, Ian Buss <ianjb...@gmail.com> wrote:
>>
>> If this works for everyone, here's a zoom link for Thursday 8AM PST:
>> https://cloudera.zoom.us/j/99928254235?pwd=VTFlQk4vQjQ5Z2JzUDM3ZWZKKy9MQT09
>>
>> Happy to move or use an alternate method as needed.
>>
>> On Wed, Apr 14, 2021 at 6:58 PM Daniel Imberman <
>> daniel.imber...@gmail.com> wrote:
>>
>>> Thursday works for me!
>>>
>>> On Wed, Apr 14, 2021 at 10:05 AM, Ian Buss <ianjb...@gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> I actually can’t do Wednesday next week as I’m moving house :) Any
>>> chance we could do Thursday or Friday at the same time?
>>>
>>> Cheers
>>>
>>> Ian
>>> On 14 Apr 2021, 17:49 +0100, Kaxil Naik <kaxiln...@gmail.com>, wrote:
>>>
>>> Just a few comments here:
>>>
>>> Currently -- at least for the foreseeable future -- Airflow workers will
>>> need access to the DAG files, so workers cannot run using the serialized
>>> DAGs.
>>>
>>> Also, serialized DAGs do not even have all the info needed to run them.
>>> Currently the serialization happens in the parsing process in the
>>> scheduler, which could in future be split out as a separate "parsing"
>>> component, but that won't solve the "isolation" problem you are trying
>>> to solve.
>>> The only current way it can be solved is pickling -- and we have
>>> strictly decided against using pickling for DAGs.
>>>
>>> The ideas in statements (2) and (3) would help solve the isolation
>>> problem in (1) and can be done with some work now.
>>>
>>> Happy to talk about it in more detail here or on a call; the time Daniel
>>> suggested works for me.
>>>
>>> Regards,
>>> Kaxil
>>>
>>> On Wed, Apr 14, 2021 at 5:35 PM Daniel Imberman <
>>> daniel.imber...@gmail.com> wrote:
>>>
>>>> How about Wednesday, April 21 at 8:00AM PST?
>>>>
>>>> On Wed, Apr 14, 2021 at 9:33 AM, Xinbin Huang <bin.huan...@gmail.com>
>>>> wrote:
>>>>
>>>> I am available any day.
>>>>
>>>> On Wed, Apr 14, 2021, 9:32 AM Daniel Imberman <
>>>> daniel.imber...@gmail.com> wrote:
>>>>
>>>>> Hi everyone!
>>>>>
>>>>> Would people be available around 8AM/9AM PST at some point next week?
>>>>> I’m in PST and Ian is UTC+1, so it would be great to find a time that
>>>>> accommodates everyone.
>>>>>
>>>>> Daniel
>>>>> On Wed, Apr 14, 2021 at 6:26 AM, Ryan Hatter <ryannhat...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> I’d also like to be added please :)
>>>>>
>>>>> On Apr 13, 2021, at 21:27, Xinbin Huang <bin.huan...@gmail.com> wrote:
>>>>>
>>>>>
>>>>> Hi Daniel & Ian,
>>>>>
>>>>> I am also interested in the idea of a serialization representation
>>>>> that can be executed by workers directly. Can you also add me to the
>>>>> call?
>>>>>
>>>>> Thanks
>>>>> Bin
>>>>>
>>>>> On Tue, Apr 13, 2021 at 2:49 PM Ian Buss <ianjb...@gmail.com> wrote:
>>>>>
>>>>>> Daniel,
>>>>>>
>>>>>> Thanks for your warm welcome and quick response and the advice on
>>>>>> providers! Will certainly check out the examples you sent.
>>>>>>
>>>>>> 1. An "airflow register" command definitely sounds promising, would
>>>>>> love to collaborate on an AIP there so let's set something up.
>>>>>> 2. We use KubernetesExecutor exclusively as well.
>>>>>> We've noticed significant additional load on the metadata DB as we
>>>>>> scale up task pods, so I've also thought about an API-based approach.
>>>>>> Such an API could also open up the possibility of per-task security
>>>>>> tokens which are injected by the scheduler, which should improve the
>>>>>> security of such a system. Food for thought at least. I will start
>>>>>> putting some of these thoughts down on paper in a sharable format.
>>>>>>
>>>>>> Ian
>>>>>>
>>>>>> On Tue, Apr 13, 2021 at 7:46 PM Daniel Imberman <
>>>>>> daniel.imber...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Ian,
>>>>>>>
>>>>>>>
>>>>>>> Firstly, welcome to the Airflow community :). I'm glad to hear
>>>>>>> you've had a positive experience so far. It's great to hear that you
>>>>>>> want to contribute back, and I think that multi-tenancy/DAG isolation
>>>>>>> is a pretty fantastic project for the community as a whole (a lot of
>>>>>>> these are things we want but are limited by hours in a day).
>>>>>>>
>>>>>>>
>>>>>>> 1. I've personally been kicking around some ideas lately about an
>>>>>>> "airflow register" command that would write the DAG into the metadata
>>>>>>> DB in a way that could be "gettable" by the workers via the API. This
>>>>>>> work is very early. I'd love to get some help on it. Perhaps we can
>>>>>>> set up a zoom chat to discuss drafting an AIP?
>>>>>>>
>>>>>>>
>>>>>>> 2. Limiting worker access to the DB is not only good security
>>>>>>> practice; it also opens up the door to a lot of valuable features.
>>>>>>> This feature would be especially close to my heart as it would make
>>>>>>> the KubernetesExecutor significantly more efficient. It should be
>>>>>>> possible to set up a system where the workers only ever speak to an
>>>>>>> API server and never need to touch the DB.
>>>>>>>
>>>>>>>
>>>>>>> 3. This is not something I personally have insight into, but I think
>>>>>>> it sounds like a good idea.
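
The per-task security token idea mentioned above (a token injected by the scheduler and checked by an API server, so that workers never touch the DB) could look roughly like the sketch below. This is only an illustration under stated assumptions: issue_task_token and verify_task_token are hypothetical names, nothing here is an existing Airflow API, and a real design would also need key rotation and transport security.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def issue_task_token(secret: bytes, dag_id: str, task_id: str,
                     ttl_seconds: int = 3600,
                     now: Optional[float] = None) -> str:
    """Scheduler side (hypothetical): sign a claim scoped to one task."""
    issued_at = time.time() if now is None else now
    claim = {"dag_id": dag_id, "task_id": task_id,
             "exp": issued_at + ttl_seconds}
    payload = base64.urlsafe_b64encode(
        json.dumps(claim, sort_keys=True).encode())
    signature = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # The urlsafe base64 alphabet has no ".", so it is a safe separator.
    return payload.decode() + "." + signature


def verify_task_token(secret: bytes, token: str, dag_id: str, task_id: str,
                      now: Optional[float] = None) -> bool:
    """API-server side (hypothetical): accept only the task it was issued to."""
    payload, _, signature = token.partition(".")
    expected = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison; reject before decoding an unsigned payload.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return False
    claim = json.loads(base64.urlsafe_b64decode(payload.encode()))
    current = time.time() if now is None else now
    return (claim["dag_id"] == dag_id
            and claim["task_id"] == task_id
            and current < claim["exp"])
```

With something like this, the API server could refuse a status update or connection lookup from a pod whose token was issued for a different task, or whose token has expired.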
>>>>>>>
>>>>>>>
>>>>>>> Finally, addressing your question about a Cloudera provider. If
>>>>>>> anything, it would probably give the provider _more_ legitimacy if you
>>>>>>> hosted it under the Cloudera GitHub org (we very purposely created the
>>>>>>> provider packages with this workflow in mind). There are multiple
>>>>>>> places where we can work to surface this provider so it is easy to
>>>>>>> find and use.
>>>>>>>
>>>>>>>
>>>>>>> Astronomer has a pretty good sample provider here
>>>>>>> <https://github.com/astronomer/airflow-provider-sample>. One
>>>>>>> example of it running in the wild is the Great Expectations provider
>>>>>>> here
>>>>>>> <https://github.com/great-expectations/airflow-provider-great-expectations>.
>>>>>>> I'd also be glad to get you in contact with people who have built
>>>>>>> providers in the past to help you with that process.
>>>>>>>
>>>>>>>
>>>>>>> Looking forward to seeing some of these things come to fruition!
>>>>>>>
>>>>>>>
>>>>>>> Daniel
>>>>>>>
>>>>>>> On Tue, Apr 13, 2021 at 9:43 AM, Ian Buss <ianjb...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> First a quick introduction: I'm an engineer with Cloudera working on
>>>>>>> our Data Engineering product (CDE). Airflow is working great for us so
>>>>>>> far. We've been looking into how we can enhance the multi-tenancy
>>>>>>> story of Apache Airflow as we currently deploy it. We have the
>>>>>>> following areas which we'd like (with community consensus) to work on
>>>>>>> and contribute back to Apache Airflow to enhance the isolation between
>>>>>>> tenants in a single Airflow deployment.
>>>>>>>
>>>>>>> 1. Isolating code execution and parsing of DAG files. At the moment,
>>>>>>> DAG files are parsed in a few locations in Airflow, including the
>>>>>>> scheduler and in tasks.
>>>>>>> There is already the concept of DAG serialization (and we're
>>>>>>> using that for the web component) but we'd be interested to see if we
>>>>>>> can sandbox the execution of arbitrary user code to a locked down
>>>>>>> process/container without full access to the metadata DB and
>>>>>>> connection secrets etc. The idea would be to parse and serialize the
>>>>>>> DAG in this isolated container and pass back a serialized
>>>>>>> representation for persistence in the DB. Has anyone explored this
>>>>>>> idea?
>>>>>>>
>>>>>>> 2. Limiting task access to the metadata DB. It would be great if we
>>>>>>> could remove the requirement for tasks to have full access to the
>>>>>>> metadata DB and to report task status in a different (but still
>>>>>>> scalable) way. We'd need to tackle access or injection of connection,
>>>>>>> variable and xcom data as well for each task naturally.
>>>>>>>
>>>>>>> 3. Finer-grained access controls on connection secrets. Right now,
>>>>>>> although there are nice at-rest encryption options with Fernet or
>>>>>>> Vault, IIUC any DAG can access any connection (and thus any secret).
>>>>>>> Since the "run as" user is largely defined within the DAG and its
>>>>>>> tasks, this is challenging for a multi-tenant environment (see caveat
>>>>>>> below).
>>>>>>>
>>>>>>> Caveat: It's definitely noted that to some extent we should assume
>>>>>>> that an Airflow deployment is a "trusted" environment and that best
>>>>>>> practices such as git+PR workflows are the gold standard and that any
>>>>>>> malicious code and dependencies should be identified through this
>>>>>>> process. Also that there is a clear admin role for connection
>>>>>>> management etc.
>>>>>>>
>>>>>>> We have some ideas informally sketched out as to how to address the
>>>>>>> above but would be keen to hear the community opinion on this and to
>>>>>>> see if anyone is keen to collaborate on designs and implementation, or
>>>>>>> to hear if anything is already in the works. In particular I noticed
>>>>>>> that the very first improvement proposal (AIP-1) addresses much of the
>>>>>>> above :). However, it seems fairly dormant at the moment.
>>>>>>>
>>>>>>> One other question: we have a provider (operators and hooks) for
>>>>>>> interacting with Cloudera components that we'd like to contribute to
>>>>>>> the project. The provider FAQs indicate that new provider
>>>>>>> contributions are still welcome in the project in 2.x, is that
>>>>>>> accurate?
>>>>>>>
>>>>>>> Thanks in advance!
>>>>>>>
>>>>>>> Ian
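
For readers following the thread, the first item in the original proposal (parse the DAG file in an isolated process and pass back only a serialized representation) might be sketched as below. This is a rough illustration, not Airflow's actual serialization: parse_in_subprocess and the DAG_SPEC variable are hypothetical names, the "serialized DAG" is just a JSON dict, and a plain subprocess is not a real sandbox (the thread suggests a locked-down container). The only isolation shown is that environment variables starting with AIRFLOW, where Airflow reads config and connection URIs, are withheld from the child. As Kaxil notes in the thread, a serialized form like this would not by itself be enough for workers to run tasks.

```python
import json
import os
import subprocess
import sys

# Code run by the child interpreter: execute the user's file, silence anything
# it prints, then emit only the JSON form of its (hypothetical) DAG_SPEC dict.
_CHILD = """
import contextlib, io, json, runpy, sys
with contextlib.redirect_stdout(io.StringIO()):
    namespace = runpy.run_path(sys.argv[1])
json.dump(namespace["DAG_SPEC"], sys.stdout)
"""


def parse_in_subprocess(dag_file: str, timeout: float = 30.0) -> dict:
    """Run user DAG code in a separate process and return its JSON spec."""
    # Withhold Airflow settings and connection secrets from the user code.
    env = {k: v for k, v in os.environ.items() if not k.startswith("AIRFLOW")}
    result = subprocess.run(
        [sys.executable, "-I", "-c", _CHILD, dag_file],  # -I: isolated mode
        env=env,
        capture_output=True,
        text=True,
        timeout=timeout,  # a hung or looping DAG file cannot block the parent
        check=True,       # a crashing DAG file raises CalledProcessError
    )
    return json.loads(result.stdout)
```

The parent process only ever sees JSON text, so it could persist the result without importing or executing any user code itself.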