Hi all,

I actually can’t do Wednesday next week as I’m moving house :) Any chance we 
could do Thursday or Friday at the same time?

Cheers

Ian
On 14 Apr 2021, 17:49 +0100, Kaxil Naik <kaxiln...@gmail.com>, wrote:
> Just a few comments here:
>
> Currently, and at least for the foreseeable future, Airflow workers will need 
> access to the DAG files, so workers cannot run using the serialized DAGs.
>
> Also, serialized DAGs do not even have all the info needed to run them. 
> Currently the serialization happens in the parsing process in the scheduler, 
> which could in future be split out into a separate "parsing" component, but 
> that won't solve the "isolation" problem you are trying to solve. The only 
> current way it can be solved is pickling -- and we have strictly decided 
> against using pickling for DAGs.
>
> The idea in Statement (2) & (3) would help solve the isolation problem in (1) 
> and can be done with some work now.
>
> Happy to talk about it in more detail here or on a call; the time Daniel 
> suggested works for me.
>
> Regards,
> Kaxil
>
> > On Wed, Apr 14, 2021 at 5:35 PM Daniel Imberman <daniel.imber...@gmail.com> 
> > wrote:
> > > How about Wednesday, April 21 at 8:00AM PST?
> > >
> > > On Wed, Apr 14, 2021 at 9:33 AM, Xinbin Huang <bin.huan...@gmail.com> 
> > > wrote:
> > > > I am available any day.
> > > >
> > > > > On Wed, Apr 14, 2021, 9:32 AM Daniel Imberman 
> > > > > <daniel.imber...@gmail.com> wrote:
> > > > > > Hi everyone!
> > > > > >
> > > > > > Would people be available around 8AM/9AM PST at some point next 
> > > > > > week? I’m in PST and Ian is UTC+1, so it would be great to find a 
> > > > > > time that accommodates everyone.
> > > > > >
> > > > > > Daniel
> > > > > > On Wed, Apr 14, 2021 at 6:26 AM, Ryan Hatter 
> > > > > > <ryannhat...@gmail.com> wrote:
> > > > > > > I’d also like to be added please :)
> > > > > > >
> > > > > > > > On Apr 13, 2021, at 21:27, Xinbin Huang <bin.huan...@gmail.com> 
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > Hi Daniel & Ian,
> > > > > > > >
> > > > > > > > I am also interested in the idea of a serialization 
> > > > > > > > representation that can be executed by workers directly. Can 
> > > > > > > > you also add me to the call?
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Bin
> > > > > > > >
> > > > > > > > > On Tue, Apr 13, 2021 at 2:49 PM Ian Buss <ianjb...@gmail.com> 
> > > > > > > > > wrote:
> > > > > > > > > > Daniel,
> > > > > > > > > >
> > > > > > > > > > Thanks for your warm welcome and quick response and the 
> > > > > > > > > > advice on providers! Will certainly check out the examples 
> > > > > > > > > > you sent.
> > > > > > > > > >
> > > > > > > > > > 1. An "airflow register" command definitely sounds 
> > > > > > > > > > promising, would love to collaborate on an AIP there so 
> > > > > > > > > > let's set something up.
> > > > > > > > > > 2. We use KubernetesExecutor exclusively as well. We've 
> > > > > > > > > > noticed significant additional load on the metadata DB as 
> > > > > > > > > > we scale up task pods so I've also thought about an 
> > > > > > > > > > API-based approach. Such an API could also open up the 
> > > > > > > > > > possibility of per-task security tokens which are injected 
> > > > > > > > > > by the scheduler, which should improve the security of such 
> > > > > > > > > > a system. Food for thought at least. I will start putting 
> > > > > > > > > > some of these thoughts down on paper in a sharable format.
> > > > > > > > > >
> > > > > > > > > > Ian
> > > > > > > > > >
> > > > > > > > > > > On Tue, Apr 13, 2021 at 7:46 PM Daniel Imberman 
> > > > > > > > > > > <daniel.imber...@gmail.com> wrote:
> > > > > > > > > > > > Hi Ian,
> > > > > > > > > > > >
> > > > > > > > > > > > Firstly, welcome to the Airflow community :). I'm glad 
> > > > > > > > > > > > to hear you've had a positive experience so far. It's 
> > > > > > > > > > > > great to hear that you want to contribute back, and I 
> > > > > > > > > > > > think that multi-tenancy/DAG isolation is a pretty 
> > > > > > > > > > > > fantastic project for the community as a whole (a lot 
> > > > > > > > > > > > of these are things we want, but we are limited by the 
> > > > > > > > > > > > hours in a day).
> > > > > > > > > > > >
> > > > > > > > > > > > 1. I've personally been kicking around some ideas 
> > > > > > > > > > > > lately about an "airflow register" command that would 
> > > > > > > > > > > > write the DAG into the metadata DB in a way that could 
> > > > > > > > > > > > be "gettable" by the workers via the API. This work is 
> > > > > > > > > > > > very early. I'd love to get some help on it. Perhaps we 
> > > > > > > > > > > > can set up a zoom chat to discuss drafting an AIP?
> > > > > > > > > > > >
> > > > > > > > > > > > 2. Limiting worker access to the DB is not only good 
> > > > > > > > > > > > security practice; it also opens up the door to a lot 
> > > > > > > > > > > > of valuable features. This feature would be especially 
> > > > > > > > > > > > close to my heart as it would make the 
> > > > > > > > > > > > KubernetesExecutor significantly more efficient. It 
> > > > > > > > > > > > should be possible to set up a system where the workers 
> > > > > > > > > > > > only ever speak to an API server and never need to 
> > > > > > > > > > > > touch the DB.
> > > > > > > > > > > >
> > > > > > > > > > > > 3. This is not something I personally have insight 
> > > > > > > > > > > > into, but I think it sounds like a good idea.
> > > > > > > > > > > >
> > > > > > > > > > > > Finally, addressing your question about a Cloudera 
> > > > > > > > > > > > provider. If anything, it would probably give the 
> > > > > > > > > > > > provider _more_ legitimacy if you hosted it under the 
> > > > > > > > > > > > Cloudera GitHub org (we very purposely created the 
> > > > > > > > > > > > provider packages with this workflow in mind). There 
> > > > > > > > > > > > are multiple places where we can work to surface this 
> > > > > > > > > > > > provider so it is easy to find and use.
> > > > > > > > > > > >
> > > > > > > > > > > > Astronomer has a pretty good sample provider here. One 
> > > > > > > > > > > > example of it running in the wild is the Great 
> > > > > > > > > > > > Expectations provider here. I'd also be glad to get you 
> > > > > > > > > > > > in contact with people who have built providers in the 
> > > > > > > > > > > > past to help you with that process.
> > > > > > > > > > > >
> > > > > > > > > > > > Looking forward to seeing some of these things come to 
> > > > > > > > > > > > fruition!
> > > > > > > > > > > >
> > > > > > > > > > > > Daniel
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Apr 13, 2021 at 9:43 AM, Ian Buss 
> > > > > > > > > > > > <ianjb...@gmail.com> wrote:
> > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > >
> > > > > > > > > > > > > First a quick introduction: I'm an engineer with 
> > > > > > > > > > > > > Cloudera working on our Data Engineering product 
> > > > > > > > > > > > > (CDE). Airflow is working great for us so far. We've 
> > > > > > > > > > > > > been looking into how we can enhance the 
> > > > > > > > > > > > > multi-tenancy story of Apache Airflow as we currently 
> > > > > > > > > > > > > deploy it. We have the following areas which we'd 
> > > > > > > > > > > > > like (with community consensus) to work on and 
> > > > > > > > > > > > > contribute back to Apache Airflow to enhance the 
> > > > > > > > > > > > > isolation between tenants in a single Airflow 
> > > > > > > > > > > > > deployment.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. Isolating code execution and parsing of DAG files. 
> > > > > > > > > > > > > At the moment, DAG files are parsed in a few 
> > > > > > > > > > > > > locations in Airflow, including the scheduler and in 
> > > > > > > > > > > > > tasks. There is already the concept of DAG 
> > > > > > > > > > > > > serialization (and we're using that for the web 
> > > > > > > > > > > > > component) but we'd be interested to see if we can 
> > > > > > > > > > > > > sandbox the execution of arbitrary user code to a 
> > > > > > > > > > > > > locked down process/container without full access to 
> > > > > > > > > > > > > the metadata DB and connection secrets etc. The idea 
> > > > > > > > > > > > > would be to parse and serialize the DAG in this 
> > > > > > > > > > > > > isolated container and pass back a serialized 
> > > > > > > > > > > > > representation for persistence in the DB. Has anyone 
> > > > > > > > > > > > > explored this idea?
> > > > > > > > > > > > >
> > > > > > > > > > > > > 2. Limiting task access to the metadata DB. It would 
> > > > > > > > > > > > > be great if we could remove the requirement for tasks 
> > > > > > > > > > > > > to have full access to the metadata DB and to report 
> > > > > > > > > > > > > task status in a different (but still scalable) way. 
> > > > > > > > > > > > > We'd need to tackle access or injection of 
> > > > > > > > > > > > > connection, variable and xcom data as well for each 
> > > > > > > > > > > > > task naturally.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 3. Finer-grained access controls on connection 
> > > > > > > > > > > > > secrets. Right now, although there are nice at-rest 
> > > > > > > > > > > > > encryption options with Fernet or Vault, IIUC any DAG 
> > > > > > > > > > > > > can access any connection (and thus any secret). 
> > > > > > > > > > > > > Since the "run as" user is largely defined within the 
> > > > > > > > > > > > > DAG and its tasks, this is challenging for a 
> > > > > > > > > > > > > multi-tenant environment (see caveat below).
> > > > > > > > > > > > >
> > > > > > > > > > > > > Caveat: It's definitely noted that to some extent we 
> > > > > > > > > > > > > should assume that an Airflow deployment is a 
> > > > > > > > > > > > > "trusted" environment and that best practices such as 
> > > > > > > > > > > > > git+PR workflows are the gold standard and that any 
> > > > > > > > > > > > > malicious code and dependencies should be identified 
> > > > > > > > > > > > > through this process. Also that there is a clear 
> > > > > > > > > > > > > admin role for connection management etc.
> > > > > > > > > > > > >
> > > > > > > > > > > > > We have some ideas informally sketched out as to how 
> > > > > > > > > > > > > to address the above but would be keen to hear the 
> > > > > > > > > > > > > community opinion on this and to see if anyone is 
> > > > > > > > > > > > > keen to collaborate on designs and implementation, or 
> > > > > > > > > > > > > to hear if anything is already in the works. In 
> > > > > > > > > > > > > particular I noticed that the very first improvement 
> > > > > > > > > > > > > proposal (AIP-1) addresses much of the above :). 
> > > > > > > > > > > > > However, it seems fairly dormant at the moment.
> > > > > > > > > > > > >
> > > > > > > > > > > > > One other question: we have a provider (operators and 
> > > > > > > > > > > > > hooks) for interacting with Cloudera components that 
> > > > > > > > > > > > > we'd like to contribute to the project. The provider 
> > > > > > > > > > > > > FAQs indicate that new provider contributions are 
> > > > > > > > > > > > > still welcome in the project in 2.x, is that accurate?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks in advance!
> > > > > > > > > > > > >
> > > > > > > > > > > > > Ian
