I'll join as well (I believe the zoom link will work without an invite)

On Wed, Apr 21, 2021 at 10:48 AM Dimitris Stafylarakis <xan...@gmail.com>
wrote:

> hi all,
>
> great to read about this, I'd like to join in! Can I just join using the
> zoom link tomorrow or do I need an invitation? (If I do need one, please
> invite me :))
>
> cheers
>
>
> On Wed, Apr 14, 2021 at 8:15 PM Daniel Imberman <daniel.imber...@gmail.com>
> wrote:
>
>> Thank you Ian,
>>
>> I’ve invited everyone on this thread to the meeting with that zoom link.
>> Anyone else who wants to join can add the calendar event here
>> <https://calendar.google.com/event?action=TEMPLATE&tmeid=Mm4zN2Q3MnFwNnBqbW9hMmNocXMyNzJpdHYgZGFuaWVsQGFzdHJvbm9tZXIuaW8&tmsrc=daniel%40astronomer.io>
>>
>> On Wed, Apr 14, 2021 at 11:05 AM, Ian Buss <ianjb...@gmail.com> wrote:
>>
>> If this works for everyone, here's a zoom link for Thursday 8AM PST:
>> https://cloudera.zoom.us/j/99928254235?pwd=VTFlQk4vQjQ5Z2JzUDM3ZWZKKy9MQT09
>>
>> Happy to move or use an alternate method as needed.
>>
>> On Wed, Apr 14, 2021 at 6:58 PM Daniel Imberman <
>> daniel.imber...@gmail.com> wrote:
>>
>>> Thursday works for me!
>>>
>>> On Wed, Apr 14, 2021 at 10:05 AM, Ian Buss <ianjb...@gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> I actually can’t do Wednesday next week as I’m moving house :) Any
>>> chance we could do Thursday or Friday at the same time?
>>>
>>> Cheers
>>>
>>> Ian
>>> On 14 Apr 2021, 17:49 +0100, Kaxil Naik <kaxiln...@gmail.com>, wrote:
>>>
>>> Just a few comments here:
>>>
>>> Currently, and at least for the foreseeable future, Airflow workers
>>> will need access to the DAG files, so workers cannot run using the
>>> serialized DAGs.
>>>
>>> Also, serialized DAGs do not even have all the info needed to run
>>> them. Currently the serialization happens in the parsing process in the
>>> scheduler, which could in future be split out as a separate "parsing"
>>> component, but that won't solve the "isolation" problem you are trying
>>> to solve. The only current way it can be solved is pickling, and we
>>> have strictly decided against using pickling for DAGs.
>>>
>>> The ideas in statements (2) and (3) would help solve the isolation
>>> problem in (1) and can be done with some work now.
>>>
>>> Happy to talk about it in more detail here or on a call; the time
>>> Daniel suggested works for me.
>>>
>>> Regards,
>>> Kaxil
>>>
>>> On Wed, Apr 14, 2021 at 5:35 PM Daniel Imberman <
>>> daniel.imber...@gmail.com> wrote:
>>>
>>>> How about Wednesday, April 21 at 8:00AM PST?
>>>>
>>>> On Wed, Apr 14, 2021 at 9:33 AM, Xinbin Huang <bin.huan...@gmail.com>
>>>> wrote:
>>>>
>>>> I am available any day.
>>>>
>>>> On Wed, Apr 14, 2021, 9:32 AM Daniel Imberman <
>>>> daniel.imber...@gmail.com> wrote:
>>>>
>>>>> Hi everyone!
>>>>>
>>>>> Would people be available around 8AM/9AM PST at some point next week?
>>>>> I’m in PST and Ian is UTC+1, so it would be great to find a time that
>>>>> accommodates everyone.
>>>>>
>>>>> Daniel
>>>>> On Wed, Apr 14, 2021 at 6:26 AM, Ryan Hatter <ryannhat...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> I’d also like to be added please :)
>>>>>
>>>>> On Apr 13, 2021, at 21:27, Xinbin Huang <bin.huan...@gmail.com> wrote:
>>>>>
>>>>> 
>>>>> Hi Daniel & Ian,
>>>>>
>>>>> I am also interested in the idea of a serialized representation
>>>>> that can be executed by workers directly. Can you also add me to the call?
>>>>>
>>>>> Thanks
>>>>> Bin
>>>>>
>>>>> On Tue, Apr 13, 2021 at 2:49 PM Ian Buss <ianjb...@gmail.com> wrote:
>>>>>
>>>>>> Daniel,
>>>>>>
>>>>>> Thanks for your warm welcome and quick response and the advice on
>>>>>> providers! Will certainly check out the examples you sent.
>>>>>>
>>>>>> 1. An "airflow register" command definitely sounds promising, would
>>>>>> love to collaborate on an AIP there so let's set something up.
>>>>>> 2. We use KubernetesExecutor exclusively as well. We've noticed
>>>>>> significant additional load on the metadata DB as we scale up task
>>>>>> pods, so I've also thought about an API-based approach. Such an API
>>>>>> could also open up the possibility of per-task security tokens which
>>>>>> are injected by the scheduler, which should improve the security of
>>>>>> such a system. Food for thought at least. I will start putting some
>>>>>> of these thoughts down on paper in a sharable format.
>>>>>>
>>>>>> Ian
>>>>>>
>>>>>> On Tue, Apr 13, 2021 at 7:46 PM Daniel Imberman <
>>>>>> daniel.imber...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Ian,
>>>>>>>
>>>>>>>
>>>>>>> Firstly, welcome to the Airflow community :). I'm glad to hear
>>>>>>> you've had a positive experience so far. It's great to hear that you
>>>>>>> want to contribute back, and I think that multi-tenancy/DAG isolation
>>>>>>> is a pretty fantastic project for the community as a whole (a lot of
>>>>>>> these are things we want but are limited by hours in a day).
>>>>>>>
>>>>>>>
>>>>>>> 1. I've personally been kicking around some ideas lately about an
>>>>>>> "airflow register" command that would write the DAG into the metadata
>>>>>>> DB in a way that could be "gettable" by the workers via the API. This
>>>>>>> work is very early. I'd love to get some help on it. Perhaps we can
>>>>>>> set up a zoom chat to discuss drafting an AIP?
>>>>>>>
>>>>>>>
>>>>>>> 2. Limiting worker access to the DB is not only good security
>>>>>>> practice; it also opens up the door to a lot of valuable features.
>>>>>>> This feature would be especially close to my heart as it would make
>>>>>>> the KubernetesExecutor significantly more efficient. It should be
>>>>>>> possible to set up a system where the workers only ever speak to an
>>>>>>> API server and never need to touch the DB.
>>>>>>>
>>>>>>>
>>>>>>> 3. This is not something I personally have insight into, but I think
>>>>>>> it sounds like a good idea.
>>>>>>>
>>>>>>>
>>>>>>> Finally, to address your question about a Cloudera provider: if
>>>>>>> anything, it would probably give the provider _more_ legitimacy if
>>>>>>> you hosted it under the Cloudera GitHub org (we very purposely
>>>>>>> created the provider packages with this workflow in mind). There are
>>>>>>> multiple places where we can work to surface this provider so it is
>>>>>>> easy to find and use.
>>>>>>>
>>>>>>>
>>>>>>> Astronomer has a pretty good sample provider here
>>>>>>> <https://github.com/astronomer/airflow-provider-sample>. One
>>>>>>> example of it running in the wild is the Great Expectations provider
>>>>>>> here
>>>>>>> <https://github.com/great-expectations/airflow-provider-great-expectations>.
>>>>>>> I'd also be glad to get you in contact with people who have built
>>>>>>> providers in the past to help you with that process.
>>>>>>>
>>>>>>>
>>>>>>> Looking forward to seeing some of these things come to fruition!
>>>>>>>
>>>>>>>
>>>>>>> Daniel
>>>>>>>
>>>>>>> On Tue, Apr 13, 2021 at 9:43 AM, Ian Buss <ianjb...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> First a quick introduction: I'm an engineer with Cloudera working on
>>>>>>> our Data Engineering product (CDE). Airflow is working great for us
>>>>>>> so far. We've been looking into how we can enhance the multi-tenancy
>>>>>>> story of Apache Airflow as we currently deploy it. We have the
>>>>>>> following areas which we'd like (with community consensus) to work on
>>>>>>> and contribute back to Apache Airflow to enhance the isolation
>>>>>>> between tenants in a single Airflow deployment.
>>>>>>>
>>>>>>> 1. Isolating code execution and parsing of DAG files. At the moment,
>>>>>>> DAG files are parsed in a few locations in Airflow, including the
>>>>>>> scheduler and in tasks. There is already the concept of DAG
>>>>>>> serialization (and we're using that for the web component) but we'd
>>>>>>> be interested to see if we can sandbox the execution of arbitrary
>>>>>>> user code to a locked down process/container without full access to
>>>>>>> the metadata DB and connection secrets etc. The idea would be to
>>>>>>> parse and serialize the DAG in this isolated container and pass back
>>>>>>> a serialized representation for persistence in the DB. Has anyone
>>>>>>> explored this idea?
>>>>>>>
>>>>>>> 2. Limiting task access to the metadata DB. It would be great if we
>>>>>>> could remove the requirement for tasks to have full access to the
>>>>>>> metadata DB and to report task status in a different (but still
>>>>>>> scalable) way. Naturally, we'd also need to tackle access to, or
>>>>>>> injection of, connection, variable and XCom data for each task.
>>>>>>>
>>>>>>> 3. Finer-grained access controls on connection secrets. Right now,
>>>>>>> although there are nice at-rest encryption options with Fernet or Vault,
>>>>>>> IIUC any DAG can access any connection (and thus any secret). Since the
>>>>>>> "run as" user is largely defined within the DAG and its tasks, this is
>>>>>>> challenging for a multi-tenant environment (see caveat below)
>>>>>>>
>>>>>>> Caveat: It's definitely noted that to some extent we should assume
>>>>>>> that an Airflow deployment is a "trusted" environment and that best
>>>>>>> practices such as git+PR workflows are the gold standard and that any
>>>>>>> malicious code and dependencies should be identified through this
>>>>>>> process. Also that there is a clear admin role for connection
>>>>>>> management etc.
>>>>>>>
>>>>>>> We have some ideas informally sketched out as to how to address the
>>>>>>> above but would be keen to hear the community opinion on this and to
>>>>>>> see if anyone is keen to collaborate on designs and implementation,
>>>>>>> or to hear if anything is already in the works. In particular I
>>>>>>> noticed that the very first improvement proposal (AIP-1) addresses
>>>>>>> much of the above :). However, it seems fairly dormant at the moment.
>>>>>>>
>>>>>>> One other question: we have a provider (operators and hooks) for
>>>>>>> interacting with Cloudera components that we'd like to contribute to the
>>>>>>> project. The provider FAQs indicate that new provider contributions are
>>>>>>> still welcome in the project in 2.x, is that accurate?
>>>>>>>
>>>>>>> Thanks in advance!
>>>>>>>
>>>>>>> Ian
>>>>>>>
>>>>>>>
