Clarifying: There is not (and never has been) a problem with opening up the submission of "structured" DAGs.
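To make the "structured, not Python" distinction concrete, here is a rough sketch of what registering a dataset as a pure metadata object over the REST API could look like. The endpoint, payload shape and auth below are invented purely for illustration - nothing like this exists in Airflow today, and whatever lands via the PRs under discussion may well look different:

import requests

# Purely hypothetical call - a "create dataset" endpoint is only being
# discussed (see https://github.com/apache/airflow/pull/36929); the path,
# payload and auth here are illustrative, not the actual Airflow REST API.
resp = requests.post(
    "https://tenant-2.example.com/api/v1/datasets",  # made-up tenant URL
    auth=("api_user", "api_password"),
    json={
        "uri": "s3://warehouse/orders/daily",  # standardized dataset URI
        "extra": {"owner": "team-data", "description": "daily orders export"},
    },
)
resp.raise_for_status()
print(resp.json())

The same argument applies to a whole "structured" DAG: a serialized DAG structure is also just data, so it could in principle be submitted the same way - the hard parts are everything around it (callbacks, timetables, ownership of the record), not the transport.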
On Tue, Jan 23, 2024 at 2:12 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>
> I always assumed that this was the reason why it's impossible to create dags from API, no one wanted to open this particular can of worms. I think if you need to synchronize these objects, the cleaner way would be to describe them in some sort of a shared config file and let respective dag-processors create them independently of each other.
>
> Just to clarify this one: creating DAGs via API has been resented mostly because of security reasons - where you would want to submit Python DAG code via API. There is (and it has never been) a problem with opening up submitting "structured" DAGs. This has never been implemented, but if you would like to limit it to just modifying or creating the resulting DAG structure, that would be possible - for example there is no fundamental problem with generating a DAG from (say) a visual representation and submitting the resulting DAG structure without creating a DAG Python file (so essentially playing the role of the DAG file processor to serialize DAGs). It would have a number of limitations (for example callbacks would not work, timetables would be a challenge, etc.), but other than that it's quite possible (and possibly even in the future we might have something like that).
>
> Following that - there are no fundamental problems with submitting datasets - because they are not Python code, they are pure "metadata" objects.
>
> Still, the question of how this plays with DAG-created datasets remains an important aspect of the proposal.
>
> J.
>
> On Tue, Jan 23, 2024 at 2:01 PM Tornike Gurgenidze <togur...@freeuni.edu.ge> wrote:
>
>> Maybe I'm missing something, but I can't see how REST endpoints for datasets could work in practice. afaik, Airflow has some objects that can be created by a dag processor (Dags, Datasets) and others that can be created with API/UI (Connections, Variables), but never both at the same time. How would update/delete endpoints work if a Dataset was initially created declaratively from a dag file? Would it throw an exception or make an update that will then be reverted in a little while by a dag-processor anyway?
>>
>> I always assumed that this was the reason why it's impossible to create dags from API, no one wanted to open this particular can of worms. I think if you need to synchronize these objects, the cleaner way would be to describe them in some sort of a shared config file and let respective dag-processors create them independently of each other.
>>
>> On Tue, Jan 23, 2024 at 4:02 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>> > I am also pretty cool with adding/updating datasets externally, however I know there are some ongoing discussions on how to improve/change datasets and bind them together with multiple other features of Airflow - not sure what the state of those is, but it would be great if those efforts are coordinated so that we are not pulling stuff in multiple directions.
>> >
>> > From what I've heard/overheard/noticed, the things going on around Datasets are:
>> >
>> > * AIP-60 - https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-60+Standard+URI+representation+for+Airflow+Datasets - already almost passed
>> > * Better coupling of datasets with OpenLineage
>> > * Partial datasets - allowing to have datasets with data intervals
>> > * Triggering dags on external dataset changes
>> > * Object Storage integration with datasets
>> >
>> > All of which sound very promising and are definitely important for Dataset usage.
>> >
>> > So I think we should really make sure that when we are doing anything with datasets, the people who think about/work on the aspects above have a say in those proposals/discussions - it would be a shame if we add something that partially invalidates some of the other things or makes them terribly complex to implement.
>> >
>> > I am not saying it's the case here, I am just saying that we should at least make sure that people who are currently thinking about these things aren't surprised if we merge something that will make their job harder.
>> >
>> > I am a little surprised - knowing the *thinking* happening in the dataset area that I am aware of - that there are so few comments on this one (even a "hey, looks cool - works well for the things I am thinking about") :).
>> >
>> > J.
>> >
>> > On Tue, Jan 23, 2024 at 3:53 AM Ryan Hatter <ryan.hat...@astronomer.io.invalid> wrote:
>> >
>> > > I don't think it makes sense to include the create endpoint without also including dataset update and delete endpoints and updating the Datasets view in the UI to be able to manage externally created Datasets.
>> > >
>> > > With that said, I don't think the fact that Datasets are tightly coupled with DAGs is a good reason not to include additional Dataset endpoints. It makes sense to me to be able to interact with Datasets from outside of Airflow.
>> > >
>> > > On Sat, Jan 20, 2024 at 6:13 AM Eduardo Nicastro <edu.nicas...@gmail.com> wrote:
>> > >
>> > > > Hello all, I have created a Pull Request (https://github.com/apache/airflow/pull/36929) to make it possible to create a dataset through the API as a modest step forward. This PR is open for your feedback. I'm preparing another PR to build upon the insights from https://github.com/apache/airflow/pull/29433. Your thoughts and contributions are highly encouraged.
>> > > >
>> > > > Best Regards,
>> > > > Eduardo Nicastro
>> > > >
>> > > > On Thu, Jan 11, 2024 at 4:30 PM Eduardo Nicastro <edu.nicas...@gmail.com> wrote:
>> > > >
>> > > >> Hello all,
>> > > >>
>> > > >> I'm reaching out to propose a topic for discussion that has recently emerged in our GitHub discussion threads (#36723 <https://github.com/apache/airflow/discussions/36723>). It revolves around enhancing the management of datasets in a multi-tenant Airflow architecture.
>> > > >>
>> > > >> Use case/motivation
>> > > >> In our multi-instance setup, synchronizing dataset dependencies across instances poses significant challenges. With the advent of dataset listeners, a new door has opened for cross-instance dataset awareness. I propose we explore creating endpoints to export dataset updates, to make it possible to trigger DAGs consuming from a Dataset across tenants.
>> > > >>
>> > > >> Context
>> > > >> Below I will give some context about our current situation and the solution we have in place, and propose a new workflow that would be more efficient. To be able to implement this new workflow we would need a way to export Dataset updates, as mentioned.
>> > > >>
>> > > >> Current Workflow
>> > > >> In our organization, we're dealing with multiple Airflow tenants, let's say Tenant 1 and Tenant 2, as examples. To synchronize Dataset A across these tenants, we currently have a complex setup:
>> > > >>
>> > > >>    1. Containers run on a schedule to export metadata to CosmosDB (these will be replaced by the listener).
>> > > >>    2. Additional scheduled containers pull data from CosmosDB and write it to a shared file system, enabling generated DAGs to read it and mirror a dataset across tenants.
>> > > >>
>> > > >> Proposed Workflow
>> > > >> Here's a breakdown of our proposed workflow:
>> > > >>
>> > > >>    1. Cross-Tenant Dataset Interaction: We have Dags in Tenant 1 producing Dataset A. We need a mechanism to trigger all Dags consuming Dataset A in Tenant 2. This interaction is crucial for our data pipeline's efficiency and synchronicity.
>> > > >>    2. Dataset Listener Implementation: Our approach involves implementing a Dataset listener that programmatically creates Dataset A in all tenants where it's not present (like Tenant 2) and exports Dataset updates when they happen. This would trigger an update on all Dags consuming from that Dataset.
>> > > >>    3. Standardized Dataset Names: We plan to use standardized dataset names, which makes sense since a URI is a Dataset's identifier and uniqueness is a logical requirement.
>> > > >>
>> > > >> [image: image.png]
>> > > >>
>> > > >> Why This Matters:
>> > > >>
>> > > >>    - It offers a streamlined, automated way to manage datasets across different Airflow instances.
>> > > >>    - It aligns with a need for efficient, interconnected workflows in a multi-tenant environment.
>> > > >>
>> > > >> I invite the community to discuss:
>> > > >>
>> > > >>    - Are there alternative methods within Airflow's current framework that could achieve similar goals?
>> > > >>    - Any insights or experiences that could inform our approach?
>> > > >>
>> > > >> Your feedback and suggestions are invaluable, and I look forward to a collaborative discussion.
>> > > >>
>> > > >> Best Regards,
>> > > >> Eduardo Nicastro
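P.S. For reference, here is roughly what the listener side of Eduardo's proposed workflow could look like. This is only a sketch: the on_dataset_changed hook and the plugin registration follow the dataset listener API added in Airflow 2.8 as I understand it, while the remote tenant URL, the auth and the "export" endpoint it posts to are invented for illustration - that endpoint is exactly the piece this thread is proposing to add:

# my_dataset_listener.py - sketch of a listener in Tenant 1 that forwards
# dataset events to Tenant 2. Assumes Airflow 2.8+ dataset listener hooks.
import requests
from airflow.listeners import hookimpl

TENANT_2_API = "https://tenant-2.example.com/api/v1"  # made-up remote tenant


@hookimpl
def on_dataset_changed(dataset):
    # Fired when a task in this tenant produces an update for `dataset`.
    # Posts to a *hypothetical* endpoint on the other tenant; today one would
    # have to fall back to triggering the consuming DAGs via the dagRuns API.
    requests.post(
        f"{TENANT_2_API}/datasets/events",  # not an existing endpoint (yet)
        auth=("api_user", "api_password"),
        json={"dataset_uri": dataset.uri, "extra": dict(dataset.extra or {})},
        timeout=10,
    )


# plugins/dataset_listener_plugin.py - register the listener module as a plugin
from airflow.plugins_manager import AirflowPlugin

import my_dataset_listener


class DatasetListenerPlugin(AirflowPlugin):
    name = "dataset_listener_plugin"
    listeners = [my_dataset_listener]

The standardized-URI part of the proposal matters here: the forwarded event is only meaningful if the same URI identifies the same Dataset in both tenants, which is also why AIP-60 is relevant to this discussion.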