Maybe I'm missing something, but I can't see how REST endpoints for
datasets could work in practice. AFAIK, Airflow has some objects that can
be created by a dag processor (Dags, Datasets) and others that can be
created via the API/UI (Connections, Variables), but never both at the
same time. How would update/delete endpoints work if a Dataset was
initially created declaratively from a dag file? Would they throw an
exception, or make an update that would then be reverted a little while
later by a dag processor anyway?
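
For concreteness, this is roughly what "created declaratively from a dag
file" looks like today - the dag processor (re)registers the Dataset on
every parse, so an API-side delete would be undone at the next parse
(the URI and dag here are illustrative):

    import pendulum
    from airflow.datasets import Dataset
    from airflow.decorators import dag, task

    dataset_a = Dataset("s3://my-bucket/dataset-a")  # illustrative URI

    @dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1),
         catchup=False)
    def producer():
        @task(outlets=[dataset_a])  # parsing this file registers dataset_a
        def produce():
            ...

        produce()

    producer()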

I always assumed that this was the reason why it's impossible to create
dags from the API; no one wanted to open this particular can of worms. I
think that if you need to synchronize these objects, the cleaner way would
be to describe them in some sort of shared config file and let the
respective dag processors create them independently of each other.
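
A minimal sketch of that shared-config idea, assuming a datasets.yaml
mounted into every tenant (the path and file format are hypothetical,
not anything Airflow ships):

    # /shared/datasets.yaml maps names to URIs,
    # e.g.  dataset_a: s3://my-bucket/dataset-a
    import yaml

    from airflow.datasets import Dataset

    with open("/shared/datasets.yaml") as f:  # hypothetical shared mount
        DATASETS = {name: Dataset(uri)
                    for name, uri in yaml.safe_load(f).items()}

A producer dag in Tenant 1 would then declare
outlets=[DATASETS["dataset_a"]] and a consumer dag in Tenant 2
schedule=[DATASETS["dataset_a"]], so each tenant's dag processor
registers the same URIs independently.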

On Tue, Jan 23, 2024 at 4:02 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> I am also pretty cool with adding/updating datasets externally. However,
> I know there are some ongoing discussions on how to improve/change
> datasets and bind them together with multiple other features of Airflow -
> not sure what the state of those is, but it would be great if those
> efforts were coordinated so that we are not pulling stuff in multiple
> directions.
>
> From what I've heard/overheard, the things happening around Datasets are:
>
> * AIP-60 (Standard URI representation for Airflow Datasets) - already
>   almost passed:
>   https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-60+Standard+URI+representation+for+Airflow+Datasets
> * Better coupling of datasets with OpenLineage
> * Partial datasets - allowing datasets to have data intervals
> * Triggering dags on external dataset changes
> * Object Storage integration with datasets
>
> All of which sound very promising and are definitely important for Dataset
> usage.
>
> So I think we should really make sure that, when we are doing anything
> with datasets, the people who think about/work on the aspects above have
> a say in those proposals/discussions - it would be a shame if we added
> something that partially invalidates, or makes it terribly complex to
> implement, some of the other things.
>
> I am not saying it's the case here; I am just saying that we should at
> least make sure that the people who are currently thinking about these
> things aren't surprised if we merge something that makes their job harder.
>
> I am a little surprised - knowing the *thinking* happening in the dataset
> area that I am aware of - that there are so few comments on this one (even
> just a "hey, looks cool - works well for the things I am thinking
> about") :).
>
> J.
>
>
>
>
> On Tue, Jan 23, 2024 at 3:53 AM Ryan Hatter
> <ryan.hat...@astronomer.io.invalid> wrote:
>
> > I don't think it makes sense to include the create endpoint without also
> > including dataset update and delete endpoints and updating the Datasets
> > view in the UI to be able to manage externally created Datasets.
> >
> > With that said, I don't think the fact that Datasets are tightly coupled
> > with DAGs is a good reason not to include additional Dataset endpoints.
> > It makes sense to me to be able to interact with Datasets from outside
> > of Airflow.
> >
> > On Sat, Jan 20, 2024 at 6:13 AM Eduardo Nicastro
> > <edu.nicas...@gmail.com> wrote:
> >
> > > Hello all, I have created a Pull Request
> > > (https://github.com/apache/airflow/pull/36929) to make it possible to
> > > create a dataset through the API, as a modest step forward. This PR is
> > > open for your feedback. I'm preparing another PR to build upon the
> > > insights from https://github.com/apache/airflow/pull/29433. Your
> > > thoughts and contributions are highly encouraged.
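> > >
> > > A minimal sketch of what calling such an endpoint could look like -
> > > the path, payload, and auth below are assumptions pending review of
> > > the PR:
> > >
> > >     import requests
> > >
> > >     resp = requests.post(
> > >         "https://airflow.example.com/api/v1/datasets",  # hypothetical
> > >         json={"uri": "s3://my-bucket/dataset-a"},  # illustrative URI
> > >         auth=("user", "pass"),  # stand-in for real auth
> > >         timeout=10,
> > >     )
> > >     resp.raise_for_status()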
> > >
> > > Best Regards,
> > > Eduardo Nicastro
> > >
> > > On Thu, Jan 11, 2024 at 4:30 PM Eduardo Nicastro
> > > <edu.nicas...@gmail.com> wrote:
> > >
> > >> Hello all,
> > >>
> > >> I'm reaching out to propose a topic for discussion that has recently
> > >> emerged in our GitHub discussion threads (#36723
> > >> <https://github.com/apache/airflow/discussions/36723>). It revolves
> > >> around enhancing the management of datasets in a multi-tenant Airflow
> > >> architecture.
> > >>
> > >> Use case/motivation
> > >> In our multi-instance setup, synchronizing dataset dependencies
> > >> across instances poses significant challenges. With the advent of
> > >> dataset listeners, a new door has opened for cross-instance dataset
> > >> awareness. I propose we explore creating endpoints to export dataset
> > >> updates, making it possible to trigger DAGs consuming from a Dataset
> > >> across tenants.
> > >>
> > >> Context
> > >> Below I give some context about our current situation and the
> > >> solution we have in place, and propose a new workflow that would be
> > >> more efficient. To implement this new workflow, we would need a way
> > >> to export Dataset updates, as mentioned.
> > >>
> > >> Current Workflow
> > >> In our organization, we're dealing with multiple Airflow tenants -
> > >> let's say Tenant 1 and Tenant 2, as examples. To synchronize Dataset
> > >> A across these tenants, we currently have a complex setup:
> > >>
> > >>    1. Containers run on a schedule to export metadata to CosmosDB
> > >>    (these will be replaced by the listener).
> > >>    2. Additional scheduled containers pull data from CosmosDB and
> > >>    write it to a shared file system, enabling generated DAGs to read
> > >>    it and mirror a dataset across tenants (a sketch follows this
> > >>    list).
> > >>
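> > >> A rough sketch of the consumer side of step 2, assuming the mirrored
> > >> definitions land in a JSON file on the shared mount (the path, file
> > >> format, and generated dag are ours, i.e. hypothetical, not anything
> > >> Airflow ships):
> > >>
> > >>     # Generated dag file: re-declare remote datasets locally so
> > >>     # consumer dags in this tenant can schedule on them.
> > >>     import json
> > >>
> > >>     import pendulum
> > >>     from airflow.datasets import Dataset
> > >>     from airflow.decorators import dag, task
> > >>
> > >>     with open("/shared/mirrored_datasets.json") as f:  # hypothetical
> > >>         REMOTE_URIS = json.load(f)  # e.g. ["s3://my-bucket/dataset-a"]
> > >>
> > >>     for uri in REMOTE_URIS:
> > >>         @dag(dag_id=f"mirror_{uri.split('/')[-1]}", schedule=None,
> > >>              start_date=pendulum.datetime(2024, 1, 1), catchup=False)
> > >>         def mirror():
> > >>             @task(outlets=[Dataset(uri)])
> > >>             def emit():
> > >>                 ...  # marks the local copy as updated when triggered
> > >>
> > >>             emit()
> > >>
> > >>         mirror()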
> > >>
> > >> Proposed Workflow
> > >> Here's a breakdown of our proposed workflow:
> > >>
> > >>    1. Cross-Tenant Dataset Interaction: We have Dags in Tenant 1
> > >>    producing Dataset A. We need a mechanism to trigger all Dags
> > >>    consuming Dataset A in Tenant 2. This interaction is crucial for
> > >>    our data pipeline's efficiency and synchronicity.
> > >>    2. Dataset Listener Implementation: Our approach involves
> > >>    implementing a Dataset listener that programmatically creates
> > >>    Dataset A in all tenants where it's not present (like Tenant 2)
> > >>    and exports Dataset updates when they happen. This would trigger
> > >>    all Dags consuming from that Dataset (a sketch follows this list).
> > >>    3. Standardized Dataset Names: We plan to use standardized dataset
> > >>    names, which makes sense since a URI is a dataset's identifier and
> > >>    uniqueness is a logical requirement.
> > >>
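> > >> A minimal sketch of the listener from step 2, registered through a
> > >> plugin (dataset listener hooks exist as of Airflow 2.8; the peer
> > >> tenant URL and both endpoints below are assumptions, not existing
> > >> API):
> > >>
> > >>     # plugins/dataset_sync.py
> > >>     import sys
> > >>
> > >>     import requests
> > >>     from airflow.listeners import hookimpl
> > >>     from airflow.plugins_manager import AirflowPlugin
> > >>
> > >>     PEER_TENANTS = ["https://tenant2.example.com"]  # hypothetical
> > >>
> > >>     @hookimpl
> > >>     def on_dataset_created(dataset):
> > >>         # Mirror the definition into tenants where it doesn't exist
> > >>         # yet (a create endpoint like the one proposed in PR #36929).
> > >>         for base in PEER_TENANTS:
> > >>             requests.post(f"{base}/api/v1/datasets",
> > >>                           json={"uri": dataset.uri}, timeout=10)
> > >>
> > >>     @hookimpl
> > >>     def on_dataset_changed(dataset):
> > >>         # Export the update so consumer Dags in peer tenants trigger.
> > >>         for base in PEER_TENANTS:
> > >>             requests.post(f"{base}/api/v1/datasets/events",
> > >>                           json={"uri": dataset.uri}, timeout=10)
> > >>
> > >>     class DatasetSyncPlugin(AirflowPlugin):
> > >>         name = "dataset_sync"
> > >>         listeners = [sys.modules[__name__]]  # register hooks above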
> > >>
> > >> Why This Matters:
> > >>
> > >>    - It offers a streamlined, automated way to manage datasets across
> > >>    different Airflow instances.
> > >>    - It aligns with a need for efficient, interconnected workflows
> > >>    in a multi-tenant environment.
> > >>
> > >>
> > >> I invite the community to discuss:
> > >>
> > >>    - Are there alternative methods within Airflow's current framework
> > >>    that could achieve similar goals?
> > >>    - Any insights or experiences that could inform our approach?
> > >>
> > >> Your feedback and suggestions are invaluable, and I look forward to a
> > >> collaborative discussion.
> > >>
> > >> Best Regards,
> > >> Eduardo Nicastro
> > >>
> > >
> >
>
