Maybe I'm missing something, but I can't see how REST endpoints for datasets could work in practice. AFAIK, Airflow has some objects that can be created by a dag-processor (DAGs, Datasets) and others that can be created via the API/UI (Connections, Variables), but never both at the same time. How would update/delete endpoints work if a Dataset was initially created declaratively from a DAG file? Would they throw an exception, or make an update that will then be reverted a little while later by the dag-processor anyway?
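To make the concern concrete, here is a minimal sketch (URI and dag_id are made up) of how a Dataset typically comes into existence from a DAG file today. The dag-processor keeps this definition in sync with the file on every parse, so anything changed through a hypothetical update/delete endpoint would be overwritten or resurrected shortly afterwards:

    from datetime import datetime

    from airflow import DAG
    from airflow.datasets import Dataset
    from airflow.operators.empty import EmptyOperator

    # Declared in a DAG file, so it is owned by the dag-processor: the Dataset
    # is (re)registered every time this file is parsed.
    dataset_a = Dataset("s3://example-bucket/dataset-a")

    with DAG(dag_id="produce_dataset_a", start_date=datetime(2024, 1, 1), schedule="@daily"):
        EmptyOperator(task_id="produce", outlets=[dataset_a])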
I always assumed that this was the reason why it's impossible to create DAGs through the API: no one wanted to open this particular can of worms. If you need to synchronize these objects, I think the cleaner way would be to describe them in some sort of shared config file and let the respective dag-processors create them independently of each other (rough sketch at the bottom of this mail, below the quoted thread).

On Tue, Jan 23, 2024 at 4:02 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> I am also pretty cool with adding/updating datasets externally, however I know there are some ongoing discussions on how to improve/change datasets and bind them together with multiple other features of Airflow - not sure what the state of those is, but it would be great if those efforts were coordinated so that we are not pulling stuff in multiple directions.
>
> From what I've heard/overheard about Datasets, those things are:
>
> * AIP-60 - https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-60+Standard+URI+representation+for+Airflow+Datasets - already almost passed
> * Better coupling of datasets with OpenLineage
> * Partial datasets - allowing datasets with data intervals
> * Triggering DAGs on external dataset changes
> * Object Storage integration with datasets
>
> All of which sound very promising and are definitely important for Dataset usage.
>
> So I think we should really make sure that, when we are doing anything with datasets, the people who think about/work on the aspects above have a say in those proposals/discussions - it would be a shame if we added something that partially invalidates, or makes terribly complex to implement, some of the other things.
>
> I am not saying that's the case here, I am just saying that we should at least make sure that the people who are currently thinking about these things don't end up surprised if we merge something that makes their job harder.
>
> I am a little surprised - knowing the *thinking* happening in the dataset area that I am aware of - that there are so few comments on this one (even a "hey, looks cool - works well for the things I am thinking about" would help) :).
>
> J.
>
> On Tue, Jan 23, 2024 at 3:53 AM Ryan Hatter <ryan.hat...@astronomer.io.invalid> wrote:
>
> > I don't think it makes sense to include the create endpoint without also including dataset update and delete endpoints and updating the Datasets view in the UI to be able to manage externally created Datasets.
> >
> > With that said, I don't think the fact that Datasets are tightly coupled with DAGs is a good reason not to include additional Dataset endpoints. It makes sense to me to be able to interact with Datasets from outside of Airflow.
> >
> > On Sat, Jan 20, 2024 at 6:13 AM Eduardo Nicastro <edu.nicas...@gmail.com> wrote:
> >
> > > Hello all, I have created a Pull Request (https://github.com/apache/airflow/pull/36929) to make it possible to create a dataset through the API as a modest step forward. This PR is open for your feedback. I'm preparing another PR to build upon the insights from https://github.com/apache/airflow/pull/29433. Your thoughts and contributions are highly encouraged.
> > >
> > > Best Regards,
> > > Eduardo Nicastro
> > >
> > > On Thu, Jan 11, 2024 at 4:30 PM Eduardo Nicastro <edu.nicas...@gmail.com> wrote:
> > >
> > >> Hello all,
> > >>
> > >> I'm reaching out to propose a topic for discussion that has recently emerged in our GitHub discussion threads (#36723 <https://github.com/apache/airflow/discussions/36723>). It revolves around enhancing the management of datasets in a multi-tenant Airflow architecture.
> > >>
> > >> Use case/motivation
> > >>
> > >> In our multi-instance setup, synchronizing dataset dependencies across instances poses significant challenges. With the advent of dataset listeners, a new door has opened for cross-instance dataset awareness. I propose we explore creating endpoints to export dataset updates, to make it possible to trigger DAGs consuming from a Dataset across tenants.
> > >>
> > >> Context
> > >>
> > >> Below I give some context about our current situation and the solution we have in place, and propose a new workflow that would be more efficient. To implement this new workflow we would need a way to export Dataset updates, as mentioned.
> > >>
> > >> Current Workflow
> > >>
> > >> In our organization we're dealing with multiple Airflow tenants - let's say Tenant 1 and Tenant 2, as examples. To synchronize Dataset A across these tenants, we currently have a complex setup:
> > >>
> > >> 1. Containers run on a schedule to export metadata to CosmosDB (these will be replaced by the listener).
> > >> 2. Additional scheduled containers pull data from CosmosDB and write it to a shared file system, enabling generated DAGs to read it and mirror a dataset across tenants.
> > >>
> > >> Proposed Workflow
> > >>
> > >> Here's a breakdown of our proposed workflow:
> > >>
> > >> 1. Cross-Tenant Dataset Interaction: We have DAGs in Tenant 1 producing Dataset A. We need a mechanism to trigger all DAGs consuming Dataset A in Tenant 2. This interaction is crucial for our data pipeline's efficiency and synchronicity.
> > >> 2. Dataset Listener Implementation: Our approach involves implementing a Dataset listener that programmatically creates Dataset A in all tenants where it's not present (like Tenant 2) and exports Dataset updates when they happen. This would trigger an update of all DAGs consuming from that Dataset.
> > >> 3. Standardized Dataset Names: We plan to use standardized dataset names, which makes sense since a URI is a Dataset's identifier and uniqueness is a logical requirement.
> > >>
> > >> [image: image.png]
> > >>
> > >> Why This Matters:
> > >>
> > >> - It offers a streamlined, automated way to manage datasets across different Airflow instances.
> > >> - It aligns with a need for efficient, interconnected workflows in a multi-tenant environment.
> > >>
> > >> I invite the community to discuss:
> > >>
> > >> - Are there alternative methods within Airflow's current framework that could achieve similar goals?
> > >> - Any insights or experiences that could inform our approach?
> > >>
> > >> Your feedback and suggestions are invaluable, and I look forward to a collaborative discussion.
> > >>
> > >> Best Regards,
> > >> Eduardo Nicastro
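To make the shared-config suggestion from the top of this mail a bit more concrete, here is a rough sketch. The file name, URIs and layout are all made up; the point is only that each tenant's dag-processor can build identical Dataset definitions from one shared file, with no cross-instance API calls involved:

    # datasets.yaml, shared between tenants (e.g. a common repo or mounted volume):
    #
    #   datasets:
    #     - s3://example-bucket/dataset-a
    #     - s3://example-bucket/dataset-b

    from pathlib import Path

    import yaml  # PyYAML
    from airflow.datasets import Dataset


    def load_shared_datasets(path="datasets.yaml"):
        """Build the same Dataset objects in every tenant from one shared file."""
        config = yaml.safe_load(Path(path).read_text())
        return {uri: Dataset(uri) for uri in config["datasets"]}


    DATASETS = load_shared_datasets()

    # Tenant 1 (producer DAG):  outlets=[DATASETS["s3://example-bucket/dataset-a"]]
    # Tenant 2 (consumer DAG):  schedule=[DATASETS["s3://example-bucket/dataset-a"]]
    # Each dag-processor creates the Dataset independently, so nothing has to be
    # pushed through the REST API to keep the definitions themselves in sync.

It doesn't solve cross-tenant triggering by itself - you still need some transport for the "Dataset A was updated" event - but it keeps the object definitions out of the API entirely.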