Clarifying: There is not (and never has been) a problem with opening up the submission of "structured" DAGs.
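To make the "structured, not Python" distinction concrete, here is a rough sketch of what registering a dataset as a pure metadata object over the REST API could look like. The endpoint, payload shape and auth below are invented purely for illustration - nothing like this exists in Airflow today, and whatever lands via the PRs under discussion may well look different:

import requests

# Purely hypothetical call - a "create dataset" endpoint is only being
# discussed (see https://github.com/apache/airflow/pull/36929); the path,
# payload and auth here are illustrative, not the actual Airflow REST API.
resp = requests.post(
    "https://tenant-2.example.com/api/v1/datasets",  # made-up tenant URL
    auth=("api_user", "api_password"),
    json={
        "uri": "s3://warehouse/orders/daily",  # standardized dataset URI
        "extra": {"owner": "team-data", "description": "daily orders export"},
    },
)
resp.raise_for_status()
print(resp.json())

The same argument applies to a whole "structured" DAG: a serialized DAG structure is also just data, so it could in principle be submitted the same way - the hard parts are everything around it (callbacks, timetables, ownership of the record), not the transport.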
On Tue, Jan 23, 2024 at 2:12 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>
> I always assumed that this was the reason why it's impossible to create dags from API, no one wanted to open this particular can of worms. I think if you need to synchronize these objects, the cleaner way would be to describe them in some sort of a shared config file and let respective dag-processors create them independently of each other.
>
> Just to clarify this one: creating DAGs via API has been resented mostly because of security reasons - where you would want to submit Python DAG code via API. There is (and it has never been) a problem with opening up submitting "structured" DAGs. This has never been implemented, but if you would like to limit it to just modifying or creating the resulting DAG structure, that would be possible - for example there is no fundamental problem with generating a DAG from (say) a visual representation and submitting the resulting DAG structure without creating a DAG Python file (so essentially playing the role of the DAG file processor to serialize DAGs). It would have a number of limitations (for example callbacks would not work, timetables would be a challenge, etc.), but other than that it's quite possible (and possibly even in the future we might have something like that).
>
> Following that - there are no fundamental problems with submitting datasets - because they are not Python code, they are pure "metadata" objects.
>
> Still, the question of how this plays with DAG-created datasets remains an important aspect of the proposal.
>
> J.
>
> On Tue, Jan 23, 2024 at 2:01 PM Tornike Gurgenidze <togur...@freeuni.edu.ge> wrote:
>
>> Maybe I'm missing something, but I can't see how REST endpoints for datasets could work in practice. afaik, Airflow has some objects that can be created by a dag processor (Dags, Datasets) and others that can be created with API/UI (Connections, Variables), but never both at the same time. How would update/delete endpoints work if a Dataset was initially created declaratively from a dag file? Would it throw an exception or make an update that will then be reverted in a little while by a dag-processor anyway?
>>
>> I always assumed that this was the reason why it's impossible to create dags from API, no one wanted to open this particular can of worms. I think if you need to synchronize these objects, the cleaner way would be to describe them in some sort of a shared config file and let respective dag-processors create them independently of each other.
>>
>> On Tue, Jan 23, 2024 at 4:02 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>> > I am also pretty cool with adding/updating datasets externally, however I know there are some ongoing discussions on how to improve/change datasets and bind them together with multiple other features of Airflow - not sure what the state of those is, but it would be great if those efforts are coordinated so that we are not pulling stuff in multiple directions.
>> >
>> > From what I've heard/overheard/noticed, the things going on around Datasets are:
>> >
>> > * AIP-60 - https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-60+Standard+URI+representation+for+Airflow+Datasets - already almost passed
>> > * Better coupling of datasets with OpenLineage
>> > * Partial datasets - allowing to have datasets with data intervals
>> > * Triggering dags on external dataset changes
>> > * Object Storage integration with datasets
>> >
>> > All of which sound very promising and are definitely important for Dataset usage.
>> >
>> > So I think we should really make sure that when we are doing anything with datasets, the people who think about/work on the aspects above have a say in those proposals/discussions - it would be a shame if we add something that partially invalidates some of the other things or makes them terribly complex to implement.
>> >
>> > I am not saying it's the case here, I am just saying that we should at least make sure that people who are currently thinking about these things aren't surprised if we merge something that will make their job harder.
>> >
>> > I am a little surprised - knowing the *thinking* happening in the dataset area that I am aware of - that there are so few comments on this one (even a "hey, looks cool - works well for the things I am thinking about") :).
>> >
>> > J.
>> >
>> > On Tue, Jan 23, 2024 at 3:53 AM Ryan Hatter <ryan.hat...@astronomer.io.invalid> wrote:
>> >
>> > > I don't think it makes sense to include the create endpoint without also including dataset update and delete endpoints and updating the Datasets view in the UI to be able to manage externally created Datasets.
>> > >
>> > > With that said, I don't think the fact that Datasets are tightly coupled with DAGs is a good reason not to include additional Dataset endpoints. It makes sense to me to be able to interact with Datasets from outside of Airflow.
>> > >
>> > > On Sat, Jan 20, 2024 at 6:13 AM Eduardo Nicastro <edu.nicas...@gmail.com> wrote:
>> > >
>> > > > Hello all, I have created a Pull Request (https://github.com/apache/airflow/pull/36929) to make it possible to create a dataset through the API as a modest step forward. This PR is open for your feedback. I'm preparing another PR to build upon the insights from https://github.com/apache/airflow/pull/29433. Your thoughts and contributions are highly encouraged.
>> > > >
>> > > > Best Regards,
>> > > > Eduardo Nicastro
>> > > >
>> > > > On Thu, Jan 11, 2024 at 4:30 PM Eduardo Nicastro <edu.nicas...@gmail.com> wrote:
>> > > >
>> > > >> Hello all,
>> > > >>
>> > > >> I'm reaching out to propose a topic for discussion that has recently emerged in our GitHub discussion threads (#36723 <https://github.com/apache/airflow/discussions/36723>). It revolves around enhancing the management of datasets in a multi-tenant Airflow architecture.
>> > > >>
>> > > >> Use case/motivation
>> > > >> In our multi-instance setup, synchronizing dataset dependencies across instances poses significant challenges. With the advent of dataset listeners, a new door has opened for cross-instance dataset awareness. I propose we explore creating endpoints to export dataset updates, to make it possible to trigger DAGs consuming from a Dataset across tenants.
>> > > >>
>> > > >> Context
>> > > >> Below I will give some context about our current situation and the solution we have in place, and propose a new workflow that would be more efficient. To be able to implement this new workflow we would need a way to export Dataset updates, as mentioned.
>> > > >>
>> > > >> Current Workflow
>> > > >> In our organization, we're dealing with multiple Airflow tenants, let's say Tenant 1 and Tenant 2, as examples. To synchronize Dataset A across these tenants, we currently have a complex setup:
>> > > >>
>> > > >>    1. Containers run on a schedule to export metadata to CosmosDB (these will be replaced by the listener).
>> > > >>    2. Additional scheduled containers pull data from CosmosDB and write it to a shared file system, enabling generated DAGs to read it and mirror a dataset across tenants.
>> > > >>
>> > > >> Proposed Workflow
>> > > >> Here's a breakdown of our proposed workflow:
>> > > >>
>> > > >>    1. Cross-Tenant Dataset Interaction: We have Dags in Tenant 1 producing Dataset A. We need a mechanism to trigger all Dags consuming Dataset A in Tenant 2. This interaction is crucial for our data pipeline's efficiency and synchronicity.
>> > > >>    2. Dataset Listener Implementation: Our approach involves implementing a Dataset listener that programmatically creates Dataset A in all tenants where it's not present (like Tenant 2) and exports Dataset updates when they happen. This would trigger an update on all Dags consuming from that Dataset.
>> > > >>    3. Standardized Dataset Names: We plan to use standardized dataset names, which makes sense since a URI is a Dataset's identifier and uniqueness is a logical requirement.
>> > > >>
>> > > >> [image: image.png]
>> > > >>
>> > > >> Why This Matters:
>> > > >>
>> > > >>    - It offers a streamlined, automated way to manage datasets across different Airflow instances.
>> > > >>    - It aligns with a need for efficient, interconnected workflows in a multi-tenant environment.
>> > > >>
>> > > >> I invite the community to discuss:
>> > > >>
>> > > >>    - Are there alternative methods within Airflow's current framework that could achieve similar goals?
>> > > >>    - Any insights or experiences that could inform our approach?
>> > > >>
>> > > >> Your feedback and suggestions are invaluable, and I look forward to a collaborative discussion.
>> > > >>
>> > > >> Best Regards,
>> > > >> Eduardo Nicastro
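P.S. For reference, here is roughly what the listener side of Eduardo's proposed workflow could look like. This is only a sketch: the on_dataset_changed hook and the plugin registration follow the dataset listener API added in Airflow 2.8 as I understand it, while the remote tenant URL, the auth and the "export" endpoint it posts to are invented for illustration - that endpoint is exactly the piece this thread is proposing to add:

# my_dataset_listener.py - sketch of a listener in Tenant 1 that forwards
# dataset events to Tenant 2. Assumes Airflow 2.8+ dataset listener hooks.
import requests
from airflow.listeners import hookimpl

TENANT_2_API = "https://tenant-2.example.com/api/v1"  # made-up remote tenant


@hookimpl
def on_dataset_changed(dataset):
    # Fired when a task in this tenant produces an update for `dataset`.
    # Posts to a *hypothetical* endpoint on the other tenant; today one would
    # have to fall back to triggering the consuming DAGs via the dagRuns API.
    requests.post(
        f"{TENANT_2_API}/datasets/events",  # not an existing endpoint (yet)
        auth=("api_user", "api_password"),
        json={"dataset_uri": dataset.uri, "extra": dict(dataset.extra or {})},
        timeout=10,
    )


# plugins/dataset_listener_plugin.py - register the listener module as a plugin
from airflow.plugins_manager import AirflowPlugin

import my_dataset_listener


class DatasetListenerPlugin(AirflowPlugin):
    name = "dataset_listener_plugin"
    listeners = [my_dataset_listener]

The standardized-URI part of the proposal matters here: the forwarded event is only meaningful if the same URI identifies the same Dataset in both tenants, which is also why AIP-60 is relevant to this discussion.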