cmarteepants commented on issue #30974: URL: https://github.com/apache/airflow/issues/30974#issuecomment-2052540712
@uranusjr you're closer to this than I am. What is your opinion?

I think there's a lot of value in supporting these scenarios - but there's something about the approach that is bothering me. Should these rules be defined as `Dataset` args?

A common pattern that I have seen amongst Astronomer customers is for data producers to define datasets in a [consolidated file](https://github.com/astronomer/snowpatrol/blob/main/include/datasets.py) in order to make them discoverable for data consumers. Data consumers will then import them to use for scheduling purposes. Intuitively, it's the data consumer that decides how important upstream datasets are to their process and how fresh that data needs to be. Even for the same dataset, the level of tolerance can be different across DAGs.

What if we were to do something like this instead?

```
with DAG(
    dag_id="multiple_datasets_example",
    schedule=[
        Dataset("s3://dataset/example1.csv"),
        Dataset("s3://dataset/example2.csv"),
        wait_no_longer_than(timedelta(hours=1), Dataset("s3://dataset/example3.csv")),
    ],
    ...,
):
```

---

My other concern is specific to the second scenario. While some dependencies are for enrichment, for production there will still be some sort of expectation that `multiple_datasets_example` will run on some sort of cadence. How will someone know that the reason their DAG isn't running is because the 3rd dataset has gone stale? I personally would not be comfortable using a `freshness` rule unless I could set up a notification to tell me that it's been more than a month since `"s3://dataset/example3.csv"` was updated.

My opinion is to punt this until we have the ability to be notified, and focus on the 1st scenario.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
