cmarteepants commented on issue #30974:
URL: https://github.com/apache/airflow/issues/30974#issuecomment-2052540712

   @uranusjr you're closer to this than I am. What is your opinion?
   
   I think there's a lot of value in supporting these scenarios - but there's 
something about the approach that is bothering me. Should these rules be 
defined as `Dataset` args? 
   
   A common pattern that I have seen amongst Astronomer customers is for data 
producers to define datasets in a [consolidated 
file](https://github.com/astronomer/snowpatrol/blob/main/include/datasets.py) 
in order to make them discoverable for data consumers. Data consumers will then 
import them to use for scheduling purposes. 
   
   Intuitively, it's the data consumer that decides how important upstream 
datasets are to their process and how fresh that data needs to be. Even for the 
same dataset, the level of tolerance can be different across DAGs. What if we 
were to do something like this instead?
   
   ```
   with DAG(
       dag_id="multiple_datasets_example",
       schedule=[
           Dataset("s3://dataset/example1.csv"),
           Dataset("s3://dataset/example2.csv"),
           wait_no_longer_than(timedelta(hours=1), Dataset("s3://dataset/example3.csv")),
       ],
       ...,
   ):
   ```
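
   To make the consumer-side idea concrete, here is a minimal sketch in plain Python, without Airflow. Everything in it is hypothetical: `wait_no_longer_than` does not exist in Airflow, the `Dataset` class is only a stand-in for the real one, and `is_ready`, `updated`, and `waiting_since` are names made up for illustration. It also assumes one particular reading of the rule: a wrapped dataset stops blocking the run once `max_wait` has elapsed since the DAG started waiting.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Dataset:
    """Stand-in for Airflow's Dataset; only the URI matters in this sketch."""
    uri: str


@dataclass(frozen=True)
class WaitNoLongerThan:
    """Hypothetical wrapper: wait for `dataset`, but for at most `max_wait`."""
    max_wait: timedelta
    dataset: Dataset


def wait_no_longer_than(max_wait, dataset):
    # Hypothetical helper mirroring the call shape proposed in the comment.
    return WaitNoLongerThan(max_wait, dataset)


def is_ready(schedule, updated, waiting_since, now):
    """Return True when no schedule entry is still blocking the run.

    schedule:      list of Dataset or WaitNoLongerThan entries
    updated:       set of URIs that have received an update
    waiting_since: when the DAG started waiting (assumed reference point)
    now:           current time
    """
    for entry in schedule:
        if isinstance(entry, WaitNoLongerThan):
            timed_out = now - waiting_since >= entry.max_wait
            if entry.dataset.uri not in updated and not timed_out:
                return False  # still inside the tolerance window
        elif entry.uri not in updated:
            return False  # plain datasets always block until updated
    return True


schedule = [
    Dataset("s3://dataset/example1.csv"),
    Dataset("s3://dataset/example2.csv"),
    wait_no_longer_than(timedelta(hours=1), Dataset("s3://dataset/example3.csv")),
]
```

   Because the tolerance lives in the consuming DAG's `schedule` rather than on the shared `Dataset` definition, two DAGs importing the same dataset could pick different limits.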
   ---
   My other concern is specific to the second scenario. While some dependencies 
are for enrichment, in production there will still be an expectation that 
`multiple_datasets_example` runs on some sort of cadence. How will someone know 
that their DAG isn't running because the third dataset has gone stale? I 
personally would not be comfortable using a `freshness` rule unless I could set 
up a notification to tell me that it's been more than a month since 
`"s3://dataset/example3.csv"` was updated. My opinion is to punt on this until 
we have the ability to be notified, and focus on the first scenario. 
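
   The kind of check such a notification would need could look roughly like the sketch below. It is plain Python and not an Airflow API: `stale_datasets` and the `last_updated` mapping are hypothetical, and where the update timestamps come from, and how the alert is delivered, is left open.

```python
from datetime import datetime, timedelta


def stale_datasets(last_updated, max_age, now):
    """Return the URIs whose most recent update is older than `max_age`.

    last_updated: dict mapping dataset URI -> datetime of the last update,
                  or None if the dataset has never been updated
    """
    stale = []
    for uri, timestamp in last_updated.items():
        if timestamp is None or now - timestamp > max_age:
            stale.append(uri)
    return sorted(stale)


# Example: flag anything not updated within the last 30 days.
now = datetime(2024, 4, 1)
last_updated = {
    "s3://dataset/example1.csv": datetime(2024, 3, 30),
    "s3://dataset/example3.csv": datetime(2024, 2, 1),
}
to_alert = stale_datasets(last_updated, timedelta(days=30), now)
```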


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
