uranusjr commented on issue #30974:
URL: https://github.com/apache/airflow/issues/30974#issuecomment-2081854430

   I’ve been thinking about this. Yeah, the `wait_no_longer_than` interface 
above seems to be the way to go (maybe with a different name, say 
`DatasetTimeout`).
   
   I initially considered using a timetable for this, but ultimately that does 
not work: we may want to configure the “timeout” of each dataset differently, 
and a timetable would combine very awkwardly since it carries too much other 
information.
   
   (Side note: I think we will need to clean this up a lot as a part of Airflow 
3, and likely completely redesign the `schedule` API, including how both 
timetables and datasets are passed in.)
   
   Other than timedelta, I think maybe a cron schedule might make sense? Or 
even more sense? I am not exactly sure how we should interpret a timedelta. Say 
I expect a dataset to fire every day, so I set `timedelta(days=2)`. Events 
generally fire at midnight, but one got delayed a little and fired at 2am, and 
the next one is missed entirely. Should the timeout trigger at 2am or at 
midnight? That’s a minor design decision we can figure out later.
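
   To make the ambiguity concrete, here is a rough sketch of the two readings 
in plain Python. Nothing here is Airflow API: the function names are made up 
for illustration, the dates are arbitrary, and the "expected slot" is assumed 
to be midnight of the day the late event arrived.

```python
from datetime import datetime, timedelta


def deadline_from_last_event(last_event: datetime, timeout: timedelta) -> datetime:
    """Reading 1: the clock starts when the last event actually arrived."""
    return last_event + timeout


def deadline_from_expected_slot(last_event: datetime, timeout: timedelta) -> datetime:
    """Reading 2: the clock starts at the slot the event was expected in.

    Assumption for this sketch: the expected slot is midnight of the day
    on which the (late) event arrived.
    """
    expected = last_event.replace(hour=0, minute=0, second=0, microsecond=0)
    return expected + timeout


# The event was expected at midnight but arrived at 2am (arbitrary date).
last_event = datetime(2024, 4, 28, 2, 0)
timeout = timedelta(days=2)

print(deadline_from_last_event(last_event, timeout))     # 2024-04-30 02:00:00
print(deadline_from_expected_slot(last_event, timeout))  # 2024-04-30 00:00:00
```

   A cron expression would avoid the question, since the deadline would always 
snap to a wall-clock slot regardless of when the last event arrived.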
   
   Adding that flag on Dataset itself feels wrong to me since it’d force 
everything that depends on a dataset to have the same timeout. That’s not an 
entirely unreasonable requirement, but it seems a bit unnecessary to me. The 
idea of freshness ultimately does not live on the dataset itself IMO.
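
   As a rough illustration of keeping the timeout on the consumer side, here is 
a plain-Python sketch. `DatasetTimeout` is only the name floated above, not an 
existing Airflow class, and the fields, URI, and variable names are all 
hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DatasetTimeout:
    """Hypothetical wrapper pairing a dataset URI with a per-consumer timeout."""

    uri: str
    timeout: timedelta


# Two consumers depend on the same (made-up) dataset URI, but each one
# declares its own tolerance for staleness.
report_schedule = [DatasetTimeout("s3://bucket/orders", timedelta(hours=6))]
rollup_schedule = [DatasetTimeout("s3://bucket/orders", timedelta(days=2))]

same_dataset = report_schedule[0].uri == rollup_schedule[0].uri
different_timeouts = report_schedule[0].timeout != rollup_schedule[0].timeout
print(same_dataset, different_timeouts)
```

   If the timeout lived on the dataset instead, both consumers would have to 
share a single value.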
   
   Another thing we need to consider (when we implement this) is, how should we 
signal a timeout event? Should it emit a DatasetEvent with a flag (what)? 
Should it just be implied by the user? Should it be another kind of event?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
