uranusjr commented on issue #30974: URL: https://github.com/apache/airflow/issues/30974#issuecomment-2081854430
I’ve been thinking about this. Yeah, the above `wait_no_longer_than` interface seems to be the way to go (maybe with a different name, say `DatasetTimeout`). I initially considered using a timetable for this, but ultimately that does not work, since we may want to configure the “timeout” of each dataset differently, and a timetable would combine very awkwardly since it carries too much other information. (Side note: I think we will need to clean this up a lot as a part of Airflow 3, likely completely redesigning the `schedule` API, including how both timetables and datasets are passed in.)

Other than a timedelta, I think maybe a cron schedule might make sense? Or even more sense? I am not exactly sure how we should interpret a timedelta. Say I expect a dataset to fire every day, so I set `timedelta(days=2)`. Things generally fire at midnight, but one event got delayed a little and fired at 2am. The next one is missed. Should the timeout trigger at 2am or at midnight? That’s a minor design decision we can figure out later.

Adding that flag on Dataset itself feels wrong to me, since it’d force everything that depends on a dataset to have the same timeout. It’s not an entirely unreasonable requirement, but it seems a bit unnecessary to me. The idea of freshness ultimately does not live on the dataset itself IMO.

Another thing we need to consider (when we implement this) is how we should signal a timeout event. Should it emit a DatasetEvent with a flag (what)? Should it just be implied by the user? Should it be another kind of event?
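To make the timedelta question concrete, here is a minimal sketch of the two interpretations using only the standard library. Nothing here is Airflow API: the function names, the `anchor` and `period` parameters, and the example dates are all hypothetical, chosen only to reproduce the "delayed to 2am" scenario described above.

```python
from datetime import datetime, timedelta


def deadline_from_last_event(last_event: datetime, timeout: timedelta) -> datetime:
    """Interpretation A (hypothetical): count the timeout from the actual event time."""
    return last_event + timeout


def deadline_from_expected_slot(
    last_event: datetime,
    timeout: timedelta,
    period: timedelta,
    anchor: datetime,
) -> datetime:
    """Interpretation B (hypothetical): count the timeout from the expected slot.

    The last event is snapped back to the most recent expected firing time
    (anchor + n * period) at or before it, and the timeout starts there.
    """
    slots_elapsed = (last_event - anchor) // period  # whole periods since anchor
    expected_slot = anchor + slots_elapsed * period
    return expected_slot + timeout


# Assumed scenario: events are expected daily at midnight, timeout is two days,
# and the most recent event arrived two hours late.
anchor = datetime(2024, 4, 1)             # a known midnight firing
period = timedelta(days=1)
timeout = timedelta(days=2)
last_event = datetime(2024, 4, 10, 2, 0)  # delayed to 2am

deadline_a = deadline_from_last_event(last_event, timeout)
deadline_b = deadline_from_expected_slot(last_event, timeout, period, anchor)
print(deadline_a)
print(deadline_b)
```

Under interpretation A the timeout lands at 2am on April 12, while under B it lands at midnight on April 12. A cron-style expression would pin down the expected slots explicitly, which is roughly what interpretation B approximates with `anchor` and `period`.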
