Proposal:  I plan to add a new timestamp column named `last_queued_at` to the
`dagrun` table, which is updated any time the run is queued, including when
it is cleared.  DeadlineAlert code will be modified to use this new column for
any calculations which currently use `dagrun.queued_at`, and will fall back on
`queued_at` if `last_queued_at` is `null` or missing.  This will require a
small migration which sets the new column to `null` for existing rows.
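
As a rough sketch of the fallback described above (this is not the actual
DeadlineAlert code; `DagRunRow` and `deadline_reference_time` are hypothetical
names used only for illustration, and only the two column names come from the
proposal):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class DagRunRow:
    """Hypothetical stand-in for a `dagrun` row, not the real Airflow model."""

    queued_at: Optional[datetime]
    # Proposed column; stays None (null) for rows that existed before the migration.
    last_queued_at: Optional[datetime] = None


def deadline_reference_time(run) -> Optional[datetime]:
    """Prefer the proposed `last_queued_at`; fall back on `queued_at`."""
    # getattr covers the "missing" case, e.g. an object that predates the new column.
    last = getattr(run, "last_queued_at", None)
    return last if last is not None else run.queued_at
```

With this shape, rows that were never requeued after the migration behave
exactly as they do today, since the calculation still lands on `queued_at`.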

To summarize the discussion [1] regarding the `dagrun.queued_at` field: it
currently tracks the initial queue time and is never updated, which breaks
expected behavior of DeadlineAlerts (and maybe other areas?) if a run is
cleared or re-triggered.  In other words, the `queued_at` column essentially
represents the first time this run was queued, not the most recent time it was
attempted.  For example, if you expect an email when a run takes more than 30
minutes from when it was queued, and the run gets cleared and restarted, you
get that email 30 minutes after the first time it was queued, regardless of how
long the restarted run actually took.
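
To put numbers on that example, here is a small illustration.  The timestamps
and the `alert_time` helper are made up for this sketch; only the 30-minute
deadline comes from the example above.

```python
from datetime import datetime, timedelta, timezone

DEADLINE = timedelta(minutes=30)

# Illustrative timestamps: queued at 12:00, then cleared and requeued at 12:25.
first_queued = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
requeued = first_queued + timedelta(minutes=25)


def alert_time(reference: datetime) -> datetime:
    """Hypothetical helper: the alert fires once the deadline elapses after the reference."""
    return reference + DEADLINE


# Current behavior: the reference is the first queue time, so the alert fires
# at 12:30, only 5 minutes after the run was requeued.
current_alert = alert_time(first_queued)

# Expected behavior: the reference is the most recent queue time, so the alert
# fires at 12:55, a full 30 minutes after the requeue.
expected_alert = alert_time(requeued)

print(current_alert - requeued)   # time the restarted run gets today
print(expected_alert - requeued)  # time it would get with the new column
```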

There was a good discussion there and on Slack about expectations and a few 
ideas were proposed.  I think these are the two primary options:

Option 1: We leave `dagrun.queued_at` alone to represent the first time it was 
attempted and add a new field to the `dagrun` table which is updated each time 
it is queued, representing the most recent attempt.

Option 2: Add rows to the `Log` table to store when a run was queued/requeued 
(as suggested by Standish) and use that as the source of truth for when a 
specific run was last attempted.


While I like Option 2, it's a bigger project and feels like overkill for this,
especially considering the recent discussion [2] about the Log table getting
out of hand in some environments.  I think Option 1 is the right answer:
it maintains backward compatibility and solves the immediate issue well.

If there are no objections, I'll consider this accepted on Friday, 16 Jan at 
21:00 UTC.


[1] Email thread "DagRun queued_at timestamp discussion": 
https://lists.apache.org/thread/n5y2khy8l9472spoclmql3nj2bskqksj
[2] Email thread "Managing airflow database size and retention": 
https://lists.apache.org/thread/88odp590r1syklo5rok4tq3kxpkhv922


 - ferruzzi
