jward-bw opened a new issue #12654:
URL: https://github.com/apache/airflow/issues/12654


   **Description**
   Make it possible to merge multiple backfills into a single run by extending 
the `start_date` of a single DAG run so that it covers the whole time period 
spanned by those backfills. 
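
   As a rough illustration of what "covering all backfills" could mean, the 
sketch below collapses several consecutive run intervals into one window. It is 
plain Python, not an existing Airflow API: `merge_backfill_windows` and the 
example dates are hypothetical, and it assumes the merged run should simply 
span from the earliest start to the latest end.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Interval = Tuple[datetime, datetime]


def merge_backfill_windows(intervals: List[Interval]) -> Interval:
    """Return one (start, end) window that covers every given interval.

    Hypothetical helper for illustration only; Airflow does not provide it.
    """
    if not intervals:
        raise ValueError("at least one interval is required")
    for start, end in intervals:
        if end <= start:
            raise ValueError(f"invalid interval: {start} .. {end}")
    earliest_start = min(start for start, _ in intervals)
    latest_end = max(end for _, end in intervals)
    return earliest_start, latest_end


# Three consecutive 6-hour runs (arbitrary example dates).
first_start = datetime(2020, 11, 26, 0, 0)
step = timedelta(hours=6)
failed_runs = [(first_start + i * step, first_start + (i + 1) * step) for i in range(3)]

merged_start, merged_end = merge_backfill_windows(failed_runs)
print(merged_start, merged_end, merged_end - merged_start)
```

   Note that this also covers any gaps between the intervals, which may or may 
not be the desired behaviour for non-contiguous backfills.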
   
   **Use case / motivation**
   There are cases where running multiple backfills is less efficient than 
having a single run, for example where tasks in successive runs would do 
duplicate work. 
   
   ***An example:***
   
   - We have a DAG which runs every 6 hours and processes batches of messages 
from the previous 6 hours by looking at the `execution_date` and 
`next_execution_date` macros. 
   - This DAG has a task which launches a scan across a very large HBase table 
looking for matching rows to apply these messages to. The scan takes the same 
amount of time regardless of the batch size, and it is the most time-consuming 
part of the DAG run (let's say it takes 3 out of 4 hours for an average DAG 
run).
   - An external error causes 3 successive DAG runs to fail.
   
   At this point we have 18 hours of data to catch up on. Assuming the external 
issue has been fixed, this would take on average 12 hours to process (3 runs at 
4 hours each), meaning further delays to processing future jobs. If instead we 
could merge these runs into a single backfill, the scan would only need to run 
once. Assuming the remaining work scales with the number of batches, this would 
reduce the processing time from 12 hours to something like 6 hours (one 3-hour 
scan plus roughly 1 hour per batch), greatly reducing the impact of delayed 
processing and also the resource usage on Airflow and HBase (in this case, but 
in general on other external services).
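
   The 12-hour and roughly 6-hour figures above can be reproduced with a short 
calculation. This is only a sketch: it assumes the non-scan work in each run 
(1 hour here) scales linearly with the number of batches, which the example 
implies but does not state outright.

```python
SCAN_HOURS = 3    # fixed cost of the HBase scan, independent of batch size
RUN_HOURS = 4     # average total duration of one run, scan included
FAILED_RUNS = 3   # number of successive runs that need to be re-processed

# Assumption: everything that is not the scan scales with the batch count.
per_batch_hours = RUN_HOURS - SCAN_HOURS

# Separate backfills repeat the scan once per run.
separate_hours = FAILED_RUNS * RUN_HOURS

# A merged run pays for the scan once, then handles each batch.
merged_hours = SCAN_HOURS + FAILED_RUNS * per_batch_hours

print(f"separate: {separate_hours}h, merged: {merged_hours}h")
```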
   
   This issue of inefficient processing is one that I (and I'm sure others) 
need to solve. There are obviously workarounds one could use, but I don't 
think they follow Airflow good practices. For example:
   - Temporarily alter the schedule interval to cover the desired range.
   - Introduce an override in the Airflow variables to make the next run 
process X batches.
   - Temporarily alter the DAG code.
   - Run the DAG tasks manually, outside Airflow, with the desired parameters.
   
   All of these have their own pitfalls and invariably involve some other 
manual intervention in Airflow to ensure the database is kept accurate and/or 
future runs aren't affected.
   
   If there is some other solution to this problem that I am unaware of, please 
let me know. I have raised this as an RFC as any change that implements this 
feature would touch many areas of the code base, so would require some planning.
   

