zikun commented on issue #9914:
URL: https://github.com/apache/airflow/issues/9914#issuecomment-680934697


   Just want to write down my exploration and thoughts, which might help if anyone wants to pick this up or discuss it further.
   
   I looked into the scheduling logic, and it turns out that the behaviour of catching up the most recent run was not deliberately designed. Rather, it is a "side effect" of the fundamental scheduling logic.
   
   The Airflow scheduler runs like a batch job. Every few seconds, it parses the DAGs and determines the next dag_run (if any) for every DAG. For example, for a DAG scheduled to run hourly (`0 * * * *`), a scheduler cycle starting at 2020-07-22T02:00:09 will schedule the "next" dag_run with `next_execution_time = 2020-07-22T02:00:00` (in other words, `execution_time = 2020-07-22T01:00:00`). Note that although I'm using the word **"next"**, the scheduler is really doing a look-back to determine and schedule the "next" dag_run. If a DAG is paused from the beginning and is enabled at 2020-07-22T02:30:00, the next scheduler cycle will look back and still find the "next" dag_run to be `next_execution_time = 2020-07-22T02:00:00`. So **it appears to be a catchup, but it is really just a longer-delayed scheduling**.
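   A toy sketch of that look-back for the hourly example (this is not Airflow's actual code, which handles general cron expressions; the function name and the hard-coded hourly truncation are mine):
   
   ```python
   from datetime import datetime, timedelta
   
   def most_recent_hourly_run(now, interval=timedelta(hours=1)):
       # Truncate "now" back to the latest hourly boundary at or before it.
       # That boundary is next_execution_time (the *end* of the scheduled
       # interval); the run's execution_date is one interval earlier.
       boundary = now.replace(minute=0, second=0, microsecond=0)
       return boundary - interval, boundary  # (execution_date, next_execution_time)
   
   # Scheduler cycle at 02:00:09 -> execution_date 01:00, next_execution_time 02:00
   ed, net = most_recent_hourly_run(datetime(2020, 7, 22, 2, 0, 9))
   
   # A cycle at 02:30:00 (e.g. right after unpausing) looks back to the SAME run,
   # which is why unpausing looks like a catchup of the most recent interval.
   ed2, net2 = most_recent_hourly_run(datetime(2020, 7, 22, 2, 30))
   assert (ed2, net2) == (ed, net)
   ```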
   
   Therefore, if we want to prevent this most-recent catchup, we would probably have to add more complex logic and modify the DAG model, for example by adding a timestamp recording when a DAG was enabled so that the scheduler can skip the most-recent run. I'm not sure it's worth it for this feature.
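   A hypothetical sketch of that guard (the function and the `enabled_at` field are inventions for illustration, not anything in Airflow's DAG model):
   
   ```python
   from datetime import datetime, timedelta
   
   def should_schedule(execution_date, enabled_at, interval=timedelta(hours=1)):
       # Hypothetical check: skip any run whose schedule interval ended before
       # the DAG was (re-)enabled, so unpausing does not trigger the
       # most-recent run as a delayed "catchup".
       if enabled_at is None:  # never paused: schedule normally
           return True
       interval_end = execution_date + interval  # == next_execution_time
       return interval_end > enabled_at
   
   enabled_at = datetime(2020, 7, 22, 2, 30)
   # Interval [01:00, 02:00) ended before the DAG was enabled: skip it.
   assert not should_schedule(datetime(2020, 7, 22, 1, 0), enabled_at)
   # Interval [02:00, 03:00) ends after enabling: schedule it.
   assert should_schedule(datetime(2020, 7, 22, 2, 0), enabled_at)
   ```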


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
