akomisarek opened a new issue, #50890:
URL: https://github.com/apache/airflow/issues/50890

   ### Apache Airflow version
   
   Other Airflow 2 version (please specify below)
   
   ### If "Other Airflow 2 version" selected, which one?
   
   2.10.x
   
   ### What happened?
   
   While upgrading to 2.10 we were aware of this change https://github.com/apache/airflow/pull/38891, 
which addresses some of the problems/expectations reported here: 
https://github.com/apache/airflow/issues/38826
   
   Unfortunately, it broke for us in quite unusual circumstances. We are working on 
improvements on our end, but we believe this is actually unexpected behaviour of 
the feature. 
   
   We have many DAGs in a single instance (close to 2k) and a deployment process 
to K8s where the image ships without DAGs; they are only synced afterwards. We 
also have the standalone `dag_processor` enabled. 
   
   What is happening is one of two things:
   
   * The `dag_processor` decides to deactivate DAGs which have not been parsed for 
some time (due to the large number of DAGs/slow parsing)
   * The DAGs are deactivated during the Airflow startup.
   
   At that point, any `dataset`-producing task which finishes while the 
downstream Dataset-scheduled DAG is deactivated will emit an event which is 
ignored. The event is only picked up during a subsequent execution, which can 
lead to delays. 
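   
   To illustrate the mechanics, here is a minimal pure-Python simulation of the 
behaviour we observed (this is not Airflow's actual code, just a sketch): the 
dataset event is always recorded, but a run is only queued for consumers that are 
active and unpaused, so an event that fires while the consumer is deactivated 
only gets delivered together with the next event.
   
   ```python
   from dataclasses import dataclass
   
   @dataclass
   class Consumer:
       dag_id: str
       is_active: bool = True
       is_paused: bool = False
       last_consumed: int = 0  # index into the event log
   
   event_log: list[str] = []  # dataset events are always recorded
   
   def emit_dataset_event(event: str, consumers: list[Consumer]):
       """Simplified model of the scheduler: record the event, but only
       queue a run for consumers that are active and not paused."""
       event_log.append(event)
       runs = []
       for c in consumers:
           if c.is_active and not c.is_paused:
               # a new run consumes every event since the last one it saw
               runs.append((c.dag_id, event_log[c.last_consumed:]))
               c.last_consumed = len(event_log)
           # else: no run is queued; the event waits for a later trigger
       return runs
   
   downstream = Consumer("downstream")
   downstream.is_active = False  # deactivated by the dag_processor
   print(emit_dataset_event("event-1", [downstream]))  # [] -- no run queued
   
   downstream.is_active = True   # DAG re-appears after re-parsing
   print(emit_dataset_event("event-2", [downstream]))
   # [('downstream', ['event-1', 'event-2'])] -- both events in a single run
   ```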
   
   I read in the discussion that the `catchup` flag is not used for Dataset 
scheduling, and I feel this is the problem for us: it intuitively doesn't make 
sense that it is ignored, i.e. I wouldn't expect the DAG to pick up multiple 
Dataset events on a subsequent trigger, but rather to run as soon as possible. 
   
   ### What you think should happen instead?
   
   I believe either one of two things should happen:
   
   * If `catchup` is configured, the DAG should be scheduled immediately when 
it is activated/appears. This would be consistent with the behaviour of 
time-based scheduling and the catchup parameter. 
   * OR maybe the change introduced in 
https://github.com/apache/airflow/pull/38891 could be relaxed to still trigger 
deactivated DAGs (only paused ones could be ignored?). 
   
   Any other ideas? We are obviously working on our end to avoid long 
parsing/deactivations, but I believe this behaviour is quite confusing. It was 
quite challenging to spot/troubleshoot and led to daily data delays (in some 
instances longer, if you were extremely unlucky). 
   
   Is Airflow 3 handling this any better? 
   
   ### How to reproduce
   
   I believe our scenario can be reproduced by having a Dataset-aware DAG and an 
appropriate consumer: remove the consumer while the upstream job is 
triggered, then re-add it upon completion. 
   
   The `dataset` event won't cause the downstream DAG to execute, but the next 
upstream execution will trigger the downstream DAG and pass both events into a 
single run. 
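   
   A hypothetical pair of DAGs for the reproduction (the DAG ids and dataset URI 
are made up; this assumes the standard Airflow 2.4+ Dataset API):
   
   ```python
   from datetime import datetime
   
   from airflow import DAG
   from airflow.datasets import Dataset
   from airflow.operators.empty import EmptyOperator
   
   # Hypothetical dataset URI, used only for this reproduction.
   my_dataset = Dataset("s3://example-bucket/example.csv")
   
   with DAG(
       dag_id="dataset_producer",
       start_date=datetime(2025, 1, 1),
       schedule="@daily",
       catchup=False,
   ) as producer:
       # Finishing this task emits a dataset event for my_dataset.
       EmptyOperator(task_id="produce", outlets=[my_dataset])
   
   with DAG(
       dag_id="dataset_consumer",
       start_date=datetime(2025, 1, 1),
       schedule=[my_dataset],  # triggered by dataset events
   ) as consumer:
       EmptyOperator(task_id="consume")
   ```
   
   Remove `dataset_consumer` from the DAGs folder (or let the `dag_processor` 
deactivate it) while `produce` is running, then restore it after the run 
finishes: the event fired during the deactivation window does not trigger 
`dataset_consumer` on its own.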
   
   ### Operating System
   
   K8s build from base images
   
   ### Versions of Apache Airflow Providers
   
   N/A - can be reproduced on raw Airflow. 
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   Helmfile to K8s 
   
   ### Anything else?
   
   Root cause is:
   
   ```
   2025-05-03 03:09:03.673 | [2025-05-03T02:09:03.673+0000] {manager.py:537} INFO - DAG {DAG_ID} is missing and will be deactivated.
   2025-05-03 03:09:03.678 | [2025-05-03T02:09:03.678+0000] {manager.py:549} INFO - Deactivated 1 DAGs which are no longer present in file.
   2025-05-03 03:09:03.688 | [2025-05-03T02:09:03.688+0000] {manager.py:553} INFO - Deleted DAG {DAG_ID} in serialized_dag table
   ```
   
   This happens for us at scale every couple of days with the current setup.
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

