kirillsights opened a new issue, #55315:
URL: https://github.com/apache/airflow/issues/55315

   ### Apache Airflow version
   
   3.0.6
   
   ### If "Other Airflow 2 version" selected, which one?
   
   _No response_
   
   ### What happened?
   
   After upgrading to Airflow 3, the system started experiencing random DAG disappearances.
   Parsing intervals are set to be quite long, because we don't update DAGs between deploys.
   The interval-related config is:
   ```
     dag_processor:
       dag_file_processor_timeout: 300
       min_file_process_interval: 7200
       parsing_processes: 1
       print_stats_interval: 300
       refresh_interval: 1800
       stale_dag_threshold: 1800
   ```
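
   Note that with these values `min_file_process_interval` (7200 s) is four times `stale_dag_threshold` (1800 s), so once a DAG's parse time stops advancing it can be deactivated roughly 7200 - 1800 = 5400 s before its file is even due for the next scheduled re-parse.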
   
   Log analysis showed that once the DAG processor receives a single callback for any DAG, that DAG is soon marked as stale and disappears.
   It may come back later, once `min_file_process_interval` kicks in, but that is not always the case.
   
   Full log:
   
[dag_processor.log.zip](https://github.com/user-attachments/files/22181554/dag_processor.log.zip)
   
   Points of interest in the log:
   
   The last stats entry in which the affected DAG's file shows no error:
   ```
   2025-09-04T20:02:57.426Z | {"log":"2025-09-04T20:02:57.426093587Z stdout F dags-folder process_etl_app_data.py 1 0 0.96s 2025-09-04T19:58:39"}
   Then the first callback for it comes in:
   ```
   2025-09-04T20:05:08.722Z | {"log":"2025-09-04T20:05:08.722840445Z stdout F [2025-09-04T20:05:08.722+0000] {manager.py:464} DEBUG - Queuing TaskCallbackRequest CallbackRequest: filepath='process_etl_app_data.py' bundle_name='dags-folder' bundle_version=None msg=\"{'DAG Id': 'ds_etl', 'Task Id': 'etl_app_data', 'Run Id': 'manual__2025-09-04T20:00:00+00:00', 'Hostname': '10.4.142.168', 'External Executor Id': '5547a318-f6cc-4c02-92f5-90cbbb629e22'}\" ti=TaskInstance(id=UUID('01991650-8c36-70c5-a85b-44f6b572fe0f'), task_id='etl_app_data', dag_id='ds_etl', run_id='manual__2025-09-04T20:00:00+00:00', try_number=1, map_index=-1, hostname='10.4.142.168', context_carrier=None) task_callback_type=None context_from_server=TIRunContext(dag_run=DagRun(dag_id='ds_etl', run_id='manual__2025-09-04T20:00:00+00:00', logical_date=datetime.datetime(2025, 9, 4, 20, 0, tzinfo=Timezone('UTC')), data_interval_start=datetime.datetime(2025, 9, 4, 20, 0, 1, 133909, tzinfo=Timezone('UTC')), data_interval_end=datetime.datetime(2025, 9, 4, 20, 0, 1, 133909, tzinfo=Timezone('UTC')), run_after=datetime.datetime(2025, 9, 4, 20, 0, 1, 133909, tzinfo=Timezone('UTC')), start_date=datetime.datetime(2025, 9, 4, 20, 0, 1, 176556, tzinfo=Timezone('UTC')), end_date=None, clear_number=0, run_type=<DagRunType.MANUAL: 'manual'>, state=<DagRunState.RUNNING: 'running'>, conf={}, consumed_asset_events=[]), task_reschedule_count=0, max_tries=7, variables=[], connections=[], upstream_map_indexes=None, next_method=None, next_kwargs=None, xcom_keys_to_clear=[], should_retry=False) type='TaskCallbackRequest'"}
   ```
   Then, in the next stats printout, the same file reports an error (even though it has not changed at all):
   ```
   2025-09-04T20:12:58.040Z | {"log":"2025-09-04T20:12:58.040610948Z stdout F dags-folder process_etl_app_data.py 0 1 1.01s 2025-09-04T20:12:50"}
   Eventually the DAG from that file disappears:
   ```
   2025-09-04T20:57:53.765Z | {"log":"2025-09-04T20:57:53.765305682Z stdout F [2025-09-04T20:57:53.764+0000] {manager.py:310} INFO - DAG ds_etl is missing and will be deactivated."}
   
   Further analysis suggests that the DAG processor reuses the same parsing mechanism for callback execution: it updates the file's parse time but does not update the DAG's parse time, so the DAG eventually becomes stale.
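
   Below is a rough, simplified sketch of the interaction I suspect. The names (`FileStat`, `DagRow`, `handle_callback`, `is_stale`) are made up for illustration and are not the actual Airflow internals; the point is only that the file-level timestamp advances when a callback is processed while the DAG-level parse time does not, so the stale check eventually fires.

   ```python
   # Schematic illustration only: made-up names, not Airflow source code.
   from dataclasses import dataclass
   from datetime import datetime, timedelta

   STALE_DAG_THRESHOLD = timedelta(seconds=1800)         # stale_dag_threshold
   MIN_FILE_PROCESS_INTERVAL = timedelta(seconds=7200)   # min_file_process_interval


   @dataclass
   class FileStat:
       # Refreshed whenever the file is processed, including for callbacks.
       last_finish_time: datetime


   @dataclass
   class DagRow:
       dag_id: str
       # Only refreshed by a "real" parse that re-serializes the DAG.
       last_parsed_time: datetime


   def handle_callback(stat: FileStat, now: datetime) -> None:
       # Callbacks go through the same per-file processing path, so the
       # file-level timestamp moves forward...
       stat.last_finish_time = now
       # ...but the DAG row's parse time is left untouched.


   def is_due_for_reparse(stat: FileStat, now: datetime) -> bool:
       # Because the file looks recently processed, it is not re-queued yet.
       return now - stat.last_finish_time > MIN_FILE_PROCESS_INTERVAL


   def is_stale(dag: DagRow, now: datetime) -> bool:
       # The stale check looks at the DAG row, not the file stat.
       return now - dag.last_parsed_time > STALE_DAG_THRESHOLD


   t0 = datetime(2025, 9, 4, 20, 2)
   dag, stat = DagRow("ds_etl", t0), FileStat(t0)

   handle_callback(stat, t0 + timedelta(minutes=10))   # callback arrives
   now = t0 + timedelta(minutes=55)
   print(is_due_for_reparse(stat, now))  # False: no re-parse scheduled yet
   print(is_stale(dag, now))             # True: "DAG ds_etl is missing and will be deactivated"
   ```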
   
   ### What you think should happen instead?
   
   Processing callbacks should not affect DAG staleness, and we should still be able to configure long re-parse intervals for DAGs that rarely change.
   
   ### How to reproduce
   
   - Have a DAG with callbacks (a minimal example is sketched below)
   - Set `min_file_process_interval` higher than `stale_dag_threshold` and deploy Airflow
   - Execute the DAG so that its callbacks are triggered
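
   A minimal sketch of such a DAG, purely for illustration (the DAG/task names and the deliberately failing task are made up; it assumes Airflow 3 with the standard provider installed):

   ```python
   # Hypothetical reproduction DAG -- not the actual DAG from the logs above.
   from datetime import datetime

   from airflow import DAG
   from airflow.providers.standard.operators.python import PythonOperator


   def _notify(context):
       # Any callback body will do; what matters is that a TaskCallbackRequest
       # gets routed to the DAG processor for this file.
       print(f"task failed: {context['task_instance'].task_id}")


   def _fail():
       raise RuntimeError("force a failure so the on_failure_callback fires")


   with DAG(
       dag_id="ds_etl_repro",
       start_date=datetime(2025, 1, 1),
       schedule=None,
       catchup=False,
   ):
       PythonOperator(
           task_id="etl_app_data",
           python_callable=_fail,
           on_failure_callback=_notify,
       )
   ```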
   
   ### Operating System
   
   Debian Bookworm
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-amazon==9.12.0
   apache-airflow-providers-celery==3.12.2
   apache-airflow-providers-common-compat==1.7.3
   apache-airflow-providers-common-io==1.6.2
   apache-airflow-providers-common-messaging==1.0.5
   apache-airflow-providers-common-sql==1.27.5
   apache-airflow-providers-fab==2.4.1
   apache-airflow-providers-http==5.3.3
   apache-airflow-providers-postgres==6.2.3
   apache-airflow-providers-redis==4.2.0
   apache-airflow-providers-slack==9.1.4
   apache-airflow-providers-smtp==2.2.0
   apache-airflow-providers-standard==1.6.0
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   Helm chart deployed on AWS EKS cluster
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

