1fanwang opened a new issue, #56342:
URL: https://github.com/apache/airflow/issues/56342

   ### Apache Airflow version
   
   Other Airflow 2 version (please specify below)
   
   ### If "Other Airflow 2 version" selected, which one?
   
   2.9.2
   
   ### What happened?
   
   We observed a DAG run where all tasks executed and completed on July 16, 
2025, but the `dag_run` table shows `start_date` and `end_date` values from 
September 9, 2025 - 55 days later. The scheduler logs for DAG run completion 
also only appear on September 9, not on the original execution date.
   
   ### Observed Case
   
   **Run ID:** `manual__2025-07-16T19:50:37.477631+00:00`
   
   ### Database Evidence
   
   **1. dag_run table:**
   ```sql
   SELECT execution_date, start_date, end_date, state, queued_at, updated_at
   FROM dag_run 
   WHERE dag_id = 'example_dag'
     AND run_id = 'manual__2025-07-16T19:50:37.477631+00:00';
   ```
   
   | execution_date | start_date | end_date | state | queued_at | updated_at |
   |---|---|---|---|---|---|
   | 2025-07-16 19:50:37.477631 | 2025-09-09 20:08:11.865410 | 2025-09-09 
20:08:58.517820 | failed | 2025-09-09 18:48:25 | 2025-09-09 20:08:58.571346 |
   
   **Key observations:**
   - `execution_date`: July 16 (correct)
   - `start_date`: September 9 (55 days later)
   - `end_date`: September 9 (55 days later)
   - `queued_at`: September 9 (55 days later)
   - `updated_at`: September 9
   
   **2. task_instance table:**
   ```sql
   SELECT task_id, start_date, end_date, state, duration, try_number, 
queued_dttm
   FROM task_instance
   WHERE dag_id = 'example_dag'
     AND run_id = 'manual__2025-07-16T19:50:37.477631+00:00'
   ORDER BY start_date;
   ```
   
   All 5 tasks show:
   - `start_date`: 2025-07-16 19:50:xx through 20:57:xx
   - `end_date`: 2025-07-16 19:50:xx through 20:57:xx  
   - `state`: failed
   - Duration: ~4000 seconds (all tasks ran for about 1 hour)
   - `try_number`: 1 (no retries)
   
   **No task activity on September 9** - all task timestamps remain July 16.
   
   ### Scheduler Logs
   
   **September 9, 2025 20:08 (logs available):**
   ```
   [2025-09-09T20:08:58.517+0000] {dagrun.py:869} ERROR - Marking run <DagRun 
example_dag @ 2025-07-16 19:50:37.477631+00:00: 
     manual__2025-07-16T19:50:37.477631+00:00, state:running, queued_at: 
2025-09-09 18:48:25+00:00. externally triggered: True> failed
   
   [2025-09-09T20:08:58.570+0000] {dagrun.py:951} INFO - DagRun Finished: 
     dag_id=example_dag, 
     execution_date=2025-07-16 19:50:37.477631+00:00, 
     run_id=manual__2025-07-16T19:50:37.477631+00:00, 
     run_start_date=2025-09-09 20:08:11.865410+00:00, 
     run_end_date=2025-09-09 20:08:58.517820+00:00, 
     run_duration=46.65241, 
     state=failed
   ```
   
   all available logs
   ```
   airflow-scheduler-66b9c8684f-qbj58   9/9/2025, 1:08:58.5178885 PM    
[2025-09-09T20:08:58.517+0000] {dagrun.py:869} ERROR - Marking run <DagRun 
example_dag @ 2025-07-16 19:50:37.477631+00:00: 
manual__2025-07-16T19:50:37.477631+00:00, state:running, queued_at: 2025-09-09 
18:48:25+00:00. externally triggered: True> failed
   airflow-scheduler-66b9c8684f-qbj58   9/9/2025, 1:08:58.5708506 PM    
[2025-09-09T20:08:58.570+0000] {dagrun.py:951} INFO - DagRun Finished: 
dag_id=example_dag, execution_date=2025-07-16 19:50:37.477631+00:00, 
run_id=manual__2025-07-16T19:50:37.477631+00:00, run_start_date=2025-09-09 
20:08:11.865410+00:00, run_end_date=2025-09-09 20:08:58.517820+00:00, 
run_duration=46.65241, state=failed, external_trigger=True, run_type=manual, 
data_interval_start=2025-07-16 07:29:00+00:00, data_interval_end=2025-07-16 
17:29:00+00:00, dag_hash=e43446634a74294f3c5a805b28d768f6
   airflow-webserver-b9c5bf7b9-zl9t5    9/16/2025, 12:01:11.9502801 PM  
2620:119:5000:1003:1cf1:1c35:38d8:8810 - - [16/Sep/2025:19:01:11 +0000] "GET 
/api/v1/dags/example_dag/dagRuns/manual__2025-07-16T19:50:37.477631+00:00/taskInstances/fiftyTbShuffle4
 HTTP/1.1" 200 924 
"https://airflow.company.com/dags/example_dag/grid?execution_date=2025-09-12T06%3A12%3A44.648119%2B00%3A00&tab=details&dag_run_id=manual__2025-07-16T19%3A50%3A37.477631%2B00%3A00&num_runs=365&task_id=fiftyTbShuffle4";
 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, 
like Gecko) Chrome/139.0.0.0 Safari/537.36" 0.654790 seconds
   ```
   
   **July 16, 2025 (logs not available - outside retention window):**
   We cannot retrieve scheduler logs from July 16 due to log retention 
policies. We do not know if any "DagRun Finished" or state transition logs were 
emitted at that time.
   
   ### Audit Log Evidence
   
   ```sql
   SELECT dttm, event, owner, extra
   FROM log 
   WHERE dag_id = 'example_dag'
     AND dttm BETWEEN '2025-07-15' AND '2025-09-10'
     AND event IN ('paused', 'unpaused')
   ORDER BY dttm;
   ```
   
   Selected entries:
   - 2025-07-15 07:15:47 | paused | user1 | is_paused=false
   - 2025-07-15 18:21:02 | paused | user2 | is_paused=true
   - 2025-07-15 18:21:07 | paused | user2 | is_paused=false
   - 2025-07-16 21:02:38 | paused | user1 | is_paused=false
   - 2025-07-17 07:05:48 | paused | user1 | is_paused=true
   - 2025-07-25 20:13:23 | paused | user2 | is_paused=false
   - 2025-07-30 20:40:45 | paused | user1 | is_paused=false
   - 2025-07-30 20:41:32 | paused | user1 | is_paused=true
   - 2025-07-30 23:40:05 | paused | user2 | is_paused=false
   - 2025-07-31 23:00:31 | paused | user2 | is_paused=true
   - 2025-08-04 22:44:25 | paused | user2 | is_paused=false
   - 2025-08-14 18:16:57 | paused | user1 | is_paused=true
   - 2025-08-14 21:35:31 | paused | user1 | is_paused=false
   - 2025-08-25 17:18:32 | paused | user1 | is_paused=true
   - 2025-09-04 22:58:03 | paused | user1 | is_paused=true
   - 2025-09-09 03:51:45 | paused | user2 | is_paused=true
   
   The DAG experienced multiple pause/unpause cycles between July 17 and 
September 9.
   
   **No manual state changes found:**
   ```sql
   SELECT COUNT(*) 
   FROM log 
   WHERE dag_id = 'example_dag'
     AND run_id = 'manual__2025-07-16T19:50:37.477631+00:00'
     AND event IN ('clear', 'mark_success', 'mark_failed');
   -- Result: 0
   ```
   
   ### Listener Hook Observations
   
   We have custom listener plugins that capture DAG run state changes via hooks 
like `on_dag_run_failed()`. For this specific run, we observed multiple 
"failed" events emitted on **July 16** (the original execution date) according 
to our downstream event store. However, the `dag_run` table shows the state was 
still "running" on July 16.
   
   This raises questions about:
   - Timing of when listener hooks fire relative to database commits
   - Whether there's a race condition between event emission and state 
persistence
   - Why multiple "failed" events would fire for the same state transition
   
   ### Broader Pattern
   
   Query across all DAG runs:
   ```sql
   SELECT dag_id, 
          DATE(execution_date) as exec_date,
          DATE(start_date) as start_date,
          DATEDIFF(start_date, execution_date) as days_delay, 
          state
   FROM dag_run 
   WHERE DATEDIFF(start_date, execution_date) > 30
     AND execution_date >= '2025-07-01'
   ORDER BY days_delay DESC 
   LIMIT 10;
   ```
   
   Results show 10+ different DAGs exhibiting the same pattern with 30-80 day 
delays between `execution_date` and `start_date`/`end_date`.
   
   ### What you think should happen instead?
   
   The dag_run.start_date and dag_run.end_date fields should reflect when the 
tasks actually executed (July 16), not when the scheduler finally processed the 
run state (September 9).
   Expected behavior:
   
   dag_run.start_date should be July 16 19:50:37 (when tasks started)
   dag_run.end_date should be July 16 20:57:48 (when last task ended)
   These timestamps should not change when the scheduler finalizes the run 
state weeks/months later
   
   ### How to reproduce
   
   We have not been able to reliably reproduce this, but the pattern we 
observed was:
   
   1. Manual DAG trigger on July 16
   2. All tasks execute and fail between July 16 19:50-20:57
   3. DAG has multiple pause/unpause operations between July 17 and September 9
   4. On September 9, scheduler emits "DagRun Finished" log with 
`run_start_date` and `run_end_date` set to September 9
   5. `dag_run` table shows September 9 for `start_date`, `end_date`, 
`queued_at`, and `updated_at`
   
   We suspect this may be related to the DAG being paused while the run was in 
"running" state, but we cannot confirm this without July logs.
   
   ### Operating System
   
   NAME="Common Base Linux Mariner" VERSION="2.0.20241006" ID=mariner 
VERSION_ID="2.0" PRETTY_NAME="CBL-Mariner/Linux"
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-cncf-kubernetes 8.4.2
   apache-airflow-providers-common-compat   1.6.1
   apache-airflow-providers-common-io       1.5.4
   apache-airflow-providers-common-sql      1.26.0
   apache-airflow-providers-fab             1.5.3
   apache-airflow-providers-ftp             3.12.2
   apache-airflow-providers-http            5.2.0
   apache-airflow-providers-imap            3.8.2
   apache-airflow-providers-smtp            2.0.3
   apache-airflow-providers-sqlite          3.3.1
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   - Kubernetes-based deployment with separate scheduler and worker pods
   - MySQL backend database  
   - Using listener plugins to capture DAG run state changes
   
   ### Anything else?
   
   **What we need help understanding:**
   
   1. **What could cause `start_date`/`end_date` to be updated 55 days after 
execution?** Where in the codebase would this timestamp update occur?
   
   2. **Why does `queued_at` show September 9?** The log says "queued_at: 
2025-09-09 18:48:25" but the run was triggered manually on July 16.
   
   3. **What causes the DAG run to remain in "running" state for 55 days?** 
Task instances all completed on July 16, so what prevents the scheduler from 
finalizing the run?
   
   4. **Are pause/unpause operations relevant?** The audit log shows the DAG 
was paused the day after execution and had multiple pause cycles before 
September 9.
   
   5. **Why would listener hooks fire multiple times on July 16?** Our event 
store shows multiple "failed" notifications on July 16, but the DB shows 
"running" state at that time.
   
   6. **Is this expected behavior?** Should `start_date`/`end_date` represent 
actual task execution time or scheduler processing time?
   
   **What we've confirmed:**
   - ✅ Tasks did not re-execute (all task_instance timestamps are July 16)
   - ✅ No manual state changes (no clear/mark_success/mark_failed operations)
   - ✅ Pattern affects multiple unrelated DAGs
   - ✅ `updated_at` field confirms the record was modified on September 9
   
   We're happy to provide additional database queries, configuration details, 
or test cases. We can also contribute a fix once we understand the root cause 
and expected behavior.
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to