baolsen opened a new issue #10779:
URL: https://github.com/apache/airflow/issues/10779


   **Apache Airflow version**: 1.10.8
   
   **Environment**:
   
   - **Cloud provider or hardware configuration**: 4 VCPU 8GB RAM VM
   - **OS** (e.g. from /etc/os-release): RHEL 7.7
   - **Kernel** (e.g. `uname -a`): Linux <redacted> 3.10.0-957.el7.x86_64 #1 
SMP Thu Oct 4 20:48:51 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
   - **Install tools**:
   - **Others**:
   
   **What happened**:
   
   The Airflow scheduler is unable to create any DAG Runs if an externally 
triggered DAG Run already exists for the execution date the scheduler would 
create next. (This also blocks creation of all future DAG Runs.)
   
   This occurs when a DAG is triggered both externally and by the Airflow 
scheduler, for example while migrating from an external scheduler to the 
Airflow scheduler.
   
   **What you expected to happen**:
   
   I think the Airflow scheduler should skip over any externally triggered DAG 
Run whose execution date matches the one it is about to create, rather than 
producing an error. It should then be able to create the future DAG Runs (the 
ones not already created by the external process).
   
   The cause appears to be these two lines:
   
https://github.com/apache/airflow/blob/1959d6aee2b0fae21502c46eb3dcf711eae71391/airflow/jobs/scheduler_job.py#L558
   
https://github.com/apache/airflow/blob/1959d6aee2b0fae21502c46eb3dcf711eae71391/airflow/jobs/scheduler_job.py#L565
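
   To illustrate what I believe is happening, here is a simplified, 
self-contained simulation of the behaviour (this is my own sketch, not the 
actual scheduler code; the dict-based run records and function names are 
assumptions for illustration only):

```python
from datetime import datetime, timedelta

# Simplified stand-in for rows in the dag_run table.
existing_runs = [
    {"execution_date": datetime(2020, 8, 6), "external_trigger": True},
]

def next_scheduled_date(runs, start_date, interval):
    """Mimics the filtering in the lines linked above: when deciding the
    next execution_date, only runs with external_trigger == False are
    considered."""
    scheduler_runs = [r for r in runs if not r["external_trigger"]]
    if not scheduler_runs:
        # No scheduler-created runs are visible, so start from start_date,
        # even though an external run already occupies that date.
        return start_date
    return max(r["execution_date"] for r in scheduler_runs) + interval

start = datetime(2020, 8, 6)
candidate = next_scheduled_date(existing_runs, start, timedelta(days=1))

# The candidate collides with the externally triggered run, so the INSERT
# would violate the unique (dag_id, execution_date) constraint.
collision = any(r["execution_date"] == candidate for r in existing_runs)
print(candidate, collision)  # 2020-08-06 00:00:00 True
```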
   
   **How to reproduce it**:
   
   1. Create any DAG with Schedule = None and start_date = a few days ago, 
e.g. 2020-08-06.
   2. Create an externally triggered DAG Run via Browse -> DAG Runs -> 
Create, for an execution date which aligns with the start date at midnight, 
e.g. 2020-08-06 00:00:00.
   3. Change the DAG from Schedule = None to Schedule = @daily (for example).
   4. The scheduler will be unable to create DAG Runs because of a duplicate 
key on (dag_id, execution_date), colliding with the externally created DAG Run.
   
   Example from the scheduler logs for the associated DAG:

   ```
   sqlalchemy.exc.IntegrityError: (pyodbc.IntegrityError) ('23000', "[23000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Violation of UNIQUE KEY constraint 'UQ__dag_run__F78A98990F629538'. Cannot insert duplicate key in object 'dbo.dag_run'. The duplicate key value is (some_dag, 2020-08-06 00:00:00.000000). (2627) (SQLExecDirectW)")
   ```
   
   5. As a workaround, we can modify the DAG Run so that it no longer has the 
"externally triggered" flag, using the Browse -> DAG Runs UI. The Airflow 
scheduler is then able to detect the externally created DAG Run, skip over 
it, and create new DAG Runs.
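   
   The workaround in step 5 can be simulated in a simplified model (again, my 
own sketch with made-up field names, not Airflow's implementation): once the 
run stops being marked as externally triggered, the filtered query sees it 
and the next candidate execution_date moves forward.

```python
from datetime import datetime, timedelta

runs = [{"execution_date": datetime(2020, 8, 6), "external_trigger": True}]

def next_date(runs, start, interval):
    # Mirrors the current scheduler filter: external runs are invisible.
    visible = [r for r in runs if not r["external_trigger"]]
    if not visible:
        return start
    return max(r["execution_date"] for r in visible) + interval

start = datetime(2020, 8, 6)

# Before the workaround: the candidate collides with the external run.
assert next_date(runs, start, timedelta(days=1)) == datetime(2020, 8, 6)

# Workaround: clear the external_trigger flag (step 5 above).
runs[0]["external_trigger"] = False

# The run is now visible, so the scheduler advances to the next day.
assert next_date(runs, start, timedelta(days=1)) == datetime(2020, 8, 7)
```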
   
   **Anything else we need to know**:
   
   I'd like to understand whether there is a good reason for filtering out 
externally triggered DAG Runs when counting the active DAG Runs in the 
scheduler code (linked above). The filter appears to be intentional. I looked 
through git blame to try to understand why, but the code has been that way 
for 4+ years, so I couldn't find out. I hope someone familiar with the 
scheduler can comment.
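   
   One possible direction for a fix, expressed against a simplified model 
(this is only my suggestion as a sketch, not a patch against the real 
scheduler): advance past any execution date that already has a run, 
regardless of how that run was triggered, so the INSERT can never collide.

```python
from datetime import datetime, timedelta

runs = [{"execution_date": datetime(2020, 8, 6), "external_trigger": True}]

def next_free_date(runs, start, interval):
    """Skip any execution_date that already has a run, external or not,
    so the unique (dag_id, execution_date) constraint is never violated."""
    taken = {r["execution_date"] for r in runs}
    candidate = start
    while candidate in taken:
        candidate += interval
    return candidate

# The external run for 2020-08-06 is skipped; the next run lands on 2020-08-07.
print(next_free_date(runs, datetime(2020, 8, 6), timedelta(days=1)))
```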

