Haotian Wu created AIRFLOW-3045:
-----------------------------------
Summary: Duplicate entry error with MySQL when update
task_instances
Key: AIRFLOW-3045
URL: https://issues.apache.org/jira/browse/AIRFLOW-3045
Project: Apache Airflow
Issue Type: Bug
Components: database
Affects Versions: 1.10.0
Reporter: Haotian Wu
Assignee: Haotian Wu
h3. How to reproduce
# Setup apach-airflow==1.10.0 with MySQL, bring up both webserver and scheduler.
# Add a DAG and it becomes running but none of the task will be actually
executed.
# Manually trigger another run for the same dag, airflow scheduler will crash
with error {{sqlalchemy.exc.IntegrityError: (_mysql_exceptions.IntegrityError)
(1062, "Duplicate entry 'xxxxxxx-yyyyyy-YYYY-MM-DD ...' for key 'PRIMARY'")}}
h3. The Reason
In Airflow-1.10.0, execution_date field of task_instance is changed from
DateTime to Timestamp. However, in MySQL first Timestamp column in a table is
declared with {{ON UPDATE CURRENT_TIMESTAMP}} clause. Database in MySQL will
look like below after {{airflow initdb}}.
| Field | Type | Null | Key | Default | Extra |
| task_id | varchar(250) | NO | PRI | NULL | |
| dag_id | varchar(250) | NO | PRI | NULL | |
| execution_date | timestamp(6) | NO | PRI | CURRENT_TIMESTAMP(6) | on
update CURRENT_TIMESTAMP(6) |
# When a task_instance is updated from state NULL to state "scheduled", its
execution_date is also reset to current timestamp automatically.
# task_instance is linked to a given dag_run by same execution_date, so changed
execution_date means task_instance is no longer linked to any known dag_run.
# If there are more than one dag_run for the same dag, multiple task_instance
with same <task_id, dag_id> will be "unlinked" to their dag_run. Airflow
scheduler will try to update them to state NULL and thus try to update them to
the same <task_id, dag_id, execution_date> primary key.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)