patrickbrady-xaxis opened a new issue #21584: URL: https://github.com/apache/airflow/issues/21584
When migrating from 2.1.2 to 2.1.4, the migration modified the indices on the xcom table: it added a primary key on `dag_id`, `task_id`, `key`, and `execution_date`, and it removed a separate index on `dag_id`, `task_id`, and `execution_date`.

My team's implementation runs thousands of instances of the same dag_ids simultaneously with different configurations. With only the primary key on `dag_id`, `task_id`, `key`, and `execution_date`, every xcom update query could narrow only as far as `dag_id + task_id`, leaving thousands of rows to scan for a matching `execution_date`.

All of our tasks update xcom with status codes, and many of the tasks have similar run times across different dag runs. This led to large numbers of concurrent requests with `execution_date` as the only distinguishing factor, and to tasks intermittently failing due to deadlocks on the xcom table.

We patched the issue by adding the separate index on `dag_id`, `task_id`, and `execution_date` back to our xcom table. My guess is that there may be a more efficient index scheme, but this has resolved the deadlocking behavior so far.

My understanding is that in the latest Airflow version, `execution_date` has been replaced with `run_id`, but the overall scenario would be the same: where a system performs large numbers of concurrent runs of the same dags and tasks, the xcom table needs to be able to look up individual runs from an index to avoid scanning many rows and potentially deadlocking.

_Originally posted by @patrickbrady-xaxis in https://github.com/apache/airflow/issues/16982#issuecomment-1035050661_
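The index effect described in this issue can be illustrated with a small sketch. It rests on several assumptions: SQLite stands in for the production database (the issue does not name one), the table is reduced to the columns mentioned above, the lookup does not filter on `key`, and the index name `idx_xcom_dag_task_date` is hypothetical. It shows only which index the planner picks for a per-run lookup before and after the extra index exists; it does not reproduce the deadlocks.

```python
import sqlite3

# Per-run lookup with no predicate on "key" (an assumption about the queries).
LOOKUP = (
    "SELECT value FROM xcom "
    "WHERE dag_id = ? AND task_id = ? AND execution_date = ?"
)
PARAMS = ("example_dag", "example_task", "2022-02-15T00:00:00")


def plan_details(conn):
    """Return the detail strings reported by EXPLAIN QUERY PLAN for LOOKUP."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + LOOKUP, PARAMS).fetchall()
    # Each row is (id, parent, notused, detail); keep only the detail text.
    return [row[3] for row in rows]


conn = sqlite3.connect(":memory:")

# Simplified stand-in for the xcom table, with the primary key column order
# given in the issue: dag_id, task_id, key, execution_date.
conn.execute(
    """
    CREATE TABLE xcom (
        dag_id TEXT NOT NULL,
        task_id TEXT NOT NULL,
        "key" TEXT NOT NULL,
        execution_date TEXT NOT NULL,
        value BLOB,
        PRIMARY KEY (dag_id, task_id, "key", execution_date)
    )
    """
)

# Plan with only the primary key available.
plan_before = plan_details(conn)

# Hypothetical name for the index the reporter added back.
conn.execute(
    "CREATE INDEX idx_xcom_dag_task_date "
    "ON xcom (dag_id, task_id, execution_date)"
)

# Plan once the three-column index exists.
plan_after = plan_details(conn)

print("before:", plan_before)
print("after: ", plan_after)
```

Because `key` sits between `task_id` and `execution_date` in the primary key, a lookup that leaves `key` unconstrained can use only the leading two columns of that key. The separate index places `execution_date` directly after `task_id`, so all three predicates can be resolved from the index. Other databases plan queries differently, so the equivalent `EXPLAIN` output should be checked on the actual backend.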
