avinovarov opened a new issue #22504:
URL: https://github.com/apache/airflow/issues/22504
### Apache Airflow version
2.2.3
### What happened
**The problem**
- Under some load, with hundreds of DAGs running in parallel, Airflow
executors RANDOMLY throw errors on creating connections:
```
(some connections successfully created)
...
creating: raw/pg_services/folder/connection_name
[2022-03-23 02:39:43,102] {connection.py:404} ERROR - Unable to retrieve
connection from secrets backend (MetastoreBackend). Checking subsequent secrets
backend.
```
The error hits different connections each time and affects roughly 25-50% of
Airflow workers; the remaining workers create their connections successfully.
**These timeouts happen only when we have dozens of DAGs running in
parallel.**
### What you think should happen instead
We'd expect connections to be created reliably =)
### How to reproduce
- Deploy Airflow to k8s and add connections to multiple Postgres databases
(we have 75)
- Run dozens of DAGs in parallel.
### Operating System
k8s via rancher, on CentOS 7
### Versions of Apache Airflow Providers
apache-airflow-providers-postgres==2.4.0
### Deployment
Other 3rd-party Helm chart
### Deployment details
**Our setup**
- Airflow on Kubernetes, with KubernetesExecutor, installed with the [user
community Helm
chart](https://github.com/airflow-helm/charts/blob/main/charts/airflow/values.yaml)
- 75 connections to various sources, mainly Postgres databases, specified in
helm chart values, like this:
```
# this is how we add connections with credentials in helm chart values
connections:
  - id: pg_connection
    type: postgres
    host: database.domain.com
    login: $PG_LOGIN
    password: $PG_PASSWORD
    port: 5432
    schema: database
# and specify credentials with secrets below
connectionsTemplates:
  PG_LOGIN:
    kind: secret
    name: airflow-secrets
    key: PG_LOGIN
```
Of course we have k8s secrets deployed in our `airflow` namespace, and as
long as we run individual DAGs we observe no errors.
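
To make the placeholder mechanism concrete, here is a minimal sketch of how `$PG_LOGIN`-style values could be expanded into a connection definition. It assumes the placeholders behave like Python `string.Template` variables; the chart's actual templating code may differ, and `render_connection` and the credential values are hypothetical.

```python
from string import Template


def render_connection(spec, secrets):
    """Expand $NAME placeholders in string fields; leave other fields untouched.

    Hypothetical helper for illustration only, not the Helm chart's code.
    """
    return {
        key: Template(value).substitute(secrets) if isinstance(value, str) else value
        for key, value in spec.items()
    }


# Connection definition mirroring the Helm values above.
spec = {
    "id": "pg_connection",
    "type": "postgres",
    "host": "database.domain.com",
    "login": "$PG_LOGIN",
    "password": "$PG_PASSWORD",
    "port": 5432,
    "schema": "database",
}

# Made-up secret values standing in for the Kubernetes secret contents.
rendered = render_connection(
    spec, {"PG_LOGIN": "airflow_user", "PG_PASSWORD": "example-password"}
)
```

In this sketch a missing secret raises `KeyError` rather than silently leaving the placeholder in place, so a templating failure would be easy to tell apart from the runtime lookup error shown in the log above.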
### Anything else
**As long as we run individual DAGs we observe no errors.**
Based on the timeout error, we assume the problem lies in retrieving the
credentials (the lookup apparently falls back to a secondary secrets backend),
not in connecting to the Postgres databases themselves, but this is only a
guess. We also don't observe any overload on our Postgres databases.
Googling the error didn't help much, so we'd be grateful for any advice.
Thanks!
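
For context, the fallback mentioned above can be sketched as a loop over secrets backends that logs and moves on whenever one of them raises. This is a simplified, self-contained illustration and not Airflow's code: `FlakyBackend`, `DictBackend`, and `get_connection_from_backends` are hypothetical names, and the simulated timeout only stands in for whatever exception the real backend raised.

```python
import logging

log = logging.getLogger("connection_lookup_sketch")


class FlakyBackend:
    """Hypothetical backend that always fails, standing in for an overloaded store."""

    def get_connection(self, conn_id):
        raise TimeoutError("simulated timeout")


class DictBackend:
    """Hypothetical backend serving connections from an in-memory dict."""

    def __init__(self, connections):
        self.connections = connections

    def get_connection(self, conn_id):
        return self.connections.get(conn_id)


def get_connection_from_backends(conn_id, backends):
    """Return the first connection any backend yields; log backends that raise."""
    for backend in backends:
        try:
            conn = backend.get_connection(conn_id)
            if conn:
                return conn
        except Exception:
            # log.exception also records the traceback of the underlying failure.
            log.exception(
                "Unable to retrieve connection from secrets backend (%s). "
                "Checking subsequent secrets backend.",
                type(backend).__name__,
            )
    raise LookupError(f"The conn_id `{conn_id}` isn't defined")


backends = [
    FlakyBackend(),
    DictBackend({"pg_connection": "postgres://database.domain.com:5432/database"}),
]
result = get_connection_from_backends("pg_connection", backends)
```

Under this model the logged message only says that a backend raised; the traceback printed with it would name the underlying cause, so it may be worth capturing the full worker log around the error.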
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)