Ethan-Henley opened a new issue #20652:
URL: https://github.com/apache/airflow/issues/20652
### Apache Airflow version
2.1.0
### What happened
Noticed today that tasks connecting to a Postgres database via SQLAlchemy
succeed or fail inconsistently, apparently at random. The failures are
connection timeouts, with logs as shown below. The Postgres server itself is
running normally according to my tests. I did not see any such timeouts before
the holiday and have pushed only one change since: adding a sensor at the start
of the DAG where this was first noticed.
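One way to check whether the timeouts really are random is to probe the server's port repeatedly from inside a worker pod. The following is a minimal sketch, not part of the original setup: `probe_tcp` is a hypothetical helper, and it only tests the TCP handshake on port 5432, not Postgres authentication or SQLAlchemy.

```python
import socket
import time


def probe_tcp(host, port, attempts=20, timeout=5.0, pause=1.0):
    """Try to open a TCP connection `attempts` times.

    Returns a list of (succeeded, seconds_elapsed) tuples, one per attempt.
    """
    results = []
    for i in range(attempts):
        start = time.monotonic()
        try:
            # create_connection performs the TCP handshake only; it does not
            # authenticate or speak the Postgres protocol.
            with socket.create_connection((host, port), timeout=timeout):
                ok = True
        except OSError:
            # Covers timeouts, refused connections and DNS failures.
            ok = False
        results.append((ok, time.monotonic() - start))
        if i < attempts - 1:
            time.sleep(pause)
    return results
```

Calling `probe_tcp("POSTGRES_SERVER_ADDRESS", 5432)` from a worker pod and counting the `False` entries would show whether the failures cluster in time or on particular nodes, which might help separate a network problem from an Airflow one.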
### What you expected to happen
I would expect these tasks to succeed consistently (or fail consistently,
which would point to a clearer problem with the server or the connection).
### How to reproduce
1. Set up Azure account and environment
2. Create Azure Kubernetes Service (below)
```
az aks create -g PROJECTNAME-rg -n kube-PROJECTNAME \
  --node-vm-size Standard_D2_v3 --node-count 2 --generate-ssh-keys \
  --nodepool-name system
az aks nodepool update --resource-group PROJECTNAME-rg \
  --cluster-name kube-PROJECTNAME --name system --mode System
az aks nodepool add --resource-group PROJECTNAME-rg \
  --cluster-name kube-PROJECTNAME --name computepool \
  --node-vm-size Standard_D8_v3 --node-count 3 \
  --labels agentpool=app01 --mode User
az aks get-credentials --name kube-PROJECTNAME \
  --resource-group PROJECTNAME-rg
az aks nodepool update --resource-group PROJECTNAME-rg \
  --cluster-name kube-PROJECTNAME --name computepool \
  --enable-cluster-autoscaler --min-count 1 --max-count 3
```
3. Create Postgres servers for Airflow and for project data (below)
```
az postgres server create --resource-group PROJECTNAME-rg \
  --name PROJECTNAME-airflow --location eastus \
  --admin-user <USER> --admin-password <PASS> \
  --sku-name GP_Gen5_2 --version 11
az postgres server firewall-rule create --resource-group PROJECTNAME-rg \
  --server-name PROJECTNAME-airflow --name AirflowRule \
  --start-ip-address 0.0.0.0 --end-ip-address 0.0.0.0
az postgres server create --resource-group PROJECTNAME-rg \
  --name PROJECTNAME --location eastus \
  --admin-user <USER> --admin-password <PASS> \
  --sku-name GP_Gen5_8 --version 11
az postgres server firewall-rule create --resource-group PROJECTNAME-rg \
  --server-name PROJECTNAME --name DataRule \
  --start-ip-address 0.0.0.0 --end-ip-address 0.0.0.0
az postgres server update \
  --resource-group PROJECTNAME-rg \
  --name PROJECTNAME \
  --storage-size 550000
```
4. Set up DNS zone as per
https://github.com/kubernetes-sigs/external-dns/blob/master/docs/tutorials/azure.md
### Operating System
Running this in Kubernetes on an Azure server.
### Versions of Apache Airflow Providers
apache-airflow-providers-microsoft-azure==1.0.0
### Deployment
Other
### Deployment details
Helm version 3.0 as per
https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3,
uses `helm.sh/chart: airflow-2.1.0`
### Anything else
Log from a failed task run:
```
Running <TaskInstance: DAG_NAME.TASK_NAME DATE_TIME [queued]> on host POD_ADDRESS
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2336, in _wrap_pool_connect
    return fn()
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 364, in connect
    return _ConnectionFairy._checkout(self)
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 778, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 495, in checkout
    rec = pool._do_get()
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/impl.py", line 241, in _do_get
    return self._create_connection()
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 309, in _create_connection
    return _ConnectionRecord(self)
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 440, in __init__
    self.__connect(first_connect_check=True)
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 661, in __connect
    pool.logger.debug("Error on connect(): %s", e)
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.raise_(
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
    raise exception
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/pool/base.py", line 656, in __connect
    connection = pool._invoke_creator(self)
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/strategies.py", line 114, in connect
    return dialect.connect(*cargs, **cparams)
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 508, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/home/airflow/.local/lib/python3.8/site-packages/psycopg2/__init__.py", line 127, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: could not connect to server: Connection timed out
	Is the server running on host "POSTGRES_SERVER_ADDRESS" (POSTGRES_SERVER_IP) and accepting
	TCP/IP connections on port 5432?
```
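While the root cause is being investigated, a task could tolerate an occasional timeout by retrying the connection with exponential backoff. This is only a sketch under stated assumptions: `connect_with_retry` is a hypothetical helper that is not part of Airflow, SQLAlchemy or psycopg2, and it works around the symptom rather than fixing the timeouts.

```python
import random
import time


def connect_with_retry(connect, attempts=5, base_delay=1.0, max_delay=30.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call `connect()` until it succeeds or `attempts` is exhausted.

    Waits base_delay * 2**n seconds (capped at max_delay, plus up to 10%
    jitter) between tries and re-raises the last error on final failure.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return connect()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
```

In a task this might be called as `connect_with_retry(engine.connect, retry_on=(OperationalError,))`, where `engine` stands for the task's own SQLAlchemy engine and `OperationalError` for whichever class the task actually sees (the log above shows `psycopg2.OperationalError`). Narrowing `retry_on` keeps unrelated errors, such as bad credentials, from being retried.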
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)