ron-gaist opened a new issue, #56571:
URL: https://github.com/apache/airflow/issues/56571
### Apache Airflow version
3.1.0
### If "Other Airflow 2/3 version" selected, which one?
_No response_
### What happened?
Our large-scale setup includes:
* ~1000 Celery executor workers
* 15 API servers with 64 worker processes each (resources are sufficient; we
have checked utilization)
Also, possibly relevant:
* 6 scheduler replicas
* 2 DAG processors
* a PgBouncer with a sufficiently large `airflow` connection pool (it never
reaches the maximum)
* DAGs with up to 8k tasks (in parallel) and a final task that depends on
all of them; DAGs are usually smaller than that, averaging ~5k tasks
When all workers are active and working on task instances, they all emit the
following warning 4 times:
**[warning] Starting call to 'airflow.sdk.api.client.Client.request', this
is the %d time calling it. [airflow.sdk.api.client]**
and, on the 5th attempt, they get this error:
**[error] Task execute_workload[$celery_task_uuid] raise unexpected:
ReadTimeout('timed out') [celery.app.trace]**
We investigated this error and found that it comes from httpx's default
timeout.
From the httpx docs (https://www.python-httpx.org/advanced/timeouts/):
```
HTTPX is careful to enforce timeouts everywhere by default.
The default behavior is to raise a TimeoutException after 5 seconds of
network inactivity.
```
### What you think should happen instead?
Airflow should allow users to configure the timeout via `airflow.cfg` to
accommodate users with high-load systems.
For example:
```
[api]
# defaults to httpx's current 5-second timeout
httpx_timeout = 5
```
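A minimal sketch of how such a setting could reach the task SDK client, assuming the usual `AIRFLOW__SECTION__KEY` environment-variable override convention; the `httpx_timeout` key itself is the proposal above, not an existing option:

```python
import os

# Fallback matches httpx's current 5-second default.
_DEFAULT_TIMEOUT = 5.0


def get_api_timeout() -> float:
    """Resolve the proposed [api] httpx_timeout option via the standard
    environment-variable override (hypothetical key)."""
    raw = os.environ.get("AIRFLOW__API__HTTPX_TIMEOUT")
    if raw is None:
        return _DEFAULT_TIMEOUT
    try:
        return float(raw)
    except ValueError:
        # Ignore unparseable values rather than failing task startup.
        return _DEFAULT_TIMEOUT


print(get_api_timeout())
```

The value returned here would then be passed to the SDK's `httpx.Client(timeout=...)` when the client is constructed.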
Also, perhaps add a documentation section detailing best practices for
keeping the API server reliable under very high load.
### How to reproduce
(1) Run Airflow in a Kubernetes cluster with:
~1k Celery workers
~15 API server replicas (64 worker processes each; resource limits: 25Gi
RAM, 8 CPU cores)
(2) Run DAGs large enough that all 1k workers execute tasks in parallel
(each task should take more than 5 minutes)
(3) Observe the workers for ReadTimeout errors
### Operating System
Debian GNU/Linux 12 (bookworm)
### Versions of Apache Airflow Providers
apache-airflow-providers-celery==3.12.2
apache-airflow-providers-common-compat==1.7.3
apache-airflow-providers-common-io==1.6.2
apache-airflow-providers-common-sql==1.27.5
apache-airflow-providers-standard==1.6.0
apache-airflow-providers-postgres==6.2.3
### Deployment
Official Apache Airflow Helm Chart
### Deployment details
_No response_
### Anything else?
The problem occurs every time all workers are executing task instances (the
highest load).
logs:
```
[warning] Starting call to 'airflow.sdk.api.client.Client.request', this is
the 1st time calling it. [airflow.sdk.api.client]
[warning] Starting call to 'airflow.sdk.api.client.Client.request', this is
the 2nd time calling it. [airflow.sdk.api.client]
[warning] Starting call to 'airflow.sdk.api.client.Client.request', this is
the 3rd time calling it. [airflow.sdk.api.client]
[warning] Starting call to 'airflow.sdk.api.client.Client.request', this is
the 4th time calling it. [airflow.sdk.api.client]
[error] Task execute_workload[a7469ad-3481-4fd4-b8f236b37cf1] raise
unexpected: ReadTimeout('timed out') [celery.app.trace]
```
### Are you willing to submit PR?
- [x] Yes I am willing to submit a PR!
### Code of Conduct
- [x] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)