GitHub user Urus1201 added a comment to the discussion: Airflow Dynamic Task
Fails with httpx time out on an Instance with high load and around 300 active
DAGs.
The root cause is that your **API servers are heavily under-provisioned** for a
300-DAG, 8-scheduler setup. The Airflow 3 task SDK communicates with the API
server for XCom reads (and other task operations) — when the API server is
saturated, those calls time out.
## Why this works on dev but not prod
In Airflow 3, tasks fetch XComs by calling the REST API
(`/xcoms/{dag_id}/{run_id}/{task_id}/{key}`) rather than reading from the
metastore directly. With 8 schedulers and 300 active DAGs all running tasks
simultaneously, your 2 API server pods with **1 worker each** give you only
**2 worker processes** to handle all API traffic. That is far too few.
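To see the scale of the mismatch, here is a back-of-the-envelope estimate. Every number below is an illustrative assumption, not a measurement from this deployment:

```python
# Back-of-the-envelope capacity check. All numbers are illustrative
# assumptions, not measurements from this deployment.

api_workers = 2 * 1      # 2 pods x 1 synchronous worker each
service_time_s = 0.05    # assumed mean time to serve one XCom call

# A synchronous worker handles one request at a time, so total
# throughput is roughly workers / service_time.
capacity_rps = api_workers / service_time_s

# Assumed offered load: 300 DAGs whose running tasks make
# roughly 1 API call per second each at peak.
offered_rps = 300 * 1

print(f"capacity ~ {capacity_rps:.0f} req/s, offered ~ {offered_rps} req/s")
# prints: capacity ~ 40 req/s, offered ~ 300 req/s
```

When offered load exceeds capacity like this, requests queue at the API server until the httpx client on the task side gives up with a timeout.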
## Fix 1: Increase API server workers (highest impact)
In your Helm chart values:
```yaml
apiServer:
  replicas: 2
  workers: 4   # increase from 1 to 4 (or more)
```
Or if using gunicorn-based config:
```ini
[api_server]
workers = 4
worker_timeout = 120
```
This alone will likely resolve the timeout.
## Fix 2: Increase the SDK client timeout
Set a higher timeout for the task SDK's HTTP client:
```ini
[api]
# Timeout in seconds for task SDK → API server calls
client_connect_timeout = 10
client_read_timeout = 60 # default is often too low under load
```
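If you set Airflow options through the environment instead of a config file (for example in the Helm chart's `env:` section), Airflow maps `AIRFLOW__{SECTION}__{KEY}` variables onto config options. Assuming the `[api]` keys above are correct for your version, the equivalent would be:

```yaml
env:
  - name: AIRFLOW__API__CLIENT_CONNECT_TIMEOUT
    value: "10"
  - name: AIRFLOW__API__CLIENT_READ_TIMEOUT
    value: "60"
```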
## Fix 3: Check your corporate network proxy
The stack trace shows the timeout happening at the TCP read level
(`httpcore._sync.http11`), which can be caused by a proxy or firewall inserting
latency on pod-to-pod HTTP calls. If your workers and API servers communicate
through a corporate proxy, bypass it for internal traffic:
```yaml
env:
  - name: NO_PROXY
    value: "localhost,127.0.0.1,.cluster.local,<api-server-service-name>"
```
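You can sanity-check which hosts a `NO_PROXY` value actually covers with the Python standard library, which follows the same comma-separated, leading-dot suffix-matching convention most HTTP clients (httpx included) honor. The hostnames below are placeholders; substitute your real API server service name:

```python
import os
import urllib.request

# Placeholder value mirroring the Helm snippet above.
os.environ["no_proxy"] = "localhost,127.0.0.1,.cluster.local,airflow-api-server"

# proxy_bypass_environment() returns True when a host would skip the proxy.
for host in ("airflow-api-server",
             "airflow-api-server.airflow.svc.cluster.local",
             "external.example.com"):
    print(host, urllib.request.proxy_bypass_environment(host))
# The first two bypass the proxy; external.example.com does not.
```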
## Fix 4: Reduce XCom payload size
Your dynamic task uses `.map()`, which generates an XCom entry per mapped
instance. Large XCom payloads (lists of job descriptors) passed between tasks
increase API server response time. Consider:
- Using an XCom backend (S3/GCS) for large payloads
- Passing only IDs/keys between tasks, fetching the full data inside each task
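Framework aside, the second bullet is just the following pattern, sketched in plain Python. `fetch_job_descriptor` is a hypothetical stand-in for your real lookup (database query, object-store read, etc.):

```python
# Sketch of "pass only IDs through XCom, fetch full data inside the task".

def list_job_ids():
    # Upstream task: return only small keys, not full job descriptors,
    # so the XCom payload stays tiny.
    return ["job-1", "job-2", "job-3"]

def fetch_job_descriptor(job_id):
    # Hypothetical lookup done *inside* the mapped task, so the large
    # payload never passes through XCom / the API server.
    return {"id": job_id, "payload": "large job description ..."}

def process_job(job_id):
    job = fetch_job_descriptor(job_id)
    return f"processed {job['id']}"

# Mapped execution: each task instance receives only a short string ID.
results = [process_job(job_id) for job_id in list_job_ids()]
print(results)  # ['processed job-1', 'processed job-2', 'processed job-3']
```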
## TL;DR
| Change | Expected impact |
|---|---|
| Increase API workers from 1 to 4+ | Eliminates timeout under load |
| Raise `client_read_timeout` | Prevents timeout for slow-but-valid responses |
| Add `NO_PROXY` for internal traffic | Fixes proxy-induced latency |
| Reduce XCom payload size | Reduces API server load long-term |
GitHub link:
https://github.com/apache/airflow/discussions/64638#discussioncomment-16503706