GitHub user Urus1201 added a comment to the discussion: Airflow Dynamic Task 
Fails with httpx time out on an Instance with high load and around 300 active 
DAGs.

The root cause is that your **API servers are heavily under-provisioned** for a 
300-DAG, 8-scheduler setup. The Airflow 3 task SDK communicates with the API 
server for XCom reads (and other task operations) — when the API server is 
saturated, those calls time out.

## Why this works on dev but not prod

In Airflow 3, tasks fetch XComs by calling the REST API 
(`/xcoms/{dag_id}/{run_id}/{task_id}/{key}`) rather than reading from the 
metastore directly. With 8 schedulers and 300 active DAGs all running tasks 
simultaneously, your 2 API server pods with **1 worker each** provide only 
**2 worker processes** to handle all API traffic. That is far too few.
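The failure mode can be reproduced in miniature with nothing but the standard library: a server that answers slower than the client is willing to wait produces exactly the kind of read timeout seen in the stack trace. The endpoint path and timings below are illustrative stand-ins, not Airflow's actual defaults:

```python
import http.server
import socket
import threading
import time
import urllib.error
import urllib.request

class SlowHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(2)  # simulate a saturated worker: respond slower than the client will wait
        try:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"{}")
        except (BrokenPipeError, ConnectionResetError):
            pass  # client already gave up

    def log_message(self, *args):
        pass  # keep the example output quiet

server = http.server.HTTPServer(("127.0.0.1", 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

try:
    # Path mirrors the XCom endpoint shape; the server here is a stand-in.
    urllib.request.urlopen(
        f"http://127.0.0.1:{port}/xcoms/my_dag/run_1/my_task/return_value",
        timeout=0.5,  # read timeout shorter than the server's response latency
    )
    timed_out = False
except (TimeoutError, socket.timeout, urllib.error.URLError):
    timed_out = True

server.shutdown()
print(timed_out)
```

Under load, every worker behaves like this slow handler: the response would eventually arrive, but the client has already given up.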

## Fix 1: Increase API server workers (highest impact)

In your Helm chart values:

```yaml
apiServer:
  replicas: 2
  workers: 4   # Increase from 1 to 4 (or more)
```

Or, if using a gunicorn-based config:

```ini
[api_server]
workers = 4
worker_timeout = 120
```
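Whichever form you use, the same knob can usually be reached through environment variables via Airflow's `AIRFLOW__<SECTION>__<KEY>` convention. The variable name below assumes the `[api_server]` section shown above; adjust it to whatever section your Airflow version actually uses:

```yaml
env:
  - name: AIRFLOW__API_SERVER__WORKERS
    value: "4"
```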

This alone will likely resolve the timeout.

## Fix 2: Increase the SDK client timeout

Set a higher timeout for the task SDK's HTTP client:

```ini
[api]
# Timeout in seconds for task SDK → API server calls
client_connect_timeout = 10
client_read_timeout = 60   # default is often too low under load
```

## Fix 3: Check your corporate network proxy

The stack trace shows the timeout happening at the TCP read level 
(`httpcore._sync.http11`), which can be caused by a proxy or firewall inserting 
latency on pod-to-pod HTTP calls. If your workers and API servers communicate 
through a corporate proxy, bypass it for internal traffic:

```yaml
env:
  - name: NO_PROXY
    value: "localhost,127.0.0.1,.cluster.local,<api-server-service-name>"
```

## Fix 4: Reduce XCom payload size

Your dynamic task uses `.map()`, which creates an XCom entry per mapped 
instance. Large XCom payloads (lists of job descriptors) passed between tasks 
increase API server response time. Consider:
- Using an XCom backend (S3/GCS) for large payloads
- Passing only IDs/keys between tasks, fetching the full data inside each task
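The second bullet can be sketched in plain Python, with no Airflow dependency. Here `JOBS` and `load_job_descriptor` are hypothetical stand-ins for your real store and lookup (database, S3, API, ...):

```python
# Instead of mapping over full job descriptors (large XCom payloads),
# map over small IDs and fetch the descriptor inside each task.

JOBS = {  # stands in for an external store
    "job-1": {"cmd": "spark-submit", "args": ["--mem", "4g"]},
    "job-2": {"cmd": "spark-submit", "args": ["--mem", "8g"]},
}

def list_job_ids() -> list[str]:
    # Upstream task: return only small keys, so the XCom payload stays tiny.
    return sorted(JOBS)

def load_job_descriptor(job_id: str) -> dict:
    # Called *inside* each mapped task, so the payload never transits XCom.
    return JOBS[job_id]

def run_job(job_id: str) -> str:
    desc = load_job_descriptor(job_id)
    return f"{desc['cmd']} {' '.join(desc['args'])}"

# In an Airflow DAG this would be the mapped task over list_job_ids();
# here we map in plain Python just to show the data flow.
results = [run_job(j) for j in list_job_ids()]
print(results)
```

Only the short ID strings ever cross the XCom/API boundary; the heavy descriptor stays in the external store until a task actually needs it.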

## TL;DR

| Change | Expected impact |
|---|---|
| Increase API workers from 1 to 4+ | Eliminates timeout under load |
| Raise `client_read_timeout` | Prevents timeout for slow-but-valid responses |
| Add `NO_PROXY` for internal traffic | Fixes proxy-induced latency |
| Reduce XCom payload size | Reduces API server load long-term |

GitHub link: 
https://github.com/apache/airflow/discussions/64638#discussioncomment-16503706
