GitHub user gilt-cl created a discussion: Scaling Airflow 3 on EKS — API server 
OOMs, PgBouncer saturation, and health check flakiness at 8K concurrent tasks

# Airflow 3 on EKS is way hungrier than Airflow 2 — hitting OOMs, PgBouncer 
bottlenecks, and flaky health checks at scale

We're migrating from Airflow 2.10.0 to 3.1.7 (self-managed EKS, not 
Astronomer/MWAA) and running into scaling issues during stress testing that we 
never had in Airflow 2. Our platform is fairly large — **~450 DAGs**, some with 
~200 tasks, doing about **1,500 DAG runs / 80K task instances per day**. At 
peak we're looking at **~140 concurrent DAG runs and ~8,000 tasks** running at 
the same time across a mix of Celery and KubernetesExecutor.

Would love to hear from anyone running Airflow 3 at similar scale.

## Our setup

- **Airflow 3.1.7**, Helm chart 1.18.0, Python 3.12
- **Executor:** hybrid `CeleryExecutor,KubernetesExecutor`
- **Infra:** AWS EKS on Graviton4 ARM64 nodes (c8g.2xlarge, m8g.2xlarge, 
x8g.2xlarge)
- **Database:** RDS PostgreSQL db.m7g.2xlarge (8 vCPU / 32 GiB) behind PgBouncer
- **XCom backend:** custom S3 backend (`S3XComBackend`)
- **Autoscaling:** KEDA for Celery workers and triggerer (values sketched just below)
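
For completeness, the autoscaling is just the chart's built-in KEDA support. Roughly the values below; the replica bounds are real, the polling/cooldown numbers here are illustrative rather than copied from our file:

```yaml
# Sketch of the autoscaling-related Helm values (chart 1.18.0).
# Replica bounds match the topology table below; polling/cooldown are illustrative.
workers:
  keda:
    enabled: true
    minReplicaCount: 2
    maxReplicaCount: 64
    pollingInterval: 5     # seconds (illustrative)
    cooldownPeriod: 60     # seconds (illustrative)
triggerer:
  keda:
    enabled: true          # triggerer KEDA support is newer; double-check your chart version
    minReplicaCount: 1
```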

### Current stress-test topology


| Component      | Replicas | Memory            | Notes                                                                              |
| -------------- | -------- | ----------------- | ---------------------------------------------------------------------------------- |
| API Server     | 3        | 8Gi               | 6 Uvicorn workers each (18 total)                                                   |
| Scheduler      | 2        | 8Gi               | Had to drop from 4 due to [#57618](https://github.com/apache/airflow/issues/57618)  |
| DagProcessor   | 2        | 3Gi               | Standalone, 8 parsing processes                                                     |
| Triggerer      | 1+       | KEDA-scaled       |                                                                                     |
| Celery Workers | 2–64     | 16Gi              | KEDA-scaled, `worker_concurrency: 16`                                               |
| PgBouncer      | 1        | 512Mi / 1000m CPU | `metadataPoolSize: 500`, `maxClientConn: 5000`                                      |


Key config:

```ini
AIRFLOW__CORE__PARALLELISM = 2048
AIRFLOW__SCHEDULER__MAX_TIS_PER_QUERY = 512
AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC = 5           # was 2 in Airflow 2
AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC = 5     # was 2
AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD = 60
AIRFLOW__KUBERNETES_EXECUTOR__WORKER_PODS_CREATION_BATCH_SIZE = 32
AIRFLOW__OPERATORS__DEFAULT_DEFERRABLE = True
```

We also had to relax liveness probes across the board (`timeoutSeconds: 60`, 
`failureThreshold: 10`) and extend the API server startup probe to 5 minutes — 
the Helm chart defaults were way too aggressive for our load.
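
For anyone hitting the same thing, this is roughly what the relaxed probes look like in our values file. Treat it as a sketch and double-check the key paths against chart 1.18.0 for your components:

```yaml
# Rough sketch of the relaxed probes (illustrative; verify key paths against chart 1.18.0).
apiServer:
  livenessProbe:
    timeoutSeconds: 60
    failureThreshold: 10
  startupProbe:
    # ~5 minutes total, so all 6 Uvicorn workers can finish importing the stack
    periodSeconds: 10
    failureThreshold: 30
scheduler:
  livenessProbe:
    timeoutSeconds: 60
    failureThreshold: 10
triggerer:
  livenessProbe:
    timeoutSeconds: 60
    failureThreshold: 10
dagProcessor:
  livenessProbe:
    timeoutSeconds: 60
    failureThreshold: 10
```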

One thing worth calling out: we never set CPU requests/limits on the API 
server, scheduler, or DagProcessor. We got away with that in Airflow 2, but it 
matters a lot more now that the API server handles execution traffic too.
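
If/when we do add them, we're picturing something like the block below. The numbers are purely illustrative, not something we've tested; sanity-checking them is basically question 1 further down.

```yaml
# Hypothetical starting point for CPU requests -- NOT validated, just to make the question concrete.
apiServer:
  resources:
    requests:
      cpu: "2"
      memory: 8Gi
    limits:
      memory: 8Gi        # deliberately no CPU limit, to avoid throttling
scheduler:
  resources:
    requests:
      cpu: "2"
      memory: 8Gi
dagProcessor:
  resources:
    requests:
      cpu: "1"
      memory: 3Gi
```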

---

## What's going wrong

### 1. API server keeps getting OOMKilled

This is the big one. Under load, the API server pods hit their memory limit and 
get killed (exit code 137). We first saw this with just ~50 DAG runs and 
150–200 concurrent tasks — nowhere near our production load.

Here's what we're seeing:

- Each Uvicorn worker sits at ~800Mi–1Gi under load
- Memory usage correlates with the number of KubernetesExecutor pods, not UI 
traffic
- When execution traffic overwhelms the API server, the UI goes down with it 
(503s)

Our best guess: Airflow 3 serves both the Core API (UI, REST) and the Execution 
API (task heartbeats, XCom pushes, state transitions) on the same Uvicorn 
workers. So when hundreds of worker pods are hammering the API server with 
heartbeats and XCom data, it creates memory pressure that takes down everything 
— including the UI.

We saw [#58395](https://github.com/apache/airflow/discussions/58395) which 
describes something similar (fixed in 3.1.5 via DB query fixes). We're on 3.1.7 
and still hitting it — our issue seems more about raw request volume than query 
inefficiency.
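
One idea we're considering but haven't tried: run a second API server deployment that serves only execution traffic and point the workers at it, so heartbeat/XCom load can't take the UI down with it. If we're reading the config right, the worker-side knob would be something like this (the Service name is made up; corrections very welcome):

```yaml
# Hypothetical: route task execution traffic to a dedicated API server Service.
# Assumes AIRFLOW__CORE__EXECUTION_API_SERVER_URL is the right knob in 3.1.x.
workers:
  env:
    - name: AIRFLOW__CORE__EXECUTION_API_SERVER_URL
      value: "http://airflow-execution-api:8080/execution/"   # made-up Service name
```

The same env would presumably also need to go into the KubernetesExecutor pod template so K8s task pods pick it up.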

### 2. PgBouncer is the bottleneck

With 64 Celery workers + hundreds of K8s executor pods + schedulers + API 
servers + DagProcessors all going through a **single PgBouncer pod**, the 
connection pool gets saturated:

- Liveness probes (`airflow jobs check`) queue up waiting for a DB connection
- Heartbeat writes get delayed 30–60 seconds
- KEDA's PostgreSQL trigger fails with `"connection refused"` when PgBouncer is 
overloaded
- The UI reports components as unhealthy because heartbeat timestamps go stale

We've already bumped pool sizes from the defaults (`metadataPoolSize: 10`, 
`maxClientConn: 100`) up to `500` / `5000`, but it still saturates at peak.
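
For reference, the PgBouncer side of our values file currently looks roughly like this; the `extraIni` line is the prepared-statements fix mentioned under "What we've already tried":

```yaml
# Current PgBouncer overrides (single replica), roughly as deployed with chart 1.18.0.
pgbouncer:
  enabled: true
  metadataPoolSize: 500
  maxClientConn: 5000
  resources:
    requests:
      cpu: 1000m
      memory: 512Mi
  # PgBouncer >= 1.21 setting; fixed KEDA's prepared-statement errors for us
  extraIni: |
    max_prepared_statements = 100
```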

**One thing I really want to understand:** with AIP-72 in Airflow 3, are 
KubernetesExecutor worker pods still connecting directly to the metadata DB 
through PgBouncer? The pod template still includes `SQL_ALCHEMY_CONN` and the 
init containers still run `airflow db check`. 
[#60271](https://github.com/apache/airflow/issues/60271) seems to track this. 
If every K8s executor pod is opening its own PgBouncer connection, that would 
explain why our pool is exhausted.

### 3. API server takes forever to start

Each Uvicorn worker independently loads the full Airflow stack — FastAPI 
routes, providers, plugins, DAG parsing init, DB connection pools. With 6 
workers, startup takes **4+ minutes**. The Helm chart default startup probe 
(60s) is nowhere close to enough, and rolling deployments are painfully slow 
because of it.
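
One thing we're debating is whether to run more, smaller API server pods with fewer Uvicorn workers each, so startup is shorter and a single OOM kill takes out less capacity. Assuming `[api] workers` / `AIRFLOW__API__WORKERS` is what controls the per-pod Uvicorn worker count (happy to be corrected), that would look roughly like:

```yaml
# Hypothetical alternative topology: more, smaller API server pods.
# Assumes AIRFLOW__API__WORKERS controls the Uvicorn worker count per pod.
apiServer:
  replicas: 6
  env:
    - name: AIRFLOW__API__WORKERS
      value: "3"
  resources:
    requests:
      memory: 4Gi
    limits:
      memory: 4Gi
```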

### 4. False-positive health check failures

Even with `SCHEDULER_HEALTH_CHECK_THRESHOLD=60`, the UI flags components as 
unhealthy during peak load. They're actually fine — they just can't write 
heartbeats fast enough because PgBouncer is contended:

```
Triggerer: "Heartbeat recovered after 33.94 seconds"
DagProcessor: "Heartbeat recovered after 29.29 seconds"
```

---

## What we'd like help with

Given our scale (450 DAGs, 8K concurrent tasks at peak, 80K task instances/day), any 
guidance on these would be great:

1. **Sizing and topology** — What should the API server, scheduler, and worker 
setup look like at this scale? How many replicas, how many workers per replica, 
and what CPU/memory requests make sense? We've never set CPU requests on 
anything and we're starting to think that's a big gap.
2. **PgBouncer** — Is a single replica realistic at this scale, or should we 
run multiple? What pool sizes have worked for others? And the big question: do 
K8s executor pods still hit the DB directly in 3.1.7, or does everything go 
through the Execution API now? 
([#60271](https://github.com/apache/airflow/issues/60271))
3. **General lessons learned** — If you've migrated a large-scale self-hosted 
Airflow 2 setup to Airflow 3, what do you wish you'd known going in?

---

## What we've already tried

- Bumped API server memory from 3Gi → 8Gi and added a third replica
- Increased PgBouncer pool sizes from defaults to 500/5000, added CPU requests
- Relaxed liveness probes everywhere (timeouts 20s → 60s, thresholds 5 → 10)
- Bumped health check threshold (30 → 60) and heartbeat intervals (2s → 5s)
- Removed `cluster-autoscaler.kubernetes.io/safe-to-evict: "true"` from the API 
server (was causing premature eviction)
- Doubled `WORKER_PODS_CREATION_BATCH_SIZE` (16 → 32) and `parallelism` (1024 → 
2048)
- Extended API server startup probe to 5 minutes
- Added `max_prepared_statements = 100` to PgBouncer (fixed KEDA prepared 
statement errors)

## Airflow 2 vs 3 — what changed

For context, here's a summary of the differences between our Airflow 2 
production setup and what we've had to do for Airflow 3. The general trend is 
that everything needs more resources and more tolerance for slowness:


| Area                          | Airflow 2.10.0    | Airflow 3.1.7                        | Why                                                                                        |
| ----------------------------- | ----------------- | ------------------------------------ | ------------------------------------------------------------------------------------------ |
| Scheduler memory              | 2–4Gi             | 8Gi                                  | Scheduler is doing more work                                                               |
| Webserver → API server memory | 3Gi               | 6–8Gi                                | API server is much heavier than the old Flask webserver                                    |
| Worker memory                 | 8Gi               | 12–16Gi                              |                                                                                            |
| Celery concurrency            | 16                | 12–16                                | Reduced in smaller envs                                                                    |
| PgBouncer pools               | 1000 / 500 / 5000 | 100 / 50 / 2000 (base), 500 in prod  | Reduced for shared-RDS safety; prod overrides                                              |
| Parallelism                   | 64–1024           | 192–2048                             | Roughly 2x across all envs                                                                 |
| Scheduler replicas (prod)     | 4                 | 2                                    | KubernetesExecutor race condition [#57618](https://github.com/apache/airflow/issues/57618) |
| Liveness probe timeouts       | 20s               | 60s                                  | DB contention makes probes slow                                                            |
| API server startup            | ~30s              | ~4 min                               | Uvicorn workers load the full stack sequentially                                           |
| CPU requests                  | Never set         | Still not set                        | Planning to add; probably a big gap                                                        |


---

Happy to share Helm values, logs, or whatever else would help. Would really 
appreciate hearing from anyone dealing with similar stuff.

GitHub link: https://github.com/apache/airflow/discussions/62117
