GitHub user gilt-cl created a discussion: Scaling Airflow 3 on EKS — API server OOMs, PgBouncer saturation, and health check flakiness at 8K concurrent tasks
# Airflow 3 on EKS is way hungrier than Airflow 2 — hitting OOMs, PgBouncer bottlenecks, and flaky health checks at scale
We're migrating from Airflow 2.10.0 to 3.1.7 (self-managed on EKS, not Astronomer/MWAA) and hitting scaling problems during stress testing that we never saw in Airflow 2. Our platform is fairly large: **~450 DAGs**, some with ~200 tasks, doing about **1,500 DAG runs / 80K task instances per day**. At peak that means **~140 concurrent DAG runs and ~8,000 concurrent tasks** across a mix of Celery and KubernetesExecutor.

Would love to hear from anyone running Airflow 3 at similar scale.
## Our setup
- **Airflow 3.1.7**, Helm chart 1.18.0, Python 3.12
- **Executor:** hybrid `CeleryExecutor,KubernetesExecutor`
- **Infra:** AWS EKS on Graviton4 ARM64 nodes (c8g.2xlarge, m8g.2xlarge,
x8g.2xlarge)
- **Database:** RDS PostgreSQL db.m7g.2xlarge (8 vCPU / 32 GiB) behind PgBouncer
- **XCom backend:** custom S3 backend (`S3XComBackend`)
- **Autoscaling:** KEDA for Celery workers and triggerer
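
For concreteness, the autoscaling side of our values file looks roughly like the sketch below. Treat it as illustrative: it assumes the official apache-airflow chart's key names (`workers.keda.*`, `triggerer.keda.*`), trimmed to the relevant bits, with the 2–64 replica bounds from the topology table that follows.

```yaml
# Trimmed sketch of the autoscaling-related Helm values (chart 1.18.0).
# Key names follow the official apache-airflow chart; verify against your
# chart version before reusing.
executor: "CeleryExecutor,KubernetesExecutor"   # hybrid executor

workers:
  keda:
    enabled: true            # scale Celery workers on queued/running task count
    minReplicaCount: 2
    maxReplicaCount: 64
  resources:
    limits:
      memory: 16Gi

triggerer:
  keda:
    enabled: true            # triggerer is also KEDA-scaled (1+ replicas)
    minReplicaCount: 1
```
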
### Current stress-test topology
| Component      | Replicas | Memory            | Notes                                                                               |
| -------------- | -------- | ----------------- | ----------------------------------------------------------------------------------- |
| API Server     | 3        | 8Gi               | 6 Uvicorn workers each (18 total)                                                    |
| Scheduler      | 2        | 8Gi               | Had to drop from 4 due to [#57618](https://github.com/apache/airflow/issues/57618)   |
| DagProcessor   | 2        | 3Gi               | Standalone, 8 parsing processes                                                      |
| Triggerer      | 1+       |                   | KEDA-scaled                                                                          |
| Celery Workers | 2–64     | 16Gi              | KEDA-scaled, `worker_concurrency: 16`                                                |
| PgBouncer      | 1        | 512Mi / 1000m CPU | `metadataPoolSize: 500`, `maxClientConn: 5000`                                       |
Key config:
```ini
AIRFLOW__CORE__PARALLELISM = 2048
AIRFLOW__SCHEDULER__MAX_TIS_PER_QUERY = 512
AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC = 5 # was 2 in Airflow 2
AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC = 5 # was 2
AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD = 60
AIRFLOW__KUBERNETES_EXECUTOR__WORKER_PODS_CREATION_BATCH_SIZE = 32
AIRFLOW__OPERATORS__DEFAULT_DEFERRABLE = True
```
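
For anyone mapping this onto the official chart, the same settings can also be expressed through the chart's `config:` block; a sketch, with section/option names simply mirroring the env vars above:

```yaml
# Sketch: the same settings expressed via the chart's `config:` block,
# which the chart renders into the deployed Airflow configuration.
config:
  core:
    parallelism: 2048
  scheduler:
    max_tis_per_query: 512
    job_heartbeat_sec: 5                   # was 2 in Airflow 2
    scheduler_heartbeat_sec: 5             # was 2
    scheduler_health_check_threshold: 60
  kubernetes_executor:
    worker_pods_creation_batch_size: 32
  operators:
    default_deferrable: "True"
```
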
We also had to relax liveness probes across the board (`timeoutSeconds: 60`,
`failureThreshold: 10`) and extend the API server startup probe to 5 minutes —
the Helm chart defaults were way too aggressive for our load.
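
Concretely, the liveness overrides look roughly like this in our values file (a sketch against chart 1.18.0; we apply the same relaxation to the triggerer and DagProcessor, and the startup-probe extension is sketched further down in the startup section):

```yaml
# Sketch of the relaxed liveness probes (same idea for triggerer and
# dagProcessor). Chart 1.18.0 key names; verify against your values schema.
apiServer:
  livenessProbe:
    timeoutSeconds: 60
    failureThreshold: 10
scheduler:
  livenessProbe:
    timeoutSeconds: 60
    failureThreshold: 10
```
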
One thing worth calling out: we never set CPU requests/limits on the API
server, scheduler, or DagProcessor. We got away with that in Airflow 2, but it
matters a lot more now that the API server handles execution traffic too.
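
When we do add them, it will presumably look something like the sketch below. The CPU numbers are placeholders for discussion only, not values we've validated; the memory figures are the limits from the topology table.

```yaml
# Hypothetical sketch: CPU requests we are considering adding.
# CPU numbers are placeholders, not tested values; memory matches the
# limits we already run with.
apiServer:
  resources:
    requests:
      cpu: "2"
      memory: 8Gi
    limits:
      memory: 8Gi
scheduler:
  resources:
    requests:
      cpu: "2"
      memory: 8Gi
    limits:
      memory: 8Gi
```
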
---
## What's going wrong
### 1. API server keeps getting OOMKilled
This is the big one. Under load, the API server pods hit their memory limit and
get killed (exit code 137). We first saw this with just ~50 DAG runs and
150–200 concurrent tasks — nowhere near our production load.
Here's what we're seeing:
- Each Uvicorn worker sits at ~800Mi–1Gi under load, so a 6-worker pod is already at roughly 5–6Gi of its 8Gi limit before any spike
- Memory usage correlates with the number of KubernetesExecutor pods, not UI
traffic
- When execution traffic overwhelms the API server, the UI goes down with it
(503s)
Our best guess: Airflow 3 serves both the Core API (UI, REST) and the Execution
API (task heartbeats, XCom pushes, state transitions) on the same Uvicorn
workers. So when hundreds of worker pods are hammering the API server with
heartbeats and XCom data, it creates memory pressure that takes down everything
— including the UI.
We saw [#58395](https://github.com/apache/airflow/discussions/58395) which
describes something similar (fixed in 3.1.5 via DB query fixes). We're on 3.1.7
and still hitting it — our issue seems more about raw request volume than query
inefficiency.
### 2. PgBouncer is the bottleneck
With 64 Celery workers + hundreds of K8s executor pods + schedulers + API
servers + DagProcessors all going through a **single PgBouncer pod**, the
connection pool gets saturated:
- Liveness probes (`airflow jobs check`) queue up waiting for a DB connection
- Heartbeat writes get delayed 30–60 seconds
- KEDA's PostgreSQL trigger fails with `"connection refused"` when PgBouncer is
overloaded
- The UI reports components as unhealthy because heartbeat timestamps go stale
We've already bumped pool sizes from the defaults (`metadataPoolSize: 10`,
`maxClientConn: 100`) up to `500` / `5000`, but it still saturates at peak.
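
For reference, the PgBouncer side of our values currently looks roughly like this (a sketch; the `extraIni` line carries the `max_prepared_statements` tweak mentioned under "What we've already tried" below, assuming your chart version exposes `pgbouncer.extraIni`):

```yaml
# Sketch of the PgBouncer overrides (chart 1.18.0). Pool sizes match the
# topology table; extraIni availability depends on chart version.
pgbouncer:
  enabled: true
  metadataPoolSize: 500
  maxClientConn: 5000
  resources:
    requests:
      cpu: 1000m
      memory: 512Mi
  extraIni: |
    max_prepared_statements = 100
```
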
**One thing I really want to understand:** with AIP-72 in Airflow 3, are
KubernetesExecutor worker pods still connecting directly to the metadata DB
through PgBouncer? The pod template still includes `SQL_ALCHEMY_CONN` and the
init containers still run `airflow db check`.
[#60271](https://github.com/apache/airflow/issues/60271) seems to track this.
If every K8s executor pod is opening its own PgBouncer connection, that would
explain why our pool is exhausted.
### 3. API server takes forever to start
Each Uvicorn worker independently loads the full Airflow stack — FastAPI
routes, providers, plugins, DAG parsing init, DB connection pools. With 6
workers, startup takes **4+ minutes**. The Helm chart default startup probe
(60s) is nowhere close to enough, and rolling deployments are painfully slow
because of it.
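
The startup probe we ended up with just stretches the failure budget to cover the slow boot; one way to express roughly 5 minutes (the exact periodSeconds/failureThreshold split is arbitrary):

```yaml
# Sketch: extend the API server startup probe to ~5 minutes.
# 30 failures x 10s period = 300s before the container is restarted.
apiServer:
  startupProbe:
    periodSeconds: 10
    failureThreshold: 30
    timeoutSeconds: 20
```
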
### 4. False-positive health check failures
Even with `SCHEDULER_HEALTH_CHECK_THRESHOLD=60`, the UI flags components as
unhealthy during peak load. They're actually fine — they just can't write
heartbeats fast enough because PgBouncer is contended:
```
Triggerer: "Heartbeat recovered after 33.94 seconds"
DagProcessor: "Heartbeat recovered after 29.29 seconds"
```
---
## What we'd like help with
Given our scale (450 DAGs, ~8K concurrent tasks at peak, ~80K task instances per day), any guidance on the following would be great:
1. **Sizing and topology** — What should the API server, scheduler, and worker
setup look like at this scale? How many replicas, how many workers per replica,
and what CPU/memory requests make sense? We've never set CPU requests on
anything and we're starting to think that's a big gap.
2. **PgBouncer** — Is a single replica realistic at this scale, or should we
run multiple? What pool sizes have worked for others? And the big question: do
K8s executor pods still hit the DB directly in 3.1.7, or does everything go
through the Execution API now?
([#60271](https://github.com/apache/airflow/issues/60271))
3. **General lessons learned** — If you've migrated a large-scale self-hosted
Airflow 2 setup to Airflow 3, what do you wish you'd known going in?
---
## What we've already tried
- Bumped API server memory from 3Gi → 8Gi and added a third replica
- Increased PgBouncer pool sizes from defaults to 500/5000, added CPU requests
- Relaxed liveness probes everywhere (timeouts 20s → 60s, thresholds 5 → 10)
- Bumped health check threshold (30 → 60) and heartbeat intervals (2s → 5s)
- Removed `cluster-autoscaler.kubernetes.io/safe-to-evict: "true"` from the API
server (was causing premature eviction)
- Doubled `WORKER_PODS_CREATION_BATCH_SIZE` (16 → 32) and `parallelism` (1024 →
2048)
- Extended API server startup probe to 5 minutes
- Added `max_prepared_statements = 100` to PgBouncer (fixed KEDA prepared
statement errors)
## Airflow 2 vs 3 — what changed
For context, here's a summary of the differences between our Airflow 2
production setup and what we've had to do for Airflow 3. The general trend is
that everything needs more resources and more tolerance for slowness:
| Area                          | Airflow 2.10.0    | Airflow 3.1.7                       | Why                                                                                          |
| ----------------------------- | ----------------- | ----------------------------------- | -------------------------------------------------------------------------------------------- |
| Scheduler memory              | 2–4Gi             | 8Gi                                 | Scheduler is doing more work                                                                   |
| Webserver → API server memory | 3Gi               | 6–8Gi                               | API server is much heavier than the old Flask webserver                                        |
| Worker memory                 | 8Gi               | 12–16Gi                             |                                                                                                |
| Celery concurrency            | 16                | 12–16                               | Reduced in smaller envs                                                                        |
| PgBouncer pools               | 1000 / 500 / 5000 | 100 / 50 / 2000 (base), 500 in prod | Reduced for shared-RDS safety; prod overrides                                                  |
| Parallelism                   | 64–1024           | 192–2048                            | Roughly 2x across all envs                                                                     |
| Scheduler replicas (prod)     | 4                 | 2                                   | KubernetesExecutor race condition [#57618](https://github.com/apache/airflow/issues/57618)     |
| Liveness probe timeouts       | 20s               | 60s                                 | DB contention makes probes slow                                                                |
| API server startup            | ~30s              | ~4 min                              | Uvicorn workers load the full stack sequentially                                               |
| CPU requests                  | Never set         | Still not set                       | Planning to add — probably a big gap                                                           |
---
Happy to share Helm values, logs, or whatever else would help. Would really
appreciate hearing from anyone dealing with similar stuff.
GitHub link: https://github.com/apache/airflow/discussions/62117