GitHub user dnln created a discussion: API server becomes unresponsive when 
viewing UI with many active DAG runs (Airflow 3.x, Kubernetes)

Over the last few weeks I’ve been debugging severe API server instability when 
loading the Airflow UI on Airflow 3.x. I’m running on Kubernetes using version 
`1.18.0` of the Helm chart.

## Problem

I have a DAG that fans out using mapped tasks. At peak:

* 10 active DAG runs
* Each run produces 10–20 mapped tasks

Airflow executes all of this successfully when the UI is not being used.

However, **as soon as I load the UI and start clicking around**, the following 
happens:

1. The UI becomes increasingly slow/unresponsive
2. The API server pods start failing health checks (`/api/v2/version`)
3. The pods transition to `Unhealthy`
4. Kubernetes restarts them
5. Logs show no errors; they simply stop abruptly mid-request
6. CPU and memory of both the API server and scheduler stay well below limits 
during the event

The system remains stable indefinitely as long as I do not load the UI. The 
issue only occurs with UI interaction.

## Workaround (but not acceptable)

If I set:

```
AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG=2
```

…the UI becomes stable again, and the API server stops crashing.

But this throttles my DAG to only 2 parallel runs, while I need the system to 
handle 10 runs in parallel as intended. Having to limit concurrency just so the 
UI can load doesn't seem like it should be necessary.
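
In chart terms the workaround corresponds to something like the following 
`values.yaml` snippet (a sketch, using the chart's top-level `env:` list):

```yaml
# values.yaml (sketch): the chart's top-level `env:` list adds the
# variable to all Airflow containers
env:
  - name: AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG
    value: "2"
```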

## Question

Are there recommended settings to make the API server/UI more resilient with 
many active DAG runs?

## What I’ve Tried (None of These Helped)

### **1. Increased PgBouncer pool sizes**

```yaml
pgbouncer:
  maxClientConn: 500
  metadataPoolSize: 50
  resultBackendPoolSize: 25
```

### **2. Disabled SQLAlchemy pooling**

```
AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_ENABLED=False
AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_PRE_PING=False
```

### **3. Changed health probes**

* Tried both `/api/v2/version` and `/api/v2/monitor/health`
* Made liveness and readiness probes more tolerant: longer timeouts and higher 
failure thresholds (sketched below)
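
Roughly what that looked like in `values.yaml` (a sketch; I'm assuming the 
chart's `apiServer` section exposes the same probe fields it has long exposed 
under `webserver`):

```yaml
# values.yaml (sketch): more tolerant liveness/readiness probes for the
# API server pods; key names assumed to mirror the webserver probe settings
apiServer:
  livenessProbe:
    initialDelaySeconds: 30
    timeoutSeconds: 30
    periodSeconds: 15
    failureThreshold: 10
  readinessProbe:
    initialDelaySeconds: 30
    timeoutSeconds: 30
    periodSeconds: 15
    failureThreshold: 10
```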

### **4. Scaled API server significantly**

* Replicas from **2 → 5**
* Workers from **1 → 2** per replica
* CPU from **1 → 2 cores** per replica
* Memory from **2GB → 4GB**
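
In `values.yaml` terms, roughly (a sketch; I'm assuming the chart's 
`apiServer.replicas` / `apiServer.resources` keys; the per-replica worker 
count was changed separately and isn't shown here):

```yaml
# values.yaml (sketch): scale the API server deployment out and up
apiServer:
  replicas: 5
  resources:
    requests:
      cpu: "2"
      memory: 4Gi
    limits:
      cpu: "2"
      memory: 4Gi
```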

### **5. Scaled scheduler significantly**

* CPU from **1 → 4 cores** per replica
* Memory from **1GB → 4GB**
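
And the scheduler side, roughly:

```yaml
# values.yaml (sketch): give the scheduler more headroom
scheduler:
  resources:
    requests:
      cpu: "4"
      memory: 4Gi
    limits:
      cpu: "4"
      memory: 4Gi
```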

### **6. Upgraded Airflow**

I've reproduced the issue on the following versions of Airflow:

* **3.0.6**
* **3.1.0**
* **3.1.2**
* **3.1.3**

GitHub link: https://github.com/apache/airflow/discussions/58395
