GitHub user dnln created a discussion: API server becomes unresponsive when viewing UI with many active DAG runs (Airflow 3.x, Kubernetes)
Over the last few weeks I’ve been debugging severe API server instability when loading the Airflow UI on Airflow 3.x. I’m running on Kubernetes using version `1.18.0` of the Helm chart.

## Problem

I have a DAG that fans out using mapped tasks. At peak:

* 10 active DAG runs
* Each run produces 10–20 mapped tasks

Airflow executes all of this successfully when the UI is not being used. However, **as soon as I load the UI and start clicking around**, the following happens:

1. The UI becomes increasingly slow/unresponsive
2. The API server pods start failing health checks (`/api/v2/version`)
3. The pods transition to `Unhealthy`
4. Kubernetes restarts them
5. Logs show no errors; they simply stop abruptly mid-request
6. CPU and memory of both the API server and scheduler stay well below limits during the event

The system remains stable indefinitely as long as I do not load the UI. The issue only occurs with UI interaction.

## Workaround (but not acceptable)

If I set:

```
AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG=2
```

…the UI becomes stable again and the API server stops crashing. But this throttles my DAG to only 2 parallel runs, and I’d like the system to handle 10 runs in parallel as intended. Having to limit concurrency just to keep the UI responsive seems unnecessary.

## Question

Are there recommended settings to make the API server/UI more resilient with many active DAG runs?

## What I’ve Tried (None of These Helped)

### **1. Increased PgBouncer pool sizes**

```yaml
pgbouncer:
  maxClientConn: 500
  metadataPoolSize: 50
  resultBackendPoolSize: 25
```

### **2. Disabled SQLAlchemy pooling**

```
AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_ENABLED=False
AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_PRE_PING=False
```

### **3. Changed health probes**

* Tried both `/api/v2/version` and `/api/v2/monitor/health`
* Made liveness and readiness probes more tolerant
* Increased timeouts and failure thresholds (see the probe sketch in the addendum below)

### **4. Scaled API server significantly**

* Replicas from **2 → 5**
* Workers from **1 → 2** per replica
* CPU from **1 → 2 cores** per replica
* Memory from **2GB → 4GB**

### **5. Scaled scheduler significantly**

* CPU from **1 → 4 cores** per replica
* Memory from **1GB → 4GB**

(Both scaling changes are sketched as Helm values in the addendum below.)

### **6. Upgraded Airflow**

I'm seeing this issue on the following versions of Airflow:

* **3.0.6**
* **3.1.0**
* **3.1.2**
* **3.1.3**

GitHub link: https://github.com/apache/airflow/discussions/58395
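## Addendum: config sketches for the items above

For item 3, this is roughly what my probe overrides look like in the chart values. The key names assume the chart exposes `apiServer.livenessProbe`/`readinessProbe` with the standard Kubernetes probe fields (mirroring the older `webserver` probe settings); the numbers are illustrative, not a recommendation:

```yaml
# Sketch of more tolerant API-server probes (Helm values).
# Assumes the chart exposes apiServer.livenessProbe / readinessProbe;
# verify the key names against your chart version's values.yaml.
apiServer:
  livenessProbe:
    initialDelaySeconds: 30
    periodSeconds: 10
    timeoutSeconds: 20    # raised so a slow response isn't counted as a dead pod
    failureThreshold: 10  # tolerate several consecutive failures before a restart
  readinessProbe:
    initialDelaySeconds: 15
    periodSeconds: 10
    timeoutSeconds: 20
    failureThreshold: 10
```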

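For items 4 and 5, here are the scaling changes expressed as Helm values. The replica and resource keys are standard chart values; the `workers` key is my assumption for the knob that feeds `airflow api-server --workers`, so double-check the name in your chart version:

```yaml
# Sketch of the API-server and scheduler scaling (Helm values).
apiServer:
  replicas: 5       # was 2
  workers: 2        # assumed key name; maps to `airflow api-server --workers`
  resources:
    requests:
      cpu: "2"      # was 1 core
      memory: 4Gi   # was 2Gi
    limits:
      cpu: "2"
      memory: 4Gi

scheduler:
  resources:
    requests:
      cpu: "4"      # was 1 core
      memory: 4Gi   # was 1Gi
    limits:
      cpu: "4"
      memory: 4Gi
```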