GitHub user JerryLu991223 created a discussion: Ask for help to solve: WARNING - State of this instance has been externally set to None
# Scheduler Restarts and Task Failures with High Scheduler Replica Count in Large-Scale Deployment
## Summary
We are experiencing frequent scheduler restarts and task terminations (SIGTERM)
in a large-scale Airflow deployment (3500+ running slots). The issue appears to
be related to database connection pressure and lock contention when running
25-30 scheduler replicas. With normal replica counts (3-5), we can only
schedule a few hundred pods, which is insufficient for our workload.
## Apache Airflow version
2.9.3
## Executor
KubernetesExecutor
## Deployment
Kubernetes (using Helm Chart)
## What happened
### Error Log
```
[2026-01-30, 02:55:47 UTC] {local_task_job_runner.py:313} WARNING - State of this instance has been externally set to None. Terminating instance.
[2026-01-30, 02:55:47 UTC] {taskinstance.py:2611} ERROR - Received SIGTERM. Terminating subprocesses.
[2026-01-30, 02:55:48 UTC] {taskinstance.py:2905} ERROR - Task failed with exception
Traceback (most recent call last):
...
  File "/opt/airflow/wevalflow/source_code/providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/operators/pod.py", line 724, in await_pod_completion
    self.pod_manager.fetch_requested_container_logs(
...
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/models/taskinstance.py", line 2613, in signal_handler
    raise AirflowTaskTerminated("Task received SIGTERM signal")
airflow.exceptions.AirflowTaskTerminated: Task received SIGTERM signal
```
### Symptoms
1. **Frequent Scheduler Restarts**: Schedulers restart frequently, likely due to database timeouts or lock contention
2. **Task Terminations**: Running tasks receive SIGTERM and are terminated when schedulers restart
3. **Poor Scheduling Efficiency**:
   - We need 25-30 scheduler replicas to sustain 3500+ running slots
   - With a normal replica count (3-5), only a few hundred pods get scheduled
4. **Database Issues**: Suspected PostgreSQL slow queries and deadlocks correlated with the scheduler restarts (see the diagnostic sketch after this list)
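To check whether the restarts actually coincide with connection pressure and lock waits on the metadata database, something like the sketch below can be run against PostgreSQL. It is a minimal example, assuming `psycopg2` is available and that a hypothetical `AIRFLOW_DB_DSN` environment variable points at the metadata database; neither is part of the deployment description above.

```python
# Minimal diagnostic sketch (assumptions: psycopg2 is installed and AIRFLOW_DB_DSN
# holds a PostgreSQL DSN for the Airflow metadata database).
import os
import psycopg2

QUERIES = {
    # Connections per client application (schedulers, workers, webserver, ...)
    "connections": """
        SELECT application_name, state, count(*) AS sessions
        FROM pg_stat_activity
        GROUP BY application_name, state
        ORDER BY sessions DESC;
    """,
    # Sessions currently waiting on locks, with the query they are blocked on
    "lock_waits": """
        SELECT pid, wait_event_type, wait_event, state, left(query, 120) AS query
        FROM pg_stat_activity
        WHERE wait_event_type = 'Lock';
    """,
}

with psycopg2.connect(os.environ["AIRFLOW_DB_DSN"]) as conn:
    with conn.cursor() as cur:
        for name, sql in QUERIES.items():
            cur.execute(sql)
            print(f"--- {name} ---")
            for row in cur.fetchall():
                print(row)
```

If the session count approaches the RDS `max_connections` limit, or many sessions sit in `Lock` waits while schedulers restart, that supports the connection-pressure theory below.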
## What you think went wrong
### Root Cause Analysis
1. **Database Connection Pool Exhaustion**
   - 25-30 scheduler replicas each issue frequent database queries
   - Direct PostgreSQL connections without connection pooling (PgBouncer disabled)
   - Leads to connection exhaustion, slow queries, and a higher probability of deadlocks
2. **Inefficient Database Queries**
   - `max_tis_per_query: 64` may be too small for large-scale deployments
   - More query rounds are needed to fetch all queued tasks
   - This increases database load and contention
3. **Scheduler Configuration Issues**
   - `parsing_processes: 4` may be insufficient
   - Slow DAG parsing reduces scheduling throughput
   - Resource allocation may not be optimal
4. **Lack of Connection Pooling**
   - PgBouncer is disabled in our configuration
   - Direct connections to PostgreSQL cannot be shared across schedulers
   - Each scheduler holds its own connections, multiplying the total connection count (a hedged configuration sketch follows this list)
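For context, enabling the chart's built-in PgBouncer together with an explicit per-process SQLAlchemy pool would look roughly like the sketch below. The numbers are placeholders we have not validated, to be sized against the RDS `max_connections` limit rather than copied verbatim.

```yaml
# Illustrative values.yaml sketch for the official Airflow Helm chart.
# Pool sizes are placeholders; they need to be sized against the RDS
# max_connections limit and the total number of Airflow processes.
pgbouncer:
  enabled: true
  maxClientConn: 1500        # client connections PgBouncer will accept
  metadataPoolSize: 100      # server-side connections to the metadata DB
  resultBackendPoolSize: 10

config:
  database:
    # Per-process SQLAlchemy pool; keep (pool_size + max_overflow) x processes
    # below what PgBouncer / PostgreSQL can actually serve.
    sql_alchemy_pool_size: 5
    sql_alchemy_max_overflow: 10
    sql_alchemy_pool_recycle: 1800
```

The intent is to bound the number of server-side PostgreSQL connections behind PgBouncer while each scheduler keeps only a small local pool.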
## How to reproduce
### Current Configuration
```yaml
scheduler:
  replicas: 20  # Need 25-30 to maintain 3500+ running slots
config:
  core:
    parallelism: 8192
    max_active_tasks_per_dag: 8192
  scheduler:
    max_tis_per_query: 256
    parsing_processes: 4
    task_queued_timeout: 31536000.0
```
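One inconsistency worth flagging: the analysis above refers to `max_tis_per_query: 64`, while this block sets 256. To confirm which values the running schedulers actually picked up, the settings can be read back through Airflow's configuration object from inside a scheduler pod; a small sketch (the option list below is ours, not exhaustive):

```python
# Sketch: print the effective settings from inside a scheduler pod, so env vars
# and airflow.cfg overrides applied by the chart are reflected.
from airflow.configuration import conf

for section, key in [
    ("core", "parallelism"),
    ("scheduler", "max_tis_per_query"),
    ("scheduler", "parsing_processes"),
    ("database", "sql_alchemy_pool_size"),
    ("database", "sql_alchemy_max_overflow"),
]:
    print(f"{section}.{key} = {conf.get(section, key)}")
```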
## Operating System
Linux (Kubernetes)
## Versions of Apache Airflow Providers
- apache-airflow-providers-cncf-kubernetes (custom build based on 2.9.3)
## Deployment details
- **Kubernetes Version**: Not specified
- **Helm Chart**: Apache Airflow Helm Chart
- **Database**: PostgreSQL (RDS on Alibaba Cloud)
- **Executor**: KubernetesExecutor
- **Running Slots Target**: 3500+
- **Scheduler Replicas**: 20-30 (required to meet target)
- **Normal Replica Count**: 3-5 (only schedules a few hundred pods)
GitHub link: https://github.com/apache/airflow/discussions/61240