GitHub user JerryLu991223 created a discussion: Ask for help to solve: WARNING - State of this instance has been externally set to None
# Scheduler Restarts and Task Failures with High Scheduler Replica Count in Large-Scale Deployment
## Summary
We are experiencing frequent scheduler restarts and task terminations (SIGTERM)
in a large-scale Airflow deployment (3500+ running slots). The issue appears to
be related to database connection pressure and lock contention when running
25-30 scheduler replicas. With normal replica counts (3-5), we can only
schedule a few hundred pods, which is insufficient for our workload.
## Apache Airflow version
2.9.3
## Executor
KubernetesExecutor
## Deployment
Kubernetes (using Helm Chart)
## What happened
### Error Log
```
[2026-01-30, 02:55:47 UTC] {local_task_job_runner.py:313} WARNING - State of this instance has been externally set to None. Terminating instance.
[2026-01-30, 02:55:47 UTC] {taskinstance.py:2611} ERROR - Received SIGTERM. Terminating subprocesses.
[2026-01-30, 02:55:48 UTC] {taskinstance.py:2905} ERROR - Task failed with exception
Traceback (most recent call last):
...
  File "/opt/airflow/wevalflow/source_code/providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/operators/pod.py", line 724, in await_pod_completion
    self.pod_manager.fetch_requested_container_logs(
...
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/models/taskinstance.py", line 2613, in signal_handler
    raise AirflowTaskTerminated("Task received SIGTERM signal")
airflow.exceptions.AirflowTaskTerminated: Task received SIGTERM signal
```
### Symptoms
1. **Frequent Scheduler Restarts**: Schedulers restart frequently, likely due to database timeouts or lock contention
2. **Task Terminations**: Running tasks receive SIGTERM and are terminated when schedulers restart
3. **Poor Scheduling Efficiency**:
   - We need 25-30 scheduler replicas to sustain 3500+ running slots
   - With a normal replica count (3-5), only a few hundred pods get scheduled
4. **Database Issues**: Suspected PostgreSQL slow queries and deadlocks correlated with the scheduler restarts (see the diagnostic sketch after this list)
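To check whether the restarts actually coincide with connection pressure and lock waits on the metadata database, something like the sketch below can be run against PostgreSQL. It is a minimal example, assuming `psycopg2` is available and that a hypothetical `AIRFLOW_DB_DSN` environment variable points at the metadata database; neither is part of the deployment description above.

```python
# Minimal diagnostic sketch (assumptions: psycopg2 is installed and AIRFLOW_DB_DSN
# holds a PostgreSQL DSN for the Airflow metadata database).
import os
import psycopg2

QUERIES = {
    # Connections per client application (schedulers, workers, webserver, ...)
    "connections": """
        SELECT application_name, state, count(*) AS sessions
        FROM pg_stat_activity
        GROUP BY application_name, state
        ORDER BY sessions DESC;
    """,
    # Sessions currently waiting on locks, with the query they are blocked on
    "lock_waits": """
        SELECT pid, wait_event_type, wait_event, state, left(query, 120) AS query
        FROM pg_stat_activity
        WHERE wait_event_type = 'Lock';
    """,
}

with psycopg2.connect(os.environ["AIRFLOW_DB_DSN"]) as conn:
    with conn.cursor() as cur:
        for name, sql in QUERIES.items():
            cur.execute(sql)
            print(f"--- {name} ---")
            for row in cur.fetchall():
                print(row)
```

If the session count approaches the RDS `max_connections` limit, or many sessions sit in `Lock` waits while schedulers restart, that supports the connection-pressure theory below.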
## What you think went wrong
### Root Cause Analysis
1. **Database Connection Pool Exhaustion**
   - 25-30 scheduler replicas each issue frequent database queries
   - Direct PostgreSQL connections without connection pooling (PgBouncer disabled)
   - Leads to connection exhaustion, slow queries, and a higher probability of deadlocks
2. **Inefficient Database Queries**
   - `max_tis_per_query: 64` may be too small for large-scale deployments
   - More query rounds are needed to fetch all queued tasks
   - This increases database load and contention
3. **Scheduler Configuration Issues**
   - `parsing_processes: 4` may be insufficient
   - Slow DAG parsing reduces scheduling throughput
   - Resource allocation may not be optimal
4. **Lack of Connection Pooling**
   - PgBouncer is disabled in our configuration
   - Direct connections to PostgreSQL cannot be shared across schedulers
   - Each scheduler holds its own connections, multiplying the total connection count (a hedged configuration sketch follows this list)
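For context, enabling the chart's built-in PgBouncer together with an explicit per-process SQLAlchemy pool would look roughly like the sketch below. The numbers are placeholders we have not validated, to be sized against the RDS `max_connections` limit rather than copied verbatim.

```yaml
# Illustrative values.yaml sketch for the official Airflow Helm chart.
# Pool sizes are placeholders; they need to be sized against the RDS
# max_connections limit and the total number of Airflow processes.
pgbouncer:
  enabled: true
  maxClientConn: 1500        # client connections PgBouncer will accept
  metadataPoolSize: 100      # server-side connections to the metadata DB
  resultBackendPoolSize: 10

config:
  database:
    # Per-process SQLAlchemy pool; keep (pool_size + max_overflow) x processes
    # below what PgBouncer / PostgreSQL can actually serve.
    sql_alchemy_pool_size: 5
    sql_alchemy_max_overflow: 10
    sql_alchemy_pool_recycle: 1800
```

The intent is to bound the number of server-side PostgreSQL connections behind PgBouncer while each scheduler keeps only a small local pool.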
## How to reproduce
### Current Configuration
```yaml
scheduler:
  replicas: 20  # Need 25-30 to maintain 3500+ running slots
config:
  core:
    parallelism: 8192
    max_active_tasks_per_dag: 8192
  scheduler:
    max_tis_per_query: 256
    parsing_processes: 4
    task_queued_timeout: 31536000.0
```
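One inconsistency worth flagging: the analysis above refers to `max_tis_per_query: 64`, while this block sets 256. To confirm which values the running schedulers actually picked up, the settings can be read back through Airflow's configuration object from inside a scheduler pod; a small sketch (the option list below is ours, not exhaustive):

```python
# Sketch: print the effective settings from inside a scheduler pod, so env vars
# and airflow.cfg overrides applied by the chart are reflected.
from airflow.configuration import conf

for section, key in [
    ("core", "parallelism"),
    ("scheduler", "max_tis_per_query"),
    ("scheduler", "parsing_processes"),
    ("database", "sql_alchemy_pool_size"),
    ("database", "sql_alchemy_max_overflow"),
]:
    print(f"{section}.{key} = {conf.get(section, key)}")
```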
## Operating System
Linux (Kubernetes)
## Versions of Apache Airflow Providers
- apache-airflow-providers-cncf-kubernetes (custom build based on 2.9.3)
## Deployment details
- **Kubernetes Version**: Not specified
- **Helm Chart**: Apache Airflow Helm Chart
- **Database**: PostgreSQL (RDS on Alibaba Cloud)
- **Executor**: KubernetesExecutor
- **Running Slots Target**: 3500+
- **Scheduler Replicas**: 20-30 (required to meet target)
- **Normal Replica Count**: 3-5 (only schedules a few hundred pods)
GitHub link: https://github.com/apache/airflow/discussions/61240