GitHub user JinRenNg closed a discussion: Kubernetes scheduler liveness probe fails during OOM conditions, preventing automatic restart
### Incident Timeline
1. Gradual memory increase: Scheduler memory consumption grows over days/weeks
(known memory leak behavior)
2. Memory pressure threshold: Memory usage approaches but stays below K8s limits
3. Probe execution failure: Liveness probe exec commands begin failing with
memory allocation errors
4. Silent degradation: Scheduler becomes unresponsive but Kubernetes shows
"Running" status
5. Manual intervention: Platform team must manually delete pods to restore
service
### Log Patterns Observed
```
# Kubernetes events
Warning Unhealthy pod/scheduler-xxx Liveness probe failed:
rpc error: code = Unknown desc = failed to exec in container
# Container logs show scheduler still running
[2025-07-23 09:02:30] airflow.jobs.scheduler_job - INFO - Scheduler heartbeat
[2025-07-23 09:02:45] <probe exec fails but no log in container>
# Manual exec also fails
$ kubectl exec scheduler-xxx -- airflow jobs check
OCI runtime exec failed: exec failed: unable to start container process:
error executing setns process: exit status 1: fork/exec: cannot allocate memory
```
### Monitoring Data Patterns
- Memory usage: 85-95% of container limits
- Process count: normal (the scheduler process is still running)
- CPU usage: normal or slightly elevated
- Liveness probe success rate: drops to 0% while the scheduler process remains active
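A sketch of how this condition could be surfaced before probes start failing: a Prometheus alert on working-set memory relative to the container limit. This assumes cAdvisor and kube-state-metrics are being scraped; the `airflow-scheduler.*` pod pattern, the 85% threshold, and the rule/alert names are placeholders to adapt.
```
# yaml
# Hypothetical PrometheusRule fragment; the metric names come from cAdvisor and
# kube-state-metrics, everything else (names, labels, threshold) is an example.
groups:
  - name: airflow-scheduler-memory
    rules:
      - alert: AirflowSchedulerNearMemoryLimit
        expr: |
          max by (pod) (container_memory_working_set_bytes{pod=~"airflow-scheduler.*", container!=""})
            /
          max by (pod) (kube_pod_container_resource_limits{pod=~"airflow-scheduler.*", resource="memory"})
            > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Airflow scheduler pod is above 85% of its memory limit"
```
Catching the 85-95% band early gives the platform team a window to recycle the scheduler before the exec probes stop working.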
### Current Behavior
1. Memory leak progression: Scheduler memory usage gradually increases over
time due to known Airflow memory leaks (see related issues #11365, #14924,
#28740)
2. Memory pressure threshold: Pod reaches high memory usage from processing
large DAGs, task queuing, or prolonged operation
3. Below-limit memory exhaustion: Memory usage stays below Kubernetes limits
(e.g., using 3.5GB of 4GB limit) but approaches system allocation limits
4. Process spawn failure: The container's main process keeps running, but the kernel
can no longer fork new processes inside the container due to memory
fragmentation/pressure
5. Liveness probe exec command fails with `OCI runtime exec failed: cannot allocate memory`
6. Kubernetes logs probe failures but doesn't restart pod (process still
running)
7. Scheduler becomes unresponsive to new DAG runs but appears "healthy" to
Kubernetes
8. Manual pod deletion required to restore scheduler functionality
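To make the mechanism concrete, here is an illustrative pod spec fragment (resource and timing values are examples only; the probe command mirrors the exec-based probe shown under Proposed Solutions below). Each probe invocation has to enter the container's namespaces and fork a new `airflow` CLI process, and that fork is exactly the allocation that fails while the long-lived scheduler process keeps running.
```
# yaml
# Illustrative only; not a recommended configuration
containers:
  - name: scheduler
    resources:
      limits:
        memory: 4Gi   # usage sits around 3.5Gi: below the limit, so no OOMKill
    livenessProbe:
      exec:
        # each probe run forks/execs a full Airflow CLI process inside the
        # already memory-pressured container, which fails with ENOMEM
        command: ["airflow", "jobs", "check", "--job-type", "SchedulerJob",
                  "--hostname", "$(HOSTNAME)"]
      initialDelaySeconds: 60
      periodSeconds: 60
      timeoutSeconds: 20
      failureThreshold: 5
```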
### Expected Behavior
Kubernetes should automatically detect unhealthy scheduler pods experiencing
OOM conditions and restart them without manual intervention, ensuring high
availability and operational reliability.
### Impact
- Production Reliability: Scheduler outages require manual intervention,
increasing MTTR
- Memory Leak Amplification: Existing memory leaks become operationally
critical due to probe failure
- Operational Overhead: Platform teams must monitor and manually restart stuck
schedulers
- Data Pipeline Availability: DAG execution halts when scheduler becomes
unresponsive
- Kubernetes Best Practices: Current implementation doesn't leverage K8s
self-healing capabilities
- Silent Failures: Unlike OOMKilled events, these failures don't generate clear
Kubernetes events
### Proposed Solutions
#### 1. HTTP Health Endpoint for Liveness Probes
Add an optional lightweight HTTP health endpoint to the scheduler:
```
# python
# Sketch for airflow/jobs/scheduler_job.py (illustrative, not a final API)
import threading
import time

from flask import Flask, jsonify

class SchedulerHealthServer:
    def __init__(self, port=8080):
        self.port = port
        self.app = Flask(__name__)
        self.app.add_url_rule('/health', 'health', self.health_check)
        self.last_heartbeat = time.time()

    def heartbeat(self):
        # Called by the scheduler's main loop on every heartbeat
        self.last_heartbeat = time.time()

    def health_check(self):
        # Lightweight check that doesn't require heavy operations
        if time.time() - self.last_heartbeat < 30:
            return jsonify({"status": "healthy"}), 200
        return jsonify({"status": "unhealthy"}), 503

    def start(self):
        # Serve /health in a daemon thread so the scheduler loop is unaffected
        threading.Thread(
            target=lambda: self.app.run(host="0.0.0.0", port=self.port),
            daemon=True,
        ).start()
```
Helm Chart Configuration:
```
# yaml
scheduler:
  livenessProbe:
    httpGet:
      path: /health
      port: 8080
    initialDelaySeconds: 60
    periodSeconds: 30
    timeoutSeconds: 5
    failureThreshold: 3
```
#### 2. Enhanced Probe Configuration Options
Provide alternative probe configurations in the official Helm chart:
```
# yaml
# values.yaml
scheduler:
  # Current exec-based probe (default)
  livenessProbe:
    enabled: true
    type: "exec"  # or "http"
    # Exec probe configuration
    exec:
      command: ["airflow", "jobs", "check", "--job-type", "SchedulerJob",
                "--hostname", "$(HOSTNAME)"]
    # HTTP probe configuration (alternative)
    httpGet:
      path: /health
      port: 8080
    # Conservative timing for OOM scenarios
    initialDelaySeconds: 120
    periodSeconds: 60
    timeoutSeconds: 30
    failureThreshold: 3
```
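For completeness, here is a hypothetical fragment of the scheduler deployment template showing how the chart could branch on the probe type; the key names simply follow the values sketch above, and the real chart template would differ.
```
# yaml
{{- if .Values.scheduler.livenessProbe.enabled }}
livenessProbe:
  {{- if eq .Values.scheduler.livenessProbe.type "http" }}
  httpGet:
    path: {{ .Values.scheduler.livenessProbe.httpGet.path }}
    port: {{ .Values.scheduler.livenessProbe.httpGet.port }}
  {{- else }}
  exec:
    command:
      {{- toYaml .Values.scheduler.livenessProbe.exec.command | nindent 6 }}
  {{- end }}
  initialDelaySeconds: {{ .Values.scheduler.livenessProbe.initialDelaySeconds }}
  periodSeconds: {{ .Values.scheduler.livenessProbe.periodSeconds }}
  timeoutSeconds: {{ .Values.scheduler.livenessProbe.timeoutSeconds }}
  failureThreshold: {{ .Values.scheduler.livenessProbe.failureThreshold }}
{{- end }}
```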
### Supporting Evidence
#### Error Log Examples
```
Event: Liveness probe errored: rpc error: code = Unknown desc = failed to exec
in container:
failed to start exec
"cdbf4a3f7f1f9fabd3b5022ea399f6dbf94daed74bb8c374586a1514898eb170":
OCI runtime exec failed: exec failed: unable to start container process:
error executing setns process: exit status 1: unknown
```
#### Related Issues Research
Based on a review of existing issues in this repository:
- Issue #11365: Scheduler OOM crashes documented but no liveness probe
correlation
- Issue #20644: Liveness probe failures resolved with timeout increases
- Issue #41869: Scheduler blocking on DAG processor causing probe failures
- No existing issues specifically address this OOM + liveness probe combination
#### Backward Compatibility
All proposed solutions maintain backward compatibility:
- Default behavior remains exec-based probe
- HTTP endpoint is optional (disabled by default)
- Enhanced configurations are opt-in
- No breaking changes to existing deployments
#### Implementation Priority
**High Priority** - This affects production reliability and requires manual
intervention, violating Kubernetes self-healing principles. The solution
addresses a fundamental operational gap in the current architecture.
#### Alternative Workarounds
Current mitigation strategies:
1. Resource Limits: Tighten memory limits so the leaking scheduler is OOMKilled and
restarted, instead of hovering just below the limit (bypasses the probe issue; see
the sketch below)
2. External Monitoring: Deploy sidecar pods to monitor scheduler health
3. Conservative Probe Settings: Increase timeouts (doesn't solve root cause)
4. Manual Monitoring: Platform team monitoring and manual restarts
However, these workarounds don't address the fundamental design issue and
require additional operational overhead.
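As a sketch of workaround 1, the fragment below assumes the chart's `scheduler.resources` values and deliberately keeps the memory limit close to the scheduler's actual footprint, so the leak eventually triggers OOMKilled and the kubelet restarts the container rather than usage stalling just below a larger limit where exec probes fail. The numbers are examples only.
```
# yaml
scheduler:
  resources:
    requests:
      memory: 3Gi
    limits:
      # tight enough that the leak hits the limit and the pod is OOMKilled
      # (and restarted), rather than stalling at ~85-95% of a larger limit
      memory: 3Gi
```
Setting the request equal to the limit also gives the pod Guaranteed QoS, which keeps eviction behavior predictable, but this remains a blunt instrument compared to a probe that actually reflects scheduler health.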
**Would the maintainers be open to a PR implementing the HTTP health endpoint
approach?** This seems like the most robust solution that follows Kubernetes
best practices while maintaining backward compatibility.
GitHub link: https://github.com/apache/airflow/discussions/53662