GitHub user JinRenNg closed a discussion: Kubernetes scheduler liveness probe fails during OOM conditions, preventing automatic restart

### Incident Timeline
1. Gradual memory increase: Scheduler memory consumption grows over days/weeks 
(known memory leak behavior)
2. Memory pressure threshold: Memory usage approaches but stays below K8s limits
3. Probe execution failure: Liveness probe exec commands begin failing with 
memory allocation errors
4. Silent degradation: Scheduler becomes unresponsive but Kubernetes shows 
"Running" status
5. Manual intervention: Platform team must manually delete pods to restore 
service
### Log Patterns Observed
```
# Kubernetes events
Warning  Unhealthy  pod/scheduler-xxx  Liveness probe failed: 
  rpc error: code = Unknown desc = failed to exec in container

# Container logs show scheduler still running
[2025-07-23 09:02:30] airflow.jobs.scheduler_job - INFO - Scheduler heartbeat
[2025-07-23 09:02:45] <probe exec fails but no log in container>

# Manual exec also fails
$ kubectl exec scheduler-xxx -- airflow jobs check
OCI runtime exec failed: exec failed: unable to start container process: 
error executing setns process: exit status 1: fork/exec: cannot allocate memory
```

### Monitoring Data Patterns

- Memory usage: 85-95% of container limits
- Process count: normal (scheduler process still running)
- CPU usage: normal or slightly elevated
- Liveness probe success rate: drops to 0% while the scheduler process remains active
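
A quick way to confirm this pattern in a live cluster (namespace and label 
selector are illustrative and depend on your deployment; `kubectl top` assumes 
metrics-server is installed) is to watch working-set memory against the limit 
while tailing probe events:

```bash
# Illustrative commands; adjust namespace/labels to match your deployment
kubectl top pod -n airflow -l component=scheduler
kubectl get events -n airflow --field-selector reason=Unhealthy --watch
```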

### Current Behavior

1. Memory leak progression: Scheduler memory usage gradually increases over 
time due to known Airflow memory leaks (see related issues #11365, #14924, 
#28740)
2. Memory pressure threshold: Pod reaches high memory usage from processing 
large DAGs, task queuing, or prolonged operation
3. Below-limit memory exhaustion: Memory usage stays below Kubernetes limits 
(e.g., using 3.5GB of 4GB limit) but approaches system allocation limits
4. Container process continues running, but the system cannot spawn new 
processes due to memory fragmentation/pressure (see the rough sketch after 
this list)
5. Liveness probe exec command fails with `OCI runtime exec failed: cannot 
allocate memory`
6. Kubernetes logs the probe failures but does not restart the pod (the main 
process is still running)
7. Scheduler becomes unresponsive to new DAG runs but appears "healthy" to 
Kubernetes
8. Manual pod deletion required to restore scheduler functionality
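
The process-spawn failure in steps 4-5 can be illustrated with the rough 
sketch below. It only loosely mimics the real failure (which happens in the 
OCI runtime's setns/exec helper, charged to the same cgroup), and whether the 
spawn actually fails depends on kernel overcommit and cgroup settings; the 
sizes mirror the 3.5GB-of-4GB example above.

```python
# Rough, environment-dependent reproduction sketch: inside a container with a
# 4Gi memory limit, hold ~3.5Gi in the main process and then try to spawn a
# child process, which is what an exec-based liveness probe has to do.
import subprocess

ballast = bytearray(3_500 * 1024 * 1024)  # keep ~3.5Gi resident (illustrative)

try:
    subprocess.run(["true"], check=True)
    print("spawn succeeded")
except OSError as err:
    # Under memory pressure this surfaces as "cannot allocate memory",
    # the same condition behind the OCI runtime's fork/exec failure.
    print(f"spawn failed: {err}")
```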

### Expected Behavior
Kubernetes should automatically detect unhealthy scheduler pods experiencing 
OOM conditions and restart them without manual intervention, ensuring high 
availability and operational reliability.
### Impact

- Production Reliability: Scheduler outages require manual intervention, 
increasing MTTR
- Memory Leak Amplification: Existing memory leaks become operationally 
critical due to probe failure
- Operational Overhead: Platform teams must monitor and manually restart stuck 
schedulers
- Data Pipeline Availability: DAG execution halts when scheduler becomes 
unresponsive
- Kubernetes Best Practices: Current implementation doesn't leverage K8s 
self-healing capabilities
- Silent Failures: Unlike OOMKilled events, these failures don't generate clear 
Kubernetes events

### Proposed Solutions
#### 1. HTTP Health Endpoint for Liveness Probes
Add an optional lightweight HTTP health endpoint to the scheduler:
```python
# In airflow/jobs/scheduler_job.py
import threading
import time

from flask import Flask, jsonify


class SchedulerHealthServer:
    def __init__(self, port=8080):
        self.port = port
        self.app = Flask(__name__)
        self.app.add_url_rule('/health', 'health', self.health_check)
        self.last_heartbeat = time.time()

    def health_check(self):
        # Lightweight check that doesn't require heavy scheduler operations
        if time.time() - self.last_heartbeat < 30:
            return jsonify({"status": "healthy"}), 200
        return jsonify({"status": "unhealthy"}), 503

    def start(self):
        # Serve from a daemon thread so the scheduler loop is not blocked
        threading.Thread(
            target=lambda: self.app.run(host="0.0.0.0", port=self.port),
            daemon=True,
        ).start()
```
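
A minimal integration sketch (hypothetical, not existing Airflow code): start 
the server when the scheduler boots and refresh `last_heartbeat` wherever the 
scheduler already emits its heartbeat, so the endpoint reflects actual loop 
liveness rather than mere process existence.

```python
# Hypothetical wiring sketch, assuming SchedulerHealthServer from above
import time

health_server = SchedulerHealthServer(port=8080)
health_server.start()

def on_scheduler_heartbeat():
    # call this from the existing heartbeat path
    health_server.last_heartbeat = time.time()
```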

Helm Chart Configuration:
```yaml
scheduler:
  livenessProbe:
    httpGet:
      path: /health
      port: 8080
    initialDelaySeconds: 60
    periodSeconds: 30
    timeoutSeconds: 5
    failureThreshold: 3
```

#### 2. Enhanced Probe Configuration Options
Provide alternative probe configurations in the official Helm chart:
```yaml
# values.yaml
scheduler:
  # Current exec-based probe (default)
  livenessProbe:
    enabled: true
    type: "exec"  # or "http"
    
    # Exec probe configuration
    exec:
      command: ["airflow", "jobs", "check", "--job-type", "SchedulerJob", 
"--hostname", "$(HOSTNAME)"]
    
    # HTTP probe configuration (alternative)
    httpGet:
      path: /health
      port: 8080
      
    # Conservative timing for OOM scenarios
    initialDelaySeconds: 120
    periodSeconds: 60
    timeoutSeconds: 30
    failureThreshold: 3
```
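
For illustration, a hypothetical template snippet (not the chart's actual 
template) showing how the `type` switch above could be consumed to render 
either probe:

```yaml
# Hypothetical Helm template sketch; the real chart template may differ
livenessProbe:
  {{- if eq .Values.scheduler.livenessProbe.type "http" }}
  httpGet:
    path: {{ .Values.scheduler.livenessProbe.httpGet.path }}
    port: {{ .Values.scheduler.livenessProbe.httpGet.port }}
  {{- else }}
  exec:
    command: {{ .Values.scheduler.livenessProbe.exec.command | toJson }}
  {{- end }}
  initialDelaySeconds: {{ .Values.scheduler.livenessProbe.initialDelaySeconds }}
  periodSeconds: {{ .Values.scheduler.livenessProbe.periodSeconds }}
  timeoutSeconds: {{ .Values.scheduler.livenessProbe.timeoutSeconds }}
  failureThreshold: {{ .Values.scheduler.livenessProbe.failureThreshold }}
```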
### Supporting Evidence
#### Error Log Examples
```
Event: Liveness probe errored: rpc error: code = Unknown desc = failed to exec 
in container: 
failed to start exec 
"cdbf4a3f7f1f9fabd3b5022ea399f6dbf94daed74bb8c374586a1514898eb170": 
OCI runtime exec failed: exec failed: unable to start container process: 
error executing setns process: exit status 1: unknown
```
#### Related Issues Research
Based on comprehensive GitHub repository analysis:

- Issue #11365: Scheduler OOM crashes documented but no liveness probe 
correlation
- Issue #20644: Liveness probe failures resolved with timeout increases
- Issue #41869: Scheduler blocking on DAG processor causing probe failures
- No existing issues specifically address this OOM + liveness probe combination

#### Backward Compatibility
All proposed solutions maintain backward compatibility:

- Default behavior remains exec-based probe
- HTTP endpoint is optional (disabled by default)
- Enhanced configurations are opt-in
- No breaking changes to existing deployments

#### Implementation Priority
**High Priority** - This affects production reliability and requires manual 
intervention, violating Kubernetes self-healing principles. The solution 
addresses a fundamental operational gap in the current architecture.
#### Alternative Workarounds
Current mitigation strategies:
1. Resource Limits: Tune memory limits so the leak actually hits the limit and 
triggers OOMKilled (bypasses the probe issue; see the sketch after this list)
2. External Monitoring: Deploy sidecar pods to monitor scheduler health
3. Conservative Probe Settings: Increase timeouts (doesn't solve root cause)
4. Manual Monitoring: Platform team monitoring and manual restarts
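
As a sketch of workaround 1 (sizes are illustrative; assumes the chart's 
`scheduler.resources` value), lowering the limit so the leak actually hits it 
turns the silent probe failure into an OOMKilled event and an automatic 
restart:

```yaml
# Illustrative sizes; choose a limit below the point where exec starts failing
scheduler:
  resources:
    requests:
      memory: "2Gi"
    limits:
      memory: "3Gi"  # leak hits this limit -> OOMKilled -> kubelet restarts the container
```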

However, these workarounds don't address the fundamental design issue and 
require additional operational overhead.

**Would the maintainers be open to a PR implementing the HTTP health endpoint 
approach?** This seems like the most robust solution that follows Kubernetes 
best practices while maintaining backward compatibility.

GitHub link: https://github.com/apache/airflow/discussions/53662
