Re: [D] Kubernetes scheduler liveness probe fails during OOM conditions preventing automatic restart [airflow]

via GitHub Thu, 24 Jul 2025 00:27:05 -0700


GitHub user JinRenNg edited a comment on the discussion: Kubernetes scheduler 
liveness probe fails during OOM conditions preventing automatic restart


Thank you for the thoughtful response and questions! Let me address each point 
with specific technical details:
## Deployment Context
**Environment:**

- Kubernetes: v1.27.16 and v1.25.16
- Airflow: 2.10.5 and 2.10.3 via official Apache Helm chart
- Executor: KubernetesExecutor
- Scale: ~60 Active DAGs

## Scope of Proposed Changes
I should clarify - I'm specifically proposing this for the scheduler component 
only, not all Airflow components. Here's why:
### Scheduler-Specific Problem

- Memory pressure impact: Scheduler is most affected by memory leaks and 
prolonged operation
- Critical path: Scheduler failure stops entire pipeline execution
- Current gap: exec probes fail during memory pressure, but scheduler process 
continues running
- Kubernetes integration: Scheduler is the component most likely to be 
auto-restarted by K8s

### Other Components

- Workers: Already handle OOM well (OOMKilled → restart)
- Webserver: Already runs an HTTP server that could be reused (Flask-based in Airflow 2.x; FastAPI in the Airflow 3 API server)
- Triggerer: Less memory-intensive, fewer reported issues

### Technical Implementation Approach
Based on your FastAPI concern, I'd like to propose a lightweight approach that 
avoids heavy dependencies:

Note: I should mention that I'm not deeply familiar with the internal structure 
of the Airflow source code, so the following represents my rough understanding 
of how this could be implemented. I'd very much appreciate guidance from the 
maintainers on the best architectural approach and would be happy to adjust the 
implementation based on your recommendations.
#### Lightweight Built-in Health Server
Please note: This is an initial design sketch. Feedback on whether it fits 
Airflow's design patterns and coding standards is very welcome.
```
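# Minimal HTTP server, independent of FastAPI (standard library only)
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

HEARTBEAT_MAX_AGE = 60  # seconds without a heartbeat before reporting unhealthy


class SchedulerHealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != '/health':
            # Answer unknown paths explicitly instead of sending no response
            self.send_error(404)
            return
        # Lightweight check: only verify that the scheduler heartbeat is recent
        age = time.monotonic() - self.server.last_heartbeat
        if age < HEARTBEAT_MAX_AGE:
            self._respond(200, b'healthy')
        else:
            self._respond(503, b'unhealthy')

    def _respond(self, status, body):
        self.send_response(status)
        self.send_header('Content-Type', 'text/plain')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Suppress per-request logging so frequent probes do not flood stderr
        pass


class SchedulerHealthServer:
    def __init__(self, port=8080):
        # '' binds all interfaces so the kubelet can reach the pod IP
        self.server = HTTPServer(('', port), SchedulerHealthHandler)
        self.server.last_heartbeat = time.monotonic()

    def start(self):
        # Daemon thread: it never blocks scheduler shutdown
        thread = threading.Thread(target=self.server.serve_forever, daemon=True)
        thread.start()

    def update_heartbeat(self):
        # Called from the scheduler loop; monotonic time ignores wall-clock jumps
        self.server.last_heartbeat = time.monotonic()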
```
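
To make the intended behaviour concrete, here is a rough, self-contained demo of how a probe would see this endpoint. It is only a sketch: `HealthHandler`, `probe` and `run_demo` are hypothetical names, the handler is a compact stand-in for the one above, and the server binds to `127.0.0.1` on an OS-chosen port purely so the demo can run anywhere. Kubernetes `httpGet` probes treat any status from 200 to 399 as success, so the 503 returned for a stale heartbeat would count as a failed probe.

```python
import http.client
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

MAX_HEARTBEAT_AGE = 60  # seconds; same threshold as the handler above


class HealthHandler(BaseHTTPRequestHandler):
    """Compact stand-in for the scheduler health handler (demo only)."""

    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        age = time.monotonic() - self.server.last_heartbeat
        healthy = age < MAX_HEARTBEAT_AGE
        body = b"healthy" if healthy else b"unhealthy"
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def probe(port, timeout=1.0):
    """Return the HTTP status of GET /health on the local demo server."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=timeout)
    try:
        conn.request("GET", "/health")
        return conn.getresponse().status
    finally:
        conn.close()


def run_demo():
    # Port 0 asks the OS for any free port, so the demo cannot collide with 8080.
    server = HTTPServer(("127.0.0.1", 0), HealthHandler)
    server.last_heartbeat = time.monotonic()
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_address[1]
    try:
        fresh = probe(port)
        # Simulate a stalled scheduler loop by ageing the heartbeat past the limit.
        server.last_heartbeat -= MAX_HEARTBEAT_AGE + 1
        stale = probe(port)
        # A loop that recovers would refresh the heartbeat again.
        server.last_heartbeat = time.monotonic()
        recovered = probe(port)
    finally:
        server.shutdown()
        server.server_close()
    return fresh, stale, recovered
```

In a real deployment the scheduler loop would refresh the heartbeat on every iteration, and the liveness probe would be switched from `exec` to `httpGet` against `/health`. Where exactly that call belongs in the scheduler code is something I would need maintainer guidance on.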
### Performance and Security Implications
Note: The following analysis is based on theoretical assumptions and design 
expectations. I haven't conducted actual performance testing or security 
analysis yet, so these estimates should be validated through proper testing and 
community review.
#### Performance Impact (Estimated)

- Memory: ~1MB overhead for HTTP server thread (theoretical estimate)
- CPU: Negligible (~0.01% during probe checks, estimated)
- Network: Single local port binding
- Scheduler: Zero impact on DAG processing (assumption based on daemon thread 
design)
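
As a first step towards validating the memory estimate, something like the following could be used. This is a rough sketch with hypothetical names (`NoopHealthHandler`, `measure_server_allocations`), and it only counts Python-level allocations seen by `tracemalloc`. Native thread stacks and interpreter overhead are not included, so the pod's RSS would still need to be measured in a real deployment.

```python
import threading
import tracemalloc
from http.server import BaseHTTPRequestHandler, HTTPServer


class NoopHealthHandler(BaseHTTPRequestHandler):
    """Trivial handler; only the cost of the server and thread is of interest."""

    def do_GET(self):
        self.send_response(200)
        self.end_headers()

    def log_message(self, format, *args):
        pass  # no per-request logging


def measure_server_allocations():
    """Return bytes of Python-level memory allocated by starting the server."""
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    server = HTTPServer(("127.0.0.1", 0), NoopHealthHandler)  # OS-chosen port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    server.shutdown()
    server.server_close()
    return after - before
```

Calling `measure_server_allocations()` returns a byte count that can be compared against the ~1MB figure above. It is a lower bound on the real overhead, not a replacement for measuring under actual memory pressure.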

#### Security Considerations (Theoretical)

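- Network exposure: Single port inside the pod, not published through a Service; as written the server binds all interfaces, so it is reachable on the pod IP unless a NetworkPolicy restricts it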
- Authentication: None needed (Kubernetes internal)
- Attack surface: Single GET endpoint, no data exposure
- Principle of least privilege: Health check only, no operational access

GitHub link: 
https://github.com/apache/airflow/discussions/53662#discussioncomment-13870335
