GitHub user JinRenNg edited a comment on the discussion: Kubernetes scheduler
liveness probe fails during OOM conditions preventing automatic restart
Thank you for the thoughtful response and questions! Let me address each point
with specific technical details:
## Deployment Context
**Environment:**
- Kubernetes: v1.27.16 and v1.25.16
- Airflow: 2.10.5 and 2.10.3 via official Apache Helm chart
- Executor: KubernetesExecutor
- Scale: ~60 Active DAGs
## Scope of Proposed Changes
I should clarify - I'm specifically proposing this for the scheduler component
only, not all Airflow components. Here's why:
### Scheduler-Specific Problem
- Memory pressure impact: Scheduler is most affected by memory leaks and
prolonged operation
- Critical path: Scheduler failure stops entire pipeline execution
- Current gap: exec probes fail during memory pressure, but scheduler process
continues running
- Kubernetes integration: Scheduler is the component most likely to be
auto-restarted by K8s
### Other Components
- Workers: Already handle OOM well (OOMKilled → restart)
- Webserver: Already runs an HTTP server, so a health endpoint could reuse it (Flask in Airflow 2.x, FastAPI in Airflow 3)
- Triggerer: Less memory-intensive, fewer reported issues
### Technical Implementation Approach
Based on your FastAPI concern, I'd like to propose a lightweight approach that
avoids heavy dependencies:
Note: I should mention that I'm not deeply familiar with the internal structure
of the Airflow source code, so the following represents my rough understanding
of how this could be implemented. I'd very much appreciate guidance from the
maintainers on the best architectural approach and would be happy to adjust the
implementation based on your recommendations.
#### Lightweight Built-in Health Server
The following is a first attempt at the design. Feedback on whether it fits
Airflow's design patterns and coding standards is very welcome.
```
# Minimal HTTP server, separate from FastAPI
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


class SchedulerHealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != '/health':
            # Answer unknown paths explicitly instead of closing without a response
            self._reply(404, b'not found')
            return
        # Lightweight check: only verify scheduler heartbeat recency
        last = getattr(self.server, 'last_heartbeat', None)
        if last is not None and time.time() - last < 60:
            self._reply(200, b'healthy')
        else:
            self._reply(503, b'unhealthy')

    def _reply(self, status, body):
        self.send_response(status)
        self.send_header('Content-Type', 'text/plain')
        self.end_headers()
        self.wfile.write(body)


class SchedulerHealthServer:
    def __init__(self, port=8080):
        # '' binds all interfaces, so the port is reachable on the pod IP
        self.server = HTTPServer(('', port), SchedulerHealthHandler)
        self.server.last_heartbeat = time.time()

    def start(self):
        # Run in a daemon thread so it never blocks scheduler shutdown
        thread = threading.Thread(target=self.server.serve_forever, daemon=True)
        thread.start()

    def update_heartbeat(self):
        self.server.last_heartbeat = time.time()
```
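To show how the endpoint would behave from a probe's point of view, here is a rough, self-contained sketch. It does not touch Airflow: `HealthHandler`, `probe`, and `run_demo` are hypothetical names, the handler is a compact stand-in for the one above, and the stalled scheduler is simulated by backdating the heartbeat timestamp.

```python
import threading
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

MAX_HEARTBEAT_AGE = 60  # seconds, same threshold as the proposal above


class HealthHandler(BaseHTTPRequestHandler):
    """Compact stand-in for the proposed handler, repeated so this runs alone."""

    def do_GET(self):
        age = time.time() - self.server.last_heartbeat
        healthy = self.path == "/health" and age < MAX_HEARTBEAT_AGE
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"healthy" if healthy else b"unhealthy")

    def log_message(self, format, *args):
        pass  # keep probe traffic out of stderr in this demo


def probe(port):
    """Return the HTTP status code a GET on /health produces."""
    url = f"http://127.0.0.1:{port}/health"
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # urllib raises on 5xx, the code is still what we want


def run_demo():
    # Port 0 lets the OS pick a free port, which avoids clashes in a demo.
    server = HTTPServer(("127.0.0.1", 0), HealthHandler)
    server.last_heartbeat = time.time()
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_address[1]

    fresh = probe(port)  # heartbeat was just written
    # Simulate a scheduler loop that stopped updating its heartbeat.
    server.last_heartbeat = time.time() - 2 * MAX_HEARTBEAT_AGE
    stale = probe(port)

    server.shutdown()
    server.server_close()
    return fresh, stale


if __name__ == "__main__":
    print(run_demo())
```

In a real integration, the scheduler loop would call `update_heartbeat()` once per iteration. Where exactly that hook belongs in the scheduler code is the part I would need maintainer guidance on.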
### Performance and Security Implications
Note: The following analysis is based on theoretical assumptions and design
expectations. I haven't conducted actual performance testing or security
analysis yet, so these estimates should be validated through proper testing and
community review.
#### Performance Impact (Estimated)
- Memory: ~1MB overhead for HTTP server thread (theoretical estimate)
- CPU: Negligible (~0.01% during probe checks, estimated)
- Network: Single local port binding
- Scheduler: Expected to be negligible for DAG processing. The daemon thread
shares the GIL with the scheduler loop but only runs while a probe is being
served (assumption, untested)
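These estimates could be checked with a small harness along the following lines. This is only a sketch under stated assumptions: `OkHandler` and `measure` are hypothetical names, `tracemalloc` sees only Python-level allocations (not the thread stack or interpreter overhead), and the figure also includes the client side of each request, so it is an upper bound for the Python heap only.

```python
import statistics
import threading
import time
import tracemalloc
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class OkHandler(BaseHTTPRequestHandler):
    """Hypothetical always-healthy handler; only server overhead matters here."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"healthy")

    def log_message(self, format, *args):
        pass  # silence per-request logging


def measure(probes=20):
    """Return Python heap growth and median latency for a series of local probes."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()

    server = HTTPServer(("127.0.0.1", 0), OkHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_address[1]}/health"

    latencies = []
    for _ in range(probes):
        start = time.perf_counter()
        with urllib.request.urlopen(url, timeout=5) as resp:
            resp.read()
        latencies.append(time.perf_counter() - start)

    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    server.shutdown()
    server.server_close()
    return {
        "python_heap_peak_bytes": peak - before,
        "median_probe_seconds": statistics.median(latencies),
    }


if __name__ == "__main__":
    print(measure())
```

A meaningful validation would run the same measurement inside a scheduler pod under memory pressure, since an idle laptop says little about the failure case this proposal targets.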
#### Security Considerations (Theoretical)
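- Network exposure: The port must listen on the pod IP so the kubelet can probe
it, so it is reachable from inside the cluster unless a NetworkPolicy restricts
it; no Service or Ingress is needed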
- Authentication: None needed (Kubernetes internal)
- Attack surface: Single GET endpoint, no data exposure
- Principle of least privilege: Health check only, no operational access
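On the "single GET endpoint" point, the standard library already rejects any HTTP method for which the handler defines no `do_<METHOD>` function, answering with 501. A minimal sketch to confirm that behavior (`GetOnlyHandler` and `status_for` are hypothetical names for illustration):

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class GetOnlyHandler(BaseHTTPRequestHandler):
    """Defines do_GET only, so every other method falls through to a 501."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"healthy")

    def log_message(self, format, *args):
        pass  # silence per-request logging


def status_for(method):
    """Start a throwaway server and return the status code for one request."""
    server = HTTPServer(("127.0.0.1", 0), GetOnlyHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection(
            "127.0.0.1", server.server_address[1], timeout=5
        )
        conn.request(method, "/health")
        status = conn.getresponse().status
        conn.close()
        return status
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    for method in ("GET", "POST", "DELETE"):
        print(method, status_for(method))
```

This only covers the method surface. It does not replace a proper security review of the endpoint.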
GitHub link:
https://github.com/apache/airflow/discussions/53662#discussioncomment-13870335