antonio-mello-ai opened a new pull request, #63583:
URL: https://github.com/apache/airflow/pull/63583

   ## Summary
   
   - Add a new `airflow celery worker-health-check` CLI command that detects the
"catatonic worker" state, in which a Celery worker is alive but has silently lost
its queue consumer registration after a Redis broker restart
   - The existing health check (`celery inspect ping`) does not detect this 
state — the worker responds to ping but consumes no tasks
   - The new command performs a two-stage check: (1) `inspect.ping()` to verify 
the worker is alive, then (2) `inspect.active_queues()` to verify it has 
registered queue consumers
   - Update the official docker-compose health check to use the new command
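
The two-stage check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `check_worker_health`, the reason strings, and the `FakeInspector` stand-in are hypothetical and exist only so the example runs without a broker. With a real Celery app, the inspector would typically come from `app.control.inspect(destination=[worker_name], timeout=...)`.

```python
from typing import Any, Optional


def check_worker_health(inspector: Any, worker_name: str) -> Optional[str]:
    """Return None when the worker looks healthy, else a short reason (hypothetical helper)."""
    # Stage 1: liveness. ping() yields a dict keyed by worker name,
    # or None when no worker replied within the timeout.
    ping = inspector.ping()
    if not ping or worker_name not in ping:
        return "no ping reply"

    # Stage 2: consumer registration. A catatonic worker answers ping
    # but reports no queues (None, missing entry, or an empty list).
    queues = inspector.active_queues()
    if not queues or not queues.get(worker_name):
        return "catatonic: no active queues"

    return None


class FakeInspector:
    """Stand-in for a Celery inspector so the sketch runs offline (assumption)."""

    def __init__(self, ping_reply, queues_reply):
        self._ping_reply = ping_reply
        self._queues_reply = queues_reply

    def ping(self):
        return self._ping_reply

    def active_queues(self):
        return self._queues_reply


if __name__ == "__main__":
    name = "celery@worker-1"
    healthy = FakeInspector({name: {"ok": "pong"}}, {name: [{"name": "default"}]})
    catatonic = FakeInspector({name: {"ok": "pong"}}, None)
    print(check_worker_health(healthy, name))    # None
    print(check_worker_health(catatonic, name))  # catatonic: no active queues
```

A CLI wrapper would then exit non-zero whenever a reason string is returned, which is what a container health check needs.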
   
   ## Root Cause
   
   This addresses a known upstream Celery bug (celery/celery#8030, 
celery/celery#9054, celery/celery#8990) where after a Redis broker restart, the 
worker reconnects at the transport level but fails to re-register its consumer 
on the queue. The worker process stays alive, `celery inspect ping` returns OK, 
but `inspect.active_queues()` returns `None`. Tasks pile up in the Redis queue 
and are never consumed.
   
   The partial fix in celery/celery#8796 did not fully resolve the issue, which
persists across Celery 5.2.x through 5.5.x with the Redis broker.
   
   This PR takes a **defensive approach on Airflow's side**: instead of
waiting for an upstream fix, we improve the health check so that it fails when
the worker is in the catatonic state, which lets the container be marked
unhealthy and restarted.
   
   ## Changes
   
   - **`celery_command.py`**: New `worker_health_check()` function with 
two-stage verification (ping + active_queues)
   - **`definition.py`**: Register `worker-health-check` CLI command with 
optional `-H` hostname argument
   - **`docker-compose.yaml`**: Update worker health check from `celery inspect 
ping` to `airflow celery worker-health-check`
   - **`test_celery_command.py`**: 7 new tests covering all scenarios (healthy, 
ping failure, catatonic states, auto-hostname)
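
For illustration, the optional `-H` argument with its `socket.gethostname()` fallback might look like the sketch below. It uses plain `argparse` rather than Airflow's CLI definition machinery, and the function name `resolve_worker_name` and the `celery@` prefix (Celery's default node name format) are assumptions, not details confirmed by this PR.

```python
import argparse
import socket
from typing import List, Optional


def resolve_worker_name(argv: Optional[List[str]] = None) -> str:
    """Return the worker name to inspect (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="worker-health-check")
    parser.add_argument(
        "-H",
        dest="hostname",
        default=None,
        help="Worker name to check; defaults to this machine's hostname",
    )
    args = parser.parse_args(argv)
    if args.hostname:
        return args.hostname
    # Assumption: the worker runs under Celery's default node name, celery@<hostname>.
    return f"celery@{socket.gethostname()}"


if __name__ == "__main__":
    print(resolve_worker_name(["-H", "celery@worker-1"]))  # celery@worker-1
    print(resolve_worker_name([]))  # depends on the local hostname
```

Defaulting to the local hostname keeps the docker-compose health check free of shell variable interpolation, since each container checks only its own worker.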
   
   ## Test Plan
   
   - [x] Worker healthy (ping + queues OK) → exits 0
   - [x] Ping returns None → exits non-zero
   - [x] Worker absent from ping result → exits non-zero
   - [x] Catatonic: ping OK but active_queues returns None → exits non-zero
   - [x] Catatonic: worker absent from active_queues → exits non-zero
   - [x] Catatonic: empty queue list → exits non-zero
   - [x] Auto-resolves hostname via `socket.gethostname()` when `-H` not 
provided
   - [x] All 7 new tests passing
   - [x] All pre-commit hooks pass
   
   Closes #63580
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)
   
   Co-Authored-By: Claude Opus 4.6 <[email protected]>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
