antonio-mello-ai opened a new pull request, #63583: URL: https://github.com/apache/airflow/pull/63583
## Summary

- Add new `airflow celery worker-health-check` CLI command that detects the "catatonic worker" state where a Celery worker is alive but has silently lost its queue consumer registration after a Redis broker restart
- The existing health check (`celery inspect ping`) does not detect this state: the worker responds to ping but consumes no tasks
- The new command performs a two-stage check: (1) `inspect.ping()` to verify the worker is alive, then (2) `inspect.active_queues()` to verify it has registered queue consumers
- Update the official docker-compose health check to use the new command

## Root Cause

This addresses a known upstream Celery bug (celery/celery#8030, celery/celery#9054, celery/celery#8990) where after a Redis broker restart, the worker reconnects at the transport level but fails to re-register its consumer on the queue. The worker process stays alive, `celery inspect ping` returns OK, but `inspect.active_queues()` returns `None`. Tasks pile up in the Redis queue and are never consumed.

The partial fix in celery/celery#8796 did not fully resolve the issue, and it persists across Celery 5.2.x through 5.5.x with Redis broker.

This PR takes a **defensive approach on Airflow's side**: instead of waiting for an upstream fix, we improve the health check to detect and recover from the catatonic state by triggering a container restart.
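For readers who want to see the shape of the two-stage check described in the Summary, here is a minimal sketch. It is not the code from this PR: the names `evaluate_worker_health` and `check_worker` are hypothetical, and the default worker name `celery@<hostname>` is an assumption. The `inspect` argument is expected to behave like the object returned by Celery's `app.control.inspect()`, whose `ping()` and `active_queues()` methods return a dict keyed by worker name, or `None` when no worker replies.

```python
import socket
from typing import Any, Mapping, Optional, Sequence, Tuple


def evaluate_worker_health(
    worker: str,
    ping_result: Optional[Mapping[str, Any]],
    queues_result: Optional[Mapping[str, Sequence[Any]]],
) -> Tuple[bool, str]:
    """Hypothetical helper: decide health from the two inspect replies."""
    # Stage 1: the worker must answer ping at all.
    if not ping_result or worker not in ping_result:
        return False, f"{worker} did not respond to ping"
    # Stage 2: the worker must report at least one consumed queue.
    # A None reply, a missing entry, or an empty list all count as catatonic.
    if not queues_result or not queues_result.get(worker):
        return False, f"{worker} is alive but has no registered queue consumers"
    return True, f"{worker} is healthy"


def check_worker(inspect: Any, worker: Optional[str] = None) -> int:
    """Hypothetical wrapper: return 0 if healthy, 1 otherwise."""
    # Assumption: workers are named celery@<hostname> when none is given.
    worker = worker or f"celery@{socket.gethostname()}"
    ping_result = inspect.ping()
    # Skip the second round trip when the ping already failed.
    queues_result = inspect.active_queues() if ping_result else None
    healthy, message = evaluate_worker_health(worker, ping_result, queues_result)
    print(message)
    return 0 if healthy else 1
```

Keeping the decision logic in a pure function separates it from the broker round trips, so each case in the test plan can be exercised with plain dicts instead of a live worker.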
## Changes

- **`celery_command.py`**: New `worker_health_check()` function with two-stage verification (ping + active_queues)
- **`definition.py`**: Register `worker-health-check` CLI command with optional `-H` hostname argument
- **`docker-compose.yaml`**: Update worker health check from `celery inspect ping` to `airflow celery worker-health-check`
- **`test_celery_command.py`**: 7 new tests covering all scenarios (healthy, ping failure, catatonic states, auto-hostname)

## Test Plan

- [x] Worker healthy (ping + queues OK) → exits 0
- [x] Ping returns None → exits non-zero
- [x] Worker absent from ping result → exits non-zero
- [x] Catatonic: ping OK but active_queues returns None → exits non-zero
- [x] Catatonic: worker absent from active_queues → exits non-zero
- [x] Catatonic: empty queue list → exits non-zero
- [x] Auto-resolves hostname via `socket.gethostname()` when `-H` not provided
- [x] All 7 new tests passing
- [x] All pre-commit hooks pass

Closes #63580

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.6 <[email protected]>

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
