antonio-mello-ai opened a new issue, #63580:
URL: https://github.com/apache/airflow/issues/63580
### Apache Airflow version
3.1.7
### If "Other Airflow 2 version" selected, which one?
_No response_
### What happened?
After a Redis broker restart (e.g., VM reboot), the Celery worker reconnects
at the transport level but silently loses its consumer registration on the task
queue. The worker process stays alive, `celery inspect ping` returns OK, but
`inspect.active_queues()` returns `None` — the worker is in a **catatonic
state** where it accepts no new tasks.
This is a known upstream Celery bug (celery/celery#8030, celery/celery#9054,
celery/celery#8990) that has persisted across Celery 5.2.x through 5.5.x with
Redis broker. The partial fix in celery/celery#8796 did not fully resolve it.
**The problem on Airflow's side**: The current worker liveness/health check
mechanism does not detect this state. Since `celery inspect ping` responds
normally, Docker/Kubernetes health probes pass, and the worker is never
restarted. Tasks accumulate in the Redis queue with state `queued`, hit the
scheduler's requeue limit, and are marked `failed`.
In our case, 302 tasks piled up in the `default` queue over ~14 hours before
we noticed. Seven DAGs failed, all with `Task requeue attempts exceeded max;
marking failed`.
### What you think should happen instead?
The Airflow Celery worker health check should verify that the worker has
active queue consumers, not just that it responds to ping. Specifically:
```python
# Current check (insufficient):
result = app.control.inspect().ping()
# Returns OK even in the catatonic state

# Proposed additional check:
queues = app.control.inspect().active_queues()
if queues is None or worker_name not in queues:
    # Worker is alive but not consuming — health check should FAIL
    return False
```
If the worker is alive but has no registered queues, the health check should
fail, triggering a container restart via the orchestrator (Docker, Kubernetes,
systemd).
This would go in the `airflow-providers-celery` package, likely in the
worker CLI health check logic.
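A minimal sketch of what such a check could look like as a standalone probe script. This is illustrative only: `worker_is_consuming` and the commented runtime wiring are not the provider's actual API, and `CELERY_BROKER_URL` / the worker hostname are assumptions about the deployment.

```python
def worker_is_consuming(active_queues, worker_name):
    """Return True only if the inspect() reply shows the worker consuming
    at least one queue.

    `active_queues` is the dict returned by
    `app.control.inspect().active_queues()`, or None when no worker
    replied — i.e. the catatonic state described above.
    """
    if not active_queues:
        # No reply at all: either the worker is down or it lost its
        # consumer registration after the broker restart.
        return False
    # A present key with an empty queue list also means "not consuming".
    return bool(active_queues.get(worker_name))


# Runtime wiring (requires Celery and a reachable broker), roughly:
#
#   import os, sys
#   from celery import Celery
#   app = Celery(broker=os.environ["CELERY_BROKER_URL"])
#   replies = app.control.inspect(timeout=5.0).active_queues()
#   sys.exit(0 if worker_is_consuming(replies, os.environ["WORKER_NAME"]) else 1)
```

Exiting non-zero lets any orchestrator-level health probe (Docker, Kubernetes, systemd) treat the catatonic worker as unhealthy and restart it.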
### How to reproduce
1. Start Airflow with CeleryExecutor and Redis broker
2. Confirm worker is consuming tasks normally
3. Restart Redis (e.g., `docker restart redis` or reboot the Redis host)
4. Observe: worker process stays alive, `celery inspect ping` returns OK
5. `celery inspect active_queues` returns `None` or empty for the worker
6. Schedule a DAG — task goes to `queued` state and is never picked up
7. After scheduler requeue attempts, task is marked `failed`
### Operating System
Debian 12 (Proxmox VM)
### Versions of Apache Airflow Providers
apache-airflow-providers-celery (installed with Airflow 3.1.7)
### Deployment
Docker Compose
### Deployment details
- Airflow 3.1.7 (Docker image `apache/airflow:3.1.7-python3.11`)
- Celery worker with 16 fork pool workers
- Redis 7.2.10 as broker
- PostgreSQL as result backend
- `worker_prefetch_multiplier = 1`
### Anything else?
**Related upstream Celery issues:**
- celery/celery#8030
- celery/celery#8091
- celery/celery#8990 (confirms not fixed in 5.4.0)
- celery/celery#9054
- celery/celery#9191
**Previous Airflow issues (closed as upstream):**
- #26542
- #32484
- #24498
- #27032
Those issues were rightfully closed as upstream Celery bugs. This issue
proposes a **defensive fix on Airflow's side** — improving the health check to
detect and recover from the catatonic state, regardless of when Celery fixes
the root cause.
**Workarounds:**
- `--without-heartbeat --without-gossip --without-mingle` avoids the code
path but loses cluster features
- `broker_connection_retry = False` makes the worker crash on broker loss
(requires restart policy)
- Custom health check script checking `active_queues()` instead of `ping`
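For the Docker Compose deployment described above, the third workaround might be wired in roughly like this (the service name and script path are assumptions; any script that exits non-zero when `active_queues()` is empty or missing the worker would do):

```yaml
services:
  airflow-worker:
    # ... existing image/env config ...
    healthcheck:
      test: ["CMD", "python", "/opt/airflow/worker_healthcheck.py"]
      interval: 30s
      timeout: 15s
      retries: 3
      start_period: 60s
```

Note that plain Docker Compose only marks the container unhealthy; an external watcher (or a Kubernetes liveness probe in that environment) is still needed to perform the actual restart.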
I'm willing to work on a PR for this.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]