antonio-mello-ai opened a new issue, #63580:
URL: https://github.com/apache/airflow/issues/63580

   ### Apache Airflow version
   
   3.1.7
   
   ### If "Other Airflow 2 version" selected, which one?
   
   _No response_
   
   ### What happened?
   
   After a Redis broker restart (e.g., VM reboot), the Celery worker reconnects 
at the transport level but silently loses its consumer registration on the task 
queue. The worker process stays alive, `celery inspect ping` returns OK, but 
`inspect.active_queues()` returns `None` — the worker is in a **catatonic 
state** where it accepts no new tasks.
   
   This is a known upstream Celery bug (celery/celery#8030, celery/celery#9054, 
celery/celery#8990) that has persisted across Celery 5.2.x through 5.5.x with 
Redis broker. The partial fix in celery/celery#8796 did not fully resolve it.
   
   **The problem on Airflow's side**: The current worker liveness/health check 
mechanism does not detect this state. Since `celery inspect ping` responds 
normally, Docker/Kubernetes health probes pass, and the worker is never 
restarted. Tasks accumulate in the Redis queue with state `queued`, hit the 
scheduler's requeue limit, and are marked `failed`.
   
   In our case, 302 tasks piled up in the `default` queue over ~14 hours before 
we noticed. 7 DAG runs failed, all with `Task requeue attempts exceeded max; 
marking failed`.
   
   ### What you think should happen instead?
   
   The Airflow Celery worker health check should verify that the worker has 
active queue consumers, not just that it responds to ping. Specifically:
   
   ```python
   # Current check (insufficient):
   result = app.control.inspect().ping()
   # Returns OK even in the catatonic state

   # Proposed additional check ("app" is the worker's Celery application,
   # "worker_name" its nodename, e.g. "celery@hostname"):
   queues = app.control.inspect().active_queues()
   if queues is None or worker_name not in queues:
       # Worker is alive but not consuming — health check should FAIL
       return False
   ```
   
   If the worker is alive but has no registered queues, the health check should 
fail, triggering a container restart via the orchestrator (Docker, Kubernetes, 
systemd).
   
   This check would belong in the `apache-airflow-providers-celery` package, 
likely in the worker CLI's health-check logic.
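   
   One way to structure this is to keep the decision as a pure predicate over the 
`active_queues()` reply, so the broker round-trip stays separate from logic the 
provider can unit-test. A minimal sketch (the function name is hypothetical and 
not existing provider code; the reply shape follows Celery's inspect API, which 
maps worker nodenames to lists of queue declarations, or `None` when no worker 
replied):
   
   ```python
from typing import Optional


def has_active_consumers(
    active_queues: Optional[dict], worker_name: str, expected_queue: str = "default"
) -> bool:
    """Return True only if `worker_name` reports a consumer on `expected_queue`.

    `active_queues` is the mapping returned by
    `app.control.inspect().active_queues()`: nodename -> list of queue
    declarations, or None when no worker replied at all.
    """
    if not active_queues:
        # No reply at all: the worker may be alive (ping OK) but catatonic.
        return False
    declarations = active_queues.get(worker_name)
    if not declarations:
        # Worker replied but registered no consumers — the failure mode here.
        return False
    return any(q.get("name") == expected_queue for q in declarations)
   ```
   
   The health check would then call `app.control.inspect().active_queues()` and 
report unhealthy (non-zero exit) whenever this returns `False`, letting the 
orchestrator restart the container.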
   
   ### How to reproduce
   
   1. Start Airflow with CeleryExecutor and Redis broker
   2. Confirm worker is consuming tasks normally
   3. Restart Redis (e.g., `docker restart redis` or reboot the Redis host)
   4. Observe: worker process stays alive, `celery inspect ping` returns OK
   5. `celery inspect active_queues` returns `None` or empty for the worker
   6. Schedule a DAG — task goes to `queued` state and is never picked up
   7. After scheduler requeue attempts, task is marked `failed`
   
   ### Operating System
   
   Debian 12 (Proxmox VM)
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-celery (installed with Airflow 3.1.7)
   
   ### Deployment
   
   Docker Compose
   
   ### Deployment details
   
   - Airflow 3.1.7 (Docker image `apache/airflow:3.1.7-python3.11`)
   - Celery worker with 16 fork pool workers
   - Redis 7.2.10 as broker
   - PostgreSQL as result backend
   - `worker_prefetch_multiplier = 1`
   
   ### Anything else?
   
   **Related upstream Celery issues:**
   - celery/celery#8030
   - celery/celery#8091
   - celery/celery#8990 (confirms not fixed in 5.4.0)
   - celery/celery#9054
   - celery/celery#9191
   
   **Previous Airflow issues (closed as upstream):**
   - #26542
   - #32484
   - #24498
   - #27032
   
   Those issues were rightfully closed as upstream Celery bugs. This issue 
proposes a **defensive fix on Airflow's side** — improving the health check to 
detect and recover from the catatonic state, regardless of when Celery fixes 
the root cause.
   
   **Workarounds:**
   - `--without-heartbeat --without-gossip --without-mingle` avoids the code 
path but loses cluster features
   - `broker_connection_retry = False` makes the worker crash on broker loss 
(requires restart policy)
   - Custom health check script checking `active_queues()` instead of `ping`
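   
   For the third workaround, a sketch of such a script, shelling out to the 
celery CLI so it can run unchanged as a Docker `HEALTHCHECK` command. The 
`--app` module path is a placeholder for your deployment's Celery app, and the 
`--json` reply is assumed to be a list of `{worker: reply}` mappings as emitted 
by Celery 5:
   
   ```python
# Workaround health-check sketch: exit non-zero when the worker reports no
# active queue consumers, even though it may still answer ping.
import json
import socket
import subprocess


def consumers_ok(output: str, worker_name: str) -> bool:
    """Parse `celery inspect active_queues --json` output; True only if
    `worker_name` reports at least one consumed queue."""
    try:
        replies = json.loads(output)
    except json.JSONDecodeError:
        return False
    if isinstance(replies, dict):  # tolerate a bare mapping as well
        replies = [replies]
    if not isinstance(replies, list):
        return False
    return any(isinstance(r, dict) and r.get(worker_name) for r in replies)


def main() -> int:
    worker_name = f"celery@{socket.gethostname()}"  # default Celery nodename
    proc = subprocess.run(
        [
            "celery",
            "--app", "YOUR_AIRFLOW_CELERY_APP",  # placeholder module path
            "inspect", "active_queues", "--json", "--timeout", "5",
        ],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0 or not consumers_ok(proc.stdout, worker_name):
        return 1  # no consumers registered: let the orchestrator restart us
    return 0


# The real script would end with:
#     if __name__ == "__main__":
#         raise SystemExit(main())
# (left out here so importing the module has no side effects).
   ```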
   
   I'm willing to work on a PR for this.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
