eladkal commented on PR #63583:
URL: https://github.com/apache/airflow/pull/63583#issuecomment-4137822323

   > @eladkal Fair point — submitting upstream fixes instead.
   > 
   > I've identified the root cause and submitted two companion PRs:
   > 
   > 1. **Kombu** 
[celery/kombu#2492](https://github.com/celery/kombu/pull/2492) — 
`_on_disconnect` in the Redis transport removes the `on_poll_start` callback 
from the event loop on disconnect. During reconnection, a stale channel's 
`_on_disconnect` can fire _after_ the new channel has re-registered, removing 
the new callback. The worker stays alive but never calls `_register_BRPOP` — 
tasks pile up and are never consumed. Fix: stop removing `on_poll_start` (it's 
idempotent).
   > 2. **Celery** 
[celery/celery#10204](https://github.com/celery/celery/pull/10204) — `synloop` 
(gevent/eventlet path) lacks the `hub.reset()` cleanup that `asynloop` already 
has. Stale state persists across reconnection, preventing consumer 
re-registration. Fix: add the same `hub.reset()` pattern.
   > 
   > These two fixes together address the "catatonic worker" problem at the 
root. If they're accepted upstream, the health check in this PR becomes 
unnecessary. Happy to close this PR and track progress on the upstream fixes 
instead, or keep it open as a defense-in-depth measure until the upstream fixes 
land in a release — your call.
   
   
   Looks like both PRs were merged and are targeted for Celery 5.7.0.
   So I assume this PR just needs to update the minimum Celery version once 
upstream releases it.
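
For readers following the thread, the reconnection race described in item 1 of the quoted comment can be modelled with a toy event loop. This is only a simplified sketch under stated assumptions, not Kombu's actual code: `ToyHub`, `run_reconnect`, and the `remove_on_disconnect` flag are hypothetical names invented for illustration, and the callback set merely stands in for wherever the real hub keeps its poll-start callbacks.

```python
class ToyHub:
    """Hypothetical stand-in for an event loop with poll-start callbacks."""

    def __init__(self):
        # A set, so registering the same callback twice is idempotent.
        self.on_tick = set()


def run_reconnect(remove_on_disconnect: bool) -> bool:
    """Replay the reconnect ordering; return True if polling still happens."""
    hub = ToyHub()
    polled = []

    def on_poll_start():
        # Stands in for the step that would issue BRPOP on the queues.
        polled.append("BRPOP registered")

    def register():
        hub.on_tick.add(on_poll_start)

    def on_disconnect():
        # Behaviour described as the bug: the handler drops the callback.
        if remove_on_disconnect:
            hub.on_tick.discard(on_poll_start)

    register()       # old channel registers the callback
    register()       # connection drops; new channel re-registers first
    on_disconnect()  # the stale channel's disconnect handler fires late

    # One tick of the toy loop: run whatever callbacks are left.
    for callback in list(hub.on_tick):
        callback()
    return bool(polled)


if __name__ == "__main__":
    print("removing on disconnect:", run_reconnect(True))
    print("keeping the callback:  ", run_reconnect(False))
```

In this model the late disconnect handler wipes out the callback the new channel depends on, so the loop keeps ticking but never polls. Leaving the callback registered avoids that, which matches the reasoning given for the Kombu fix.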


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
