antonio-mello-ai commented on PR #63583: URL: https://github.com/apache/airflow/pull/63583#issuecomment-4063080825
@eladkal Fair point, submitting upstream fixes instead. I've identified the root cause and submitted two companion PRs:

1. **Kombu** [celery/kombu#2492](https://github.com/celery/kombu/pull/2492): `_on_disconnect` in the Redis transport removes the `on_poll_start` callback from the event loop on disconnect. During reconnection, a stale channel's `_on_disconnect` can fire *after* the new channel has re-registered, removing the new callback. The worker stays alive but never calls `_register_BRPOP`, so tasks pile up and are never consumed. Fix: stop removing `on_poll_start` (it's idempotent).
2. **Celery** [celery/celery#10204](https://github.com/celery/celery/pull/10204): `synloop` (gevent/eventlet path) lacks the `hub.reset()` cleanup that `asynloop` already has. Stale state persists across reconnection, preventing consumer re-registration. Fix: add the same `hub.reset()` pattern.

These two fixes together address the "catatonic worker" problem at the root. If they're accepted upstream, the health check in this PR becomes unnecessary.

Happy to close this PR and track progress on the upstream fixes instead, or keep it open as a defense-in-depth measure until the upstream fixes land in a release. Your call.
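
For anyone following along, the ordering problem in item 1 can be illustrated with a toy model. This is a minimal sketch, not Kombu's code: `ToyLoop`, `ToyChannel`, and `simulate` are hypothetical names, and the sketch assumes only what the description above states, namely that the new and the stale channel refer to the same callback in the loop's callback set.

```python
def on_poll_start():
    """Placeholder for the callback that would re-issue the blocking pop."""


class ToyLoop:
    """Hypothetical event loop holding a set of per-tick callbacks."""

    def __init__(self):
        self.on_tick = set()


class ToyChannel:
    """Hypothetical channel; not the real Kombu Redis channel."""

    def __init__(self, loop, remove_on_disconnect):
        self.loop = loop
        self.remove_on_disconnect = remove_on_disconnect

    def register(self):
        # Adding to a set is idempotent: registering twice is harmless.
        self.loop.on_tick.add(on_poll_start)

    def on_disconnect(self):
        # The behavior under discussion: drop the callback on disconnect.
        if self.remove_on_disconnect:
            self.loop.on_tick.discard(on_poll_start)


def simulate(remove_on_disconnect):
    """Return True if the callback is still registered after reconnection."""
    loop = ToyLoop()
    old = ToyChannel(loop, remove_on_disconnect)
    old.register()

    # The connection drops and the replacement channel registers first.
    new = ToyChannel(loop, remove_on_disconnect)
    new.register()

    # The stale channel's disconnect handler fires late.
    old.on_disconnect()
    return on_poll_start in loop.on_tick


callback_survives_with_removal = simulate(remove_on_disconnect=True)
callback_survives_without_removal = simulate(remove_on_disconnect=False)
```

In this model, removing the callback on disconnect leaves the loop with nothing to call after the late disconnect, while skipping the removal keeps the callback registered. Because registration is a set insertion, leaving it in place costs nothing.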
