Vamsi-klu commented on PR #63583:
URL: https://github.com/apache/airflow/pull/63583#issuecomment-4061375105
Hey @antonio-mello-ai, the catatonic worker problem after broker restarts is
a real pain — we've seen similar issues in production, so this is a welcome
addition. The two-stage check (ping then active_queues) gives good diagnostic
clarity. A few things I noticed:
**`os.EX_UNAVAILABLE` is Unix-only.** It doesn't exist on Windows, and every
other `SystemExit` in this same file uses a descriptive string message (which
gives exit code 1). Switching to something like `raise SystemExit("Worker
health check failed: no active queues")` would be more consistent and portable.
That also removes the need for the `import os` inside the function.
**No timeout on the inspect calls.** If the broker is unreachable or slow,
`inspect.ping()` and `inspect.active_queues()` can hang indefinitely. Docker's
container-level timeout provides some protection, but for a health check
specifically, adding `timeout=5.0` to the `inspect()` constructor would be good
defense in depth. Something like:
```python
inspector = celery_app.control.inspect(destination=[hostname], timeout=5.0)
```
**No exception handling around the inspect calls.** If the broker is fully
down (not just restarted), these calls could raise `ConnectionError` or
similar. Right now that would produce an unhandled traceback instead of a clean
health check failure. A try/except that logs a clear message and exits non-zero
would make the output much cleaner for orchestrators parsing health check logs.
**Missing test for broker connection errors.** The 7 tests cover the happy
path and the various catatonic states well, but there's no test for what
happens when `inspect.ping()` itself raises an exception (e.g., broker
unreachable). That's probably the most common real-world failure scenario for a
health check.
**Minor:** `import socket` is a stdlib module with no initialization side
effects, so the "avoid triggering Providers Manager" rationale for local
imports doesn't really apply to it. Could be at module level.
Overall the approach is solid — detecting catatonic workers through
`active_queues()` is the right way to catch this specific upstream Celery bug.
Just needs a bit of hardening around edge cases.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]