Vamsi-klu commented on PR #63583:
URL: https://github.com/apache/airflow/pull/63583#issuecomment-4061375105

   Hey @antonio-mello-ai, the catatonic worker problem after broker restarts is 
a real pain — we've seen similar issues in production, so this is a welcome 
addition. The two-stage check (ping then active_queues) gives good diagnostic 
clarity. A few things I noticed:
   
   **`os.EX_UNAVAILABLE` is Unix-only.** It doesn't exist on Windows, and every 
other `SystemExit` in this same file uses a descriptive string message (which 
gives exit code 1). Switching to something like `raise SystemExit("Worker 
health check failed: no active queues")` would be more consistent and portable. 
That also removes the need for the `import os` inside the function.
   
   **No timeout on the inspect calls.** If the broker is unreachable or slow, 
`inspect.ping()` and `inspect.active_queues()` can hang indefinitely. Docker's 
container-level timeout provides some protection, but for a health check 
specifically, adding `timeout=5.0` to the `inspect()` constructor would be good 
defense in depth. Something like:
   
   ```python
   inspector = celery_app.control.inspect(destination=[hostname], timeout=5.0)
   ```
   
   **No exception handling around the inspect calls.** If the broker is fully 
down (not just restarted), these calls could raise `ConnectionError` or 
similar. Right now that would produce an unhandled traceback instead of a clean 
health check failure. A try/except that logs a clear message and exits non-zero 
would make the output much cleaner for orchestrators parsing health check logs.
   
   **Missing test for broker connection errors.** The 7 tests cover the happy 
path and the various catatonic states well, but there's no test for what 
happens when `inspect.ping()` itself raises an exception (e.g., broker 
unreachable). That's probably the most common real-world failure scenario for a 
health check.
   
   **Minor:** `import socket` is a stdlib module with no initialization side 
effects, so the "avoid triggering Providers Manager" rationale for local 
imports doesn't really apply to it. Could be at module level.
   
   Overall the approach is solid — detecting catatonic workers through 
`active_queues()` is the right way to catch this specific upstream Celery bug. 
Just needs a bit of hardening around edge cases.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to