jrod9k1 opened a new issue, #60144:
URL: https://github.com/apache/airflow/issues/60144
### Apache Airflow version
3.1.5
### If "Other Airflow 3 version" selected, which one?
_No response_
### What happened?
When configuring a celery worker to use the "gevent" pool type, task logs for
task runs fail to load in the frontend UI, instead erroring with
```
Log message source details sources=["Could not read served logs: HTTPConnectionPool(host='{workername}', port=8793): Read timed out. (read timeout=5)"]
```
This seems to have been caused by moving the worker's log-serving functionality
to FastAPI in 3.1.x; gevent's monkey patching breaks the async behavior there.
Links/details below.
### What you think should happen instead?
Logs should successfully return when visiting a task run in the UI.
### How to reproduce
1. Setup an Airflow site with a single celery worker
2. Configure the worker to use the gevent pool type
```
[celery]
# ... snip ...
pool = gevent
```
3. Run a task on the worker
4. Attempt to view the log for the task run in the UI (or fetch it directly from the worker's log server, as sketched below)
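
To reproduce the timeout without the UI, the worker's log server (port 8793) can be queried directly. A minimal sketch, assuming the `/log/...` route implied by the strace output further down; note the api-server normally attaches a signed JWT to this request, so an unauthenticated call may be rejected rather than time out:

```python
# Sketch: fetch a served log straight from the worker. The path mirrors the
# on-disk layout seen in the strace output below and is illustrative only.
import requests

url = (
    "http://workername:8793/log/"
    "dag_id=tutorial_dag/run_id=manual__2025-12-11T16:46:31+00:00/"
    "task_id=extract/attempt=1.log"
)
# The UI's request uses a 5s read timeout, which is what surfaces the error.
print(requests.get(url, timeout=5).status_code)
```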
### Operating System
SUSE Linux Enterprise Server 15 SP5
### Versions of Apache Airflow Providers
apache-airflow-providers-apache-kafka 1.11.0
apache-airflow-providers-atlassian-jira 3.3.0
apache-airflow-providers-celery 3.14.0
apache-airflow-providers-cncf-kubernetes 10.11.0
apache-airflow-providers-common-compat 1.10.0
apache-airflow-providers-common-io 1.7.0
apache-airflow-providers-common-sql 1.30.0
apache-airflow-providers-docker 4.5.0
apache-airflow-providers-elasticsearch 6.4.0
apache-airflow-providers-fab 3.0.3
apache-airflow-providers-http 5.6.0
apache-airflow-providers-influxdb 2.10.0
apache-airflow-providers-microsoft-mssql 4.4.0
apache-airflow-providers-microsoft-winrm 3.13.0
apache-airflow-providers-mysql 6.4.0
apache-airflow-providers-openlineage 2.9.0
apache-airflow-providers-opensearch 1.8.0
apache-airflow-providers-opsgenie 5.10.0
apache-airflow-providers-postgres 6.5.0
apache-airflow-providers-smtp 2.4.0
apache-airflow-providers-ssh 4.2.0
apache-airflow-providers-standard 1.10.0
### Deployment
Other
### Deployment details
We deploy to VMs using a custom-built conda-pack'd virtual environment.
Deployment is performed with Ansible, into a secure environment without
internet access.
### Anything else?
- I'd be willing to submit a PR if this is within my capabilities, though
this feels like a larger architectural decision beyond the reach of a casual PR.
- This bug can also be reproduced by setting the env var
`_AIRFLOW_PATCH_GEVENT=1`, as it triggers the same monkey-patching logic (a
quick check for active patching is sketched below).
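
For what it's worth, gevent's public `is_module_patched` helper makes it easy to confirm which modules got patched in a given process:

```python
# Quick check that gevent monkey patching is active and which modules it hit.
from gevent import monkey

for mod in ("thread", "queue", "socket", "select"):
    print(mod, monkey.is_module_patched(mod))
```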
I chased it for a bit, and this bug seems specific to Airflow 3.1+, since
that is when the worker log functionality was moved over to FastAPI:
https://github.com/apache/airflow/pull/52581
Switching the celery worker to the gevent pool triggers monkey patching that
patches over core modules:
https://github.com/apache/airflow/blob/3.1.3/providers/celery/src/airflow/providers/celery/cli/celery_command.py#L262
That causes problems for FastAPI's async logic, which is a known
incompatibility (a minimal sketch follows):
https://github.com/fastapi/fastapi/discussions/6395
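
Here is a minimal sketch of the failure mode as I understand it, outside of Airflow entirely (names are illustrative). FastAPI/Starlette offload blocking work, such as reading a log file, to a thread pool via `loop.run_in_executor`; once gevent patches the "thread" module, those executor threads become greenlets that the asyncio event loop never yields to:

```python
# Minimal non-Airflow sketch of the suspected deadlock (illustrative).
from gevent import monkey

monkey.patch_all()  # same call the celery provider triggers for pool=gevent

import asyncio


async def serve_log():
    loop = asyncio.get_running_loop()
    # Stand-in for FastAPI offloading a blocking file read to a worker
    # thread. Under a patched "thread" module this becomes a greenlet that
    # only runs when gevent's hub runs, so the await can hang forever.
    return await loop.run_in_executor(None, lambda: "log contents")


result = asyncio.run(serve_log())  # hangs here in the failing case
print(result)
```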
I confirmed with wireshark and strace on our Airflow site that the
api-server made the call for the log: it was received and TCP-ack'd, and the
worker stat'd the correct log file on disk before pinging an eventfd and then
hanging. I am guessing this is where some FastAPI async logic under the
hood blows up, i.e. the bug
```
[pid 47886] lstat("../logs/airflow/base", {st_mode=S_IFDIR|0740, st_size=62, ...}) = 0
[pid 47886] stat(".../logs/airflow/base/dag_id=tutorial_dag/run_id=manual__2025-12-11T16:46:31+00:00/task_id=extract/attempt=1.log", {st_mode=S_IFREG|0644, st_size=656, ...}) = 0
[pid 47886] mmap(NULL, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4371b3a000
[pid 47886] gettid() = 47886
[pid 47886] epoll_pwait(11, [{events=EPOLLIN, data={u32=18, u64=18}}], 1024, 0, NULL, 8) = 1
[pid 47886] read(18, "\1\0\0\0\0\0\0\0", 1024) = 8
[pid 47886] epoll_pwait(11, [], 1024, 0, NULL, 8) = 0
[pid 47886] epoll_pwait(11, [], 1024, 0, NULL, 8) = 0
```
Largely irrelevant for Airflow, but the core issue seems to be the monkey
patching of both the "queue" and "thread" modules. A quick-and-dirty patch of
the celery module in a running Airflow instance makes the bug disappear: in
`lib/python3.12/site-packages/celery/__init__.py`, changing
```python
gevent.monkey.patch_all()
```
to
```python
gevent.monkey.patch_all(queue=False, thread=False)
```
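
If editing site-packages isn't an option, the same selective patch could in principle be applied at interpreter startup instead, e.g. from a `sitecustomize.py` on the worker's `PYTHONPATH`. A hypothetical sketch (it assumes downstream callers invoke the patch as `monkey.patch_all(...)`):

```python
# Hypothetical sitecustomize.py: wrap gevent.monkey.patch_all so later
# callers (celery, the Airflow celery provider) apply the selective patch.
from gevent import monkey

_orig_patch_all = monkey.patch_all


def _selective_patch_all(*args, **kwargs):
    # Leave real threads/queues unpatched so the FastAPI log server's
    # executor threads keep working.
    kwargs["thread"] = False
    kwargs["queue"] = False
    return _orig_patch_all(*args, **kwargs)


monkey.patch_all = _selective_patch_all
```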
### Are you willing to submit PR?
- [x] Yes I am willing to submit a PR!
### Code of Conduct
- [x] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)