jrod9k1 opened a new issue, #60144:
URL: https://github.com/apache/airflow/issues/60144

   ### Apache Airflow version
   
   3.1.5
   
   ### If "Other Airflow 3 version" selected, which one?
   
   _No response_
   
   ### What happened?
   
   When configuring a celery worker to use the "gevent" pool type, task logs 
for task runs fail to load in the frontend UI, instead erroring with
   
   ```
   Log message source details sources=["Could not read served logs: 
HTTPConnectionPool(host='{workername}', port=8793): Read timed out. (read 
timeout=5)"]
   ```
   
   This seems to have been caused by moving the worker's log-serving 
functionality to FastAPI in 3.1.x; gevent's monkey patching breaks async 
behavior there. Links/details below.
   
   ### What you think should happen instead?
   
   Logs should successfully return when visiting a task run in the UI.
   
   ### How to reproduce
   
   1. Set up an Airflow site with a single celery worker
   2. Configure the worker to use the gevent pool type
   
   ```
   [celery]
   # ... snip ...
   pool = gevent
   ```
   
   3. Run a task on the worker
   4. Attempt to view the log for the task run in the UI
   
   ### Operating System
   
   SUSE Linux Enterprise Server 15 SP5
   
   ### Versions of Apache Airflow Providers
   
   ```
   apache-airflow-providers-apache-kafka     1.11.0
   apache-airflow-providers-atlassian-jira   3.3.0
   apache-airflow-providers-celery           3.14.0
   apache-airflow-providers-cncf-kubernetes  10.11.0
   apache-airflow-providers-common-compat    1.10.0
   apache-airflow-providers-common-io        1.7.0
   apache-airflow-providers-common-sql       1.30.0
   apache-airflow-providers-docker           4.5.0
   apache-airflow-providers-elasticsearch    6.4.0
   apache-airflow-providers-fab              3.0.3
   apache-airflow-providers-http             5.6.0
   apache-airflow-providers-influxdb         2.10.0
   apache-airflow-providers-microsoft-mssql  4.4.0
   apache-airflow-providers-microsoft-winrm  3.13.0
   apache-airflow-providers-mysql            6.4.0
   apache-airflow-providers-openlineage      2.9.0
   apache-airflow-providers-opensearch       1.8.0
   apache-airflow-providers-opsgenie         5.10.0
   apache-airflow-providers-postgres         6.5.0
   apache-airflow-providers-smtp             2.4.0
   apache-airflow-providers-ssh              4.2.0
   apache-airflow-providers-standard         1.10.0
   ```
   
   (all installed from PyPI in a conda environment)
   
   ### Deployment
   
   Other
   
   ### Deployment details
   
   We deploy to VMs using a custom-built conda-pack'd virtual environment. 
Deployment is performed using Ansible, into a secure environment without 
internet access.
   
   ### Anything else?
   
   - I'd be willing to submit a PR if this is within my capabilities, though 
this feels like a larger architectural decision beyond the reach of a casual PR.
   - This bug can also be reproduced by setting the env var 
_AIRFLOW_PATCH_GEVENT=1, since that also triggers the monkey-patching logic.
   
   I chased this for a bit, and the bug appears to be specific to Airflow 3.1+, 
since that is when the worker log functionality was moved over to FastAPI: 
https://github.com/apache/airflow/pull/52581
   
   When you flip the celery worker to the gevent pool, it triggers monkey 
patching that patches over core modules: 
https://github.com/apache/airflow/blob/3.1.3/providers/celery/src/airflow/providers/celery/cli/celery_command.py#L262
   
   This seems to break async logic in FastAPI, which is a known 
incompatibility: https://github.com/fastapi/fastapi/discussions/6395
   
   I confirmed with Wireshark and strace on our Airflow site that the 
api-server makes the call for the log, the request is received and TCP-ack'd, 
and the worker stats the correct log file on disk before pinging an eventfd 
and going out to lunch. I'm guessing this is where some FastAPI async logic 
under the hood blows up, i.e. the bug:
   
   ```
   [pid 47886] lstat("../logs/airflow/base", {st_mode=S_IFDIR|0740, st_size=62, ...}) = 0
   [pid 47886] stat(".../logs/airflow/base/dag_id=tutorial_dag/run_id=manual__2025-12-11T16:46:31+00:00/task_id=extract/attempt=1.log", {st_mode=S_IFREG|0644, st_size=656, ...}) = 0
   [pid 47886] mmap(NULL, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4371b3a000
   [pid 47886] gettid()                    = 47886
   [pid 47886] epoll_pwait(11, [{events=EPOLLIN, data={u32=18, u64=18}}], 1024, 0, NULL, 8) = 1
   [pid 47886] read(18, "\1\0\0\0\0\0\0\0", 1024) = 8
   [pid 47886] epoll_pwait(11, [], 1024, 0, NULL, 8) = 0
   [pid 47886] epoll_pwait(11, [], 1024, 0, NULL, 8) = 0
   ```
   
   Largely irrelevant for Airflow, but the core issue seems to be the monkey 
patching of both the `queue` and `thread` modules. Doing a quick-and-dirty 
patch of the celery module in a running Airflow instance makes the bug 
disappear: changing
   
   `lib/python3.12/site-packages/celery/__init__.py`
   
   ```python
   gevent.monkey.patch_all()
   ```
   
   to
   
   ```python
   gevent.monkey.patch_all(queue=False, thread=False)
   ```
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

