8silvergun opened a new issue, #62335:
URL: https://github.com/apache/airflow/issues/62335

   ### Apache Airflow Provider(s)
   
   fab
   
   ### Versions of Apache Airflow Providers
   
   ```
   apache-airflow-providers-fab==3.3.0
   apache-airflow-providers-common-sql==1.28.2
   apache-airflow-providers-mysql==6.3.4
   apache-airflow-providers-cncf-kubernetes==10.11.0
   apache-airflow-providers-celery==3.13.0
   apache-airflow-providers-standard==1.10.0
   ```
   
   ### Apache Airflow version
   
   3.1.6
   
   ### Operating System
   
   Debian 12 (bookworm) — official Airflow Docker image
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   - **Kubernetes**: Amazon EKS
   - **Metadata DB**: Amazon Aurora MySQL (MySQL 8.0 compatible)
   - **MySQL `wait_timeout`**: 28800 seconds (8 hours)
   - **SQLAlchemy pool config**: `pool_recycle=60`, `pool_pre_ping=true`, 
`pool_size=3`, `max_overflow=2`
   - **api-server replicas**: 2 pods
   - **Airflow image**: Custom image based on `apache/airflow:3.1.6` with 
`providers-fab==3.3.0`
   
   ### What happened
   
   The `cleanup_session_middleware` introduced in PR #61480 (included in 
`providers-fab 3.3.0`) calls `Session.remove()` in a bare `finally` block 
without any error handling. When the underlying MySQL connection has been 
closed server-side (e.g., due to timeout, network interruption, or Aurora 
failover), `Session.remove()` internally attempts a `ROLLBACK` on the dead 
connection, which raises `MySQLdb.OperationalError: (2006, 'Server has gone 
away')`.
   
   This unhandled exception propagates up as a **500 Internal Server Error** to 
the client, even though the original request may have completed successfully.
   
   **Error log from api-server pod:**
   
   ```
   2026-02-20T05:50:24.526091553Z [error    ] Exception in ASGI application 
[airflow.providers.fab.auth_manager.fab_auth_manager] 
loc=fab_auth_manager.py:243
   Traceback (most recent call last):
     File ".../uvicorn/protocols/http/h11_impl.py", line 406, in run_asgi
       result = await app(self.scope, self.receive, self.send)
     ...
     File ".../airflow/providers/fab/auth_manager/fab_auth_manager.py", line 
243, in cleanup_session_middleware
       settings.Session.remove()
     File ".../sqlalchemy/orm/scoping.py", line 246, in remove
       self.registry().close()
     File ".../sqlalchemy/orm/session.py", line 2081, in close
       self._close_impl(invalidate=False)
     File ".../sqlalchemy/orm/session.py", line 2124, in _close_impl
       self.rollback()
     ...
     File ".../MySQLdb/connections.py", line 260, in query
       _mysql.connection.query(self, query)
   MySQLdb.OperationalError: (2006, 'Server has gone away')
   ```
   
   **Relevant source code** (`fab_auth_manager.py`, lines 235-243):
   
   ```python
   async def cleanup_session_middleware(request, call_next):
       try:
           response = await call_next(request)
           return response
       finally:
           from airflow import settings
   
           if settings.Session:
               settings.Session.remove()  # <-- unhandled exception here
   ```
   
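   The `finally` block does not catch exceptions from `Session.remove()`. An exception raised inside `finally` replaces the pending `return response`, so the client receives the cleanup error instead of the response that was already built. Since this is a cleanup operation, any failure here should be logged and suppressed, not propagated to the client.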
   
   
   ### What you think should happen instead
   
   `Session.remove()` in the `finally` block should be wrapped with 
`suppress(Exception)` to gracefully handle database connection errors during 
cleanup. The cleanup middleware's purpose is to prevent stale sessions — if 
cleanup itself fails because the connection is already dead, that's not an 
error that should affect the HTTP response.
   
   **Suggested fix:**
   ```python
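   from contextlib import suppress  # already imported in fab_auth_manager.py
   
   
   async def cleanup_session_middleware(request, call_next):
       try:
           response = await call_next(request)
           return response
       finally:
           from airflow import settings
   
           if settings.Session:
               # A failed cleanup must not replace the response being returned.
               with suppress(Exception):
                   settings.Session.remove()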
   ```
   
   This is consistent with the `suppress(Exception)` pattern already used in 
`deserialize_user` (PR [#62153](https://github.com/apache/airflow/pull/62153), 
merged 2026-02-19) for identical session cleanup error handling. The `from 
contextlib import suppress` import already exists in the file.
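
If the failure should also be logged rather than silently dropped, a possible variant catches the exception and records it. This is only a sketch under assumptions, not the patch itself: `DeadSession`, `make_cleanup_middleware`, and `ok_handler` are hypothetical helpers so the example runs without Airflow, and `ConnectionError` stands in for `MySQLdb.OperationalError`.

```python
import asyncio
import logging

log = logging.getLogger(__name__)


class DeadSession:
    """Hypothetical stand-in for settings.Session with a dead connection."""

    def remove(self):
        raise ConnectionError("(2006, 'Server has gone away')")


def make_cleanup_middleware(session):
    """Build a middleware bound to the given session registry (illustrative)."""

    async def cleanup_session_middleware(request, call_next):
        try:
            return await call_next(request)
        finally:
            if session:
                try:
                    session.remove()
                except Exception:
                    # Record the cleanup failure without replacing the response.
                    log.warning("Session cleanup failed", exc_info=True)

    return cleanup_session_middleware


async def ok_handler(request):
    """Pretend endpoint that always succeeds."""
    return "200 OK"


middleware = make_cleanup_middleware(DeadSession())
result = asyncio.run(middleware("GET /auth/fab/v1/login", ok_handler))
print(result)
```

The trade-off against `suppress(Exception)` is a few more lines in exchange for a log record when cleanup fails.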
   
   ### How to reproduce
   
   1. Deploy Airflow 3.1.6 with `providers-fab==3.3.0` using MySQL (Aurora 
MySQL) as metadata DB
   2. Configure SQLAlchemy with `pool_pre_ping=true` and `pool_recycle=60`
   3. Have api-server running with multiple replicas
   4. Wait for a MySQL connection in the SQLAlchemy pool to become stale 
(connection closed server-side due to timeout, network issue, or Aurora 
maintenance)
   5. Send a request to the api-server (e.g., login via `/auth/fab/v1/login`) 
that triggers `cleanup_session_middleware`
   6. The stale connection causes `Session.remove()` → `ROLLBACK` → 
`MySQLdb.OperationalError: (2006, 'Server has gone away')` → 500 error
   
   **Note**: This is timing-dependent and occurs intermittently. In our production environment, it appeared on only 1 of the 2 api-server pods. The issue is likely more visible with MySQL than with PostgreSQL, because the MySQL driver does not retry automatically after a `Server has gone away` error.
   
   ### Anything else
   
   **Context — this is a follow-up to PR #61480:**
   
   PR [#61480](https://github.com/apache/airflow/pull/61480) correctly 
addressed the root cause of `PendingRollbackError` (issue 
[#59349](https://github.com/apache/airflow/issues/59349)) by adding 
`cleanup_session_middleware` to ensure `Session.remove()` runs after every 
request. However, the `finally` block assumes `Session.remove()` always 
succeeds. When the DB connection is already dead, the cleanup itself fails and 
turns a successful request into a 500 error.
   
   **Impact:**
   - Intermittent 500 errors on api-server login/UI pages
   - Self-recovers on retry (next request gets a fresh connection from the pool)
   - In our case: 2 occurrences over several days, both on the same pod
   
   **Related issues and PRs:**
   - [#59349](https://github.com/apache/airflow/issues/59349): Original `PendingRollbackError` issue that motivated PR #61480
   - [#61480](https://github.com/apache/airflow/pull/61480): PR that introduced `cleanup_session_middleware`
   - [#62153](https://github.com/apache/airflow/pull/62153): PR that established the `suppress(Exception)` pattern for session cleanup in `deserialize_user` (same class of problem, different code path)
   - [#57470](https://github.com/apache/airflow/issues/57470), [#57859](https://github.com/apache/airflow/issues/57859): Earlier reports of the same session lifecycle problem
   
   **Environment evidence:**
   - `pool_pre_ping=true` is enabled, so SQLAlchemy validates each connection when it is checked out of the pool. `Session.remove()` does not trigger this check, because it issues the `ROLLBACK` on a connection the session already holds
   - MySQL `wait_timeout=28800` (8h) and `pool_recycle=60` should prevent most 
stale connections, but edge cases (Aurora failover, network blips) can still 
cause disconnections
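
Why pre-ping does not help here can be illustrated with a toy model. This is not SQLAlchemy code: `ToyConnection` and `ToyPool` are hypothetical classes that only mimic the relevant behaviour, namely that the liveness check runs at checkout and not on later operations.

```python
class ToyConnection:
    """Hypothetical connection whose server side can vanish at any time."""

    def __init__(self):
        self.alive = True

    def ping(self):
        return self.alive

    def rollback(self):
        if not self.alive:
            raise ConnectionError("(2006, 'Server has gone away')")


class ToyPool:
    """Toy pool that pre-pings only at checkout, like pool_pre_ping."""

    def __init__(self):
        self._idle = [ToyConnection()]

    def checkout(self):
        conn = self._idle.pop() if self._idle else ToyConnection()
        if not conn.ping():  # pre-ping: replace a dead connection
            conn = ToyConnection()
        return conn


pool = ToyPool()
conn = pool.checkout()  # ping succeeds, connection is handed to the session
conn.alive = False      # server closes the connection mid-request

try:
    conn.rollback()     # what closing the session does; no ping happens here
    outcome = "rollback ok"
except ConnectionError as exc:
    outcome = f"rollback failed: {exc}"

print(outcome)
```

A connection that dies between checkout and cleanup is never re-validated, which matches the failure seen in the traceback.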
   
   <details>
   <summary>Full error traceback</summary>
   
   ```
   2026-02-20T05:50:24.526091553Z [error    ] Exception in ASGI application 
[airflow.providers.fab.auth_manager.fab_auth_manager] 
loc=fab_auth_manager.py:243
   Traceback (most recent call last):
     File 
"/home/airflow/.local/lib/python3.12/site-packages/uvicorn/protocols/http/h11_impl.py",
 line 406, in run_asgi
       result = await app(self.scope, self.receive, self.send)
     File 
"/home/airflow/.local/lib/python3.12/site-packages/uvicorn/middleware/proxy_headers.py",
 line 60, in __call__
       return await self.app(scope, receive, send)
     File 
"/home/airflow/.local/lib/python3.12/site-packages/starlette/middleware/base.py",
 line 101, in __call__
       response = await self.dispatch_func(request, call_next)
     File 
"/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/fab/auth_manager/fab_auth_manager.py",
 line 243, in cleanup_session_middleware
       settings.Session.remove()
     File 
"/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/orm/scoping.py", 
line 246, in remove
       self.registry().close()
     File 
"/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", 
line 2081, in close
       self._close_impl(invalidate=False)
     File 
"/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", 
line 2124, in _close_impl
       self.rollback()
     File 
"/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", 
line 1982, in rollback
       self._transaction.rollback(_to_root=True)
     File 
"/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", 
line 1040, in rollback
       self._connection_rollback(self._connections[transaction])
     File 
"/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", 
line 1092, in _connection_rollback
       connection.rollback()
     File 
"/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", 
line 1065, in rollback
       self._transaction.rollback()
     File 
"/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", 
line 1768, in rollback
       self.connection._rollback_impl()
     File 
"/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", 
line 902, in _rollback_impl
       self._handle_dbapi_exception(e, None, None, None, None)
     File 
"/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", 
line 2240, in _handle_dbapi_exception
       raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
     File 
"/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", 
line 899, in _rollback_impl
       self.connection.dbapi_connection.rollback()
     File 
"/home/airflow/.local/lib/python3.12/site-packages/MySQLdb/connections.py", 
line 272, in rollback
       self.query("ROLLBACK")
     File 
"/home/airflow/.local/lib/python3.12/site-packages/MySQLdb/connections.py", 
line 260, in query
       _mysql.connection.query(self, query)
   MySQLdb.OperationalError: (2006, 'Server has gone away')
   ```
   
   </details>
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

