8silvergun opened a new issue, #62335:
URL: https://github.com/apache/airflow/issues/62335
### Apache Airflow Provider(s)
fab
### Versions of Apache Airflow Providers
```
apache-airflow-providers-fab==3.3.0
apache-airflow-providers-common-sql==1.28.2
apache-airflow-providers-mysql==6.3.4
apache-airflow-providers-cncf-kubernetes==10.11.0
apache-airflow-providers-celery==3.13.0
apache-airflow-providers-standard==1.10.0
```
### Apache Airflow version
3.1.6
### Operating System
Debian 12 (bookworm) — official Airflow Docker image
### Deployment
Official Apache Airflow Helm Chart
### Deployment details
- **Kubernetes**: Amazon EKS
- **Metadata DB**: Amazon Aurora MySQL (MySQL 8.0 compatible)
- **MySQL `wait_timeout`**: 28800 seconds (8 hours)
- **SQLAlchemy pool config**: `pool_recycle=60`, `pool_pre_ping=true`,
`pool_size=3`, `max_overflow=2`
- **api-server replicas**: 2 pods
- **Airflow image**: Custom image based on `apache/airflow:3.1.6` with
`providers-fab==3.3.0`
### What happened
The `cleanup_session_middleware` introduced in PR #61480 (included in
`providers-fab 3.3.0`) calls `Session.remove()` in a bare `finally` block
without any error handling. When the underlying MySQL connection has been
closed server-side (e.g., due to timeout, network interruption, or Aurora
failover), `Session.remove()` internally attempts a `ROLLBACK` on the dead
connection, which raises `MySQLdb.OperationalError: (2006, 'Server has gone
away')`.
This unhandled exception propagates up as a **500 Internal Server Error** to
the client, even though the original request may have completed successfully.
**Error log from api-server pod:**
```
2026-02-20T05:50:24.526091553Z [error ] Exception in ASGI application [airflow.providers.fab.auth_manager.fab_auth_manager] loc=fab_auth_manager.py:243
Traceback (most recent call last):
  File ".../uvicorn/protocols/http/h11_impl.py", line 406, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  ...
  File ".../airflow/providers/fab/auth_manager/fab_auth_manager.py", line 243, in cleanup_session_middleware
    settings.Session.remove()
  File ".../sqlalchemy/orm/scoping.py", line 246, in remove
    self.registry().close()
  File ".../sqlalchemy/orm/session.py", line 2081, in close
    self._close_impl(invalidate=False)
  File ".../sqlalchemy/orm/session.py", line 2124, in _close_impl
    self.rollback()
  ...
  File ".../MySQLdb/connections.py", line 260, in query
    _mysql.connection.query(self, query)
MySQLdb.OperationalError: (2006, 'Server has gone away')
```
**Relevant source code** (`fab_auth_manager.py`, lines 235-243):
```python
async def cleanup_session_middleware(request, call_next):
    try:
        response = await call_next(request)
        return response
    finally:
        from airflow import settings
        if settings.Session:
            settings.Session.remove()  # <-- unhandled exception here
```
The `finally` block does not catch exceptions from `Session.remove()`. Since
this is a cleanup operation, any failure here should be logged and suppressed —
not propagated to the client.
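
The reason a successful request turns into an error is ordinary Python behaviour: an exception raised inside `finally` replaces a value that `try` was already returning. A minimal sketch of that behaviour, where `failing_cleanup`, `handle_request` and `call_handler` are hypothetical stand-ins and not Airflow code:

```python
def failing_cleanup():
    # Hypothetical stand-in for Session.remove() hitting a dead connection.
    raise RuntimeError("Server has gone away")


def handle_request():
    try:
        return "200 OK"  # the request itself succeeded
    finally:
        failing_cleanup()  # raising here discards the pending return value


def call_handler():
    # Plays the role of the ASGI server: an unhandled exception becomes a 500.
    try:
        return handle_request()
    except RuntimeError as exc:
        return f"500: {exc}"


print(call_handler())  # prints "500: Server has gone away"
```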
### What you think should happen instead
`Session.remove()` in the `finally` block should be wrapped with
`suppress(Exception)` to gracefully handle database connection errors during
cleanup. The cleanup middleware's purpose is to prevent stale sessions — if
cleanup itself fails because the connection is already dead, that's not an
error that should affect the HTTP response.
**Suggested fix:**
```python
async def cleanup_session_middleware(request, call_next):
    try:
        response = await call_next(request)
        return response
    finally:
        from airflow import settings
        if settings.Session:
            with suppress(Exception):
                settings.Session.remove()
```
This is consistent with the `suppress(Exception)` pattern already used in
`deserialize_user` (PR [#62153](https://github.com/apache/airflow/pull/62153),
merged 2026-02-19) for identical session cleanup error handling. The `from
contextlib import suppress` import already exists in the file.
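
If the cleanup failure should also be logged, as suggested under "What happened", a `try`/`except` with a log call is one possible alternative to `suppress(Exception)`. The sketch below runs without Airflow or a database. `DeadSession`, `fake_call_next`, the extra `session` parameter and the log message are assumptions made for illustration; the real middleware takes only `request` and `call_next` and reads `settings.Session`.

```python
import asyncio
import logging

log = logging.getLogger(__name__)


class DeadSession:
    """Hypothetical stand-in for settings.Session bound to a dead connection."""

    def remove(self):
        raise ConnectionError("(2006, 'Server has gone away')")


async def cleanup_session_middleware(request, call_next, session=DeadSession()):
    # `session` is injected here only so the sketch is self-contained.
    try:
        return await call_next(request)
    finally:
        try:
            session.remove()
        except Exception:
            # A cleanup failure is recorded but must not alter the response.
            log.warning("Session cleanup failed; ignoring", exc_info=True)


async def fake_call_next(request):
    # Hypothetical downstream handler that always succeeds.
    return "200 OK"


result = asyncio.run(cleanup_session_middleware("GET /", fake_call_next))
print(result)  # prints "200 OK"
```

Compared with `suppress(Exception)`, this keeps a trace of how often cleanup fails, at the cost of a few more lines.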
### How to reproduce
1. Deploy Airflow 3.1.6 with `providers-fab==3.3.0` using MySQL (Aurora
MySQL) as metadata DB
2. Configure SQLAlchemy with `pool_pre_ping=true` and `pool_recycle=60`
3. Have api-server running with multiple replicas
4. Wait for a MySQL connection in the SQLAlchemy pool to become stale
(connection closed server-side due to timeout, network issue, or Aurora
maintenance)
5. Send a request to the api-server (e.g., login via `/auth/fab/v1/login`)
that triggers `cleanup_session_middleware`
6. The stale connection causes `Session.remove()` → `ROLLBACK` →
`MySQLdb.OperationalError: (2006, 'Server has gone away')` → 500 error
**Note**: This is timing-dependent and occurs intermittently. In our
production environment, it appeared on 1 of 2 api-server pods. The issue is
more likely to manifest with MySQL than PostgreSQL, since MySQL's `Server has
gone away` error has no automatic retry at the driver level.
### Anything else
**Context — this is a follow-up to PR #61480:**
PR [#61480](https://github.com/apache/airflow/pull/61480) correctly
addressed the root cause of `PendingRollbackError` (issue
[#59349](https://github.com/apache/airflow/issues/59349)) by adding
`cleanup_session_middleware` to ensure `Session.remove()` runs after every
request. However, the `finally` block assumes `Session.remove()` always
succeeds. When the DB connection is already dead, the cleanup itself fails and
turns a successful request into a 500 error.
**Impact:**
- Intermittent 500 errors on api-server login/UI pages
- Self-recovers on retry (next request gets a fresh connection from the pool)
- In our case: 2 occurrences over several days, both on the same pod
**Related issues and PRs:**
- [#59349](https://github.com/apache/airflow/issues/59349): Original
`PendingRollbackError` issue that motivated PR #61480
- [#61480](https://github.com/apache/airflow/pull/61480): PR that introduced
`cleanup_session_middleware`
- [#62153](https://github.com/apache/airflow/pull/62153): PR that
established the `suppress(Exception)` pattern for session cleanup in
`deserialize_user` (same class of problem, different code path)
- [#57470](https://github.com/apache/airflow/issues/57470),
[#57859](https://github.com/apache/airflow/issues/57859): Earlier reports of
the same session lifecycle problem
**Environment evidence:**
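- `pool_pre_ping=true` is enabled, so SQLAlchemy validates each connection
when it is checked out of the pool. That check does not cover
`Session.remove()`, which issues its `ROLLBACK` on the connection the session
already holds, so a connection that dies after checkout is never re-validated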
- MySQL `wait_timeout=28800` (8h) and `pool_recycle=60` should prevent most
stale connections, but edge cases (Aurora failover, network blips) can still
cause disconnections
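
The gap between checkout-time validation and cleanup can be illustrated with a toy model. This is not SQLAlchemy's implementation: `ToyConnection` and `ToyPool` are hypothetical classes that only mimic the idea of pinging a connection at checkout.

```python
class ToyConnection:
    def __init__(self):
        self.alive = True

    def ping(self):
        return self.alive

    def rollback(self):
        if not self.alive:
            raise ConnectionError("(2006, 'Server has gone away')")


class ToyPool:
    """Validates connections at checkout only, in the spirit of pre-ping."""

    def __init__(self):
        self._idle = [ToyConnection()]

    def checkout(self):
        conn = self._idle.pop() if self._idle else ToyConnection()
        if not conn.ping():  # pre-ping: replace a dead idle connection
            conn = ToyConnection()
        return conn


pool = ToyPool()
conn = pool.checkout()  # the ping passes here
conn.alive = False      # server drops the connection mid-request (e.g. failover)

try:
    conn.rollback()     # cleanup reuses the checked-out connection; no ping runs
    cleanup_error = None
except ConnectionError as exc:
    cleanup_error = str(exc)

print(cleanup_error)  # prints "(2006, 'Server has gone away')"
```

In this model a connection that is already dead while idle is replaced at checkout, whereas one that dies while held by a session fails on its next statement, which in the reported case is the `ROLLBACK` issued during cleanup.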
<details>
<summary>Full error traceback</summary>
```
2026-02-20T05:50:24.526091553Z [error ] Exception in ASGI application [airflow.providers.fab.auth_manager.fab_auth_manager] loc=fab_auth_manager.py:243
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.12/site-packages/uvicorn/protocols/http/h11_impl.py", line 406, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/home/airflow/.local/lib/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
  File "/home/airflow/.local/lib/python3.12/site-packages/starlette/middleware/base.py", line 101, in __call__
    response = await self.dispatch_func(request, call_next)
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/fab/auth_manager/fab_auth_manager.py", line 243, in cleanup_session_middleware
    settings.Session.remove()
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/orm/scoping.py", line 246, in remove
    self.registry().close()
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 2081, in close
    self._close_impl(invalidate=False)
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 2124, in _close_impl
    self.rollback()
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 1982, in rollback
    self._transaction.rollback(_to_root=True)
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 1040, in rollback
    self._connection_rollback(self._connections[transaction])
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 1092, in _connection_rollback
    connection.rollback()
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1065, in rollback
    self._transaction.rollback()
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1768, in rollback
    self.connection._rollback_impl()
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 902, in _rollback_impl
    self._handle_dbapi_exception(e, None, None, None, None)
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 2240, in _handle_dbapi_exception
    raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 899, in _rollback_impl
    self.connection.dbapi_connection.rollback()
  File "/home/airflow/.local/lib/python3.12/site-packages/MySQLdb/connections.py", line 272, in rollback
    self.query("ROLLBACK")
  File "/home/airflow/.local/lib/python3.12/site-packages/MySQLdb/connections.py", line 260, in query
    _mysql.connection.query(self, query)
MySQLdb.OperationalError: (2006, 'Server has gone away')
```
</details>
### Are you willing to submit PR?
- [x] Yes I am willing to submit a PR!
### Code of Conduct
- [x] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)