antonlin1 opened a new issue, #66493:
URL: https://github.com/apache/airflow/issues/66493
### Apache Airflow version
3.2.0 (introduced in commit b3306f15cd, "AIP-84: Add JWT token revokation
for logout invalidation", PR #61339 / #47952)
### What happened?
Every authenticated API request now performs a synchronous DB query inside
the FastAPI auth dependency:
```python
# airflow-core/src/airflow/api_fastapi/auth/managers/base_auth_manager.py:153
if (jti := payload.get("jti")) and RevokedToken.is_revoked(jti):
    raise InvalidTokenError("Token has been revoked")
```
`RevokedToken.is_revoked`
([revoked_token.py:58-61](https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/models/revoked_token.py#L58-L61))
runs `session.scalar(...)` via `@provide_session`, holding a SQLAlchemy
connection per in-flight request. With the default pool of `5+10=15` shared
across api-server, scheduler, dag-processor, and triggerer, modest concurrent
load (UI multi-endpoint polling, fan-out DAGs) exhausts the pool and request
handlers time out in `QueuePool._do_get` after 30s.
The UI freezes once a few task instances start running because every poll
request blocks on connection checkout. Stacktrace from a stock 3.2.0 standalone:
```
  File "airflow/api_fastapi/auth/managers/base_auth_manager.py", line 153, in get_user_from_token
    if (jti := payload.get("jti")) and RevokedToken.is_revoked(jti):
  File "airflow/utils/session.py", line 100, in wrapper
    return func(*args, session=session, **kwargs)
  File "airflow/models/revoked_token.py", line 61, in is_revoked
    return bool(session.scalar(select(exists().where(cls.jti == jti))))
  ...
  File "sqlalchemy/pool/impl.py", line 166, in _do_get
    raise exc.TimeoutError(
sqlalchemy.exc.TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30.00
```
`is_revoked` was the last Airflow frame in the failing stack on every endpoint we hit:
`/ui/config`, `/ui/backfills`, `/api/v2/dags/.../details`,
`/api/v2/dags/.../dagRuns/...`. The observed `duration_us` values are multiples
of 30s (60s, 90s, 120s, 150s) because FastAPI resolves the auth dependency
multiple times in the same handler, and each checkout times out at 30s
independently.
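The checkout behaviour described above can be illustrated with a toy model. This is only a sketch: `TinyPool` and `simulate` are hypothetical names, a semaphore stands in for SQLAlchemy's `QueuePool`, and the timings are scaled down from 30s so it runs quickly.

```python
import threading
import time


class TinyPool:
    """Toy stand-in for a bounded connection pool (not SQLAlchemy's QueuePool)."""

    def __init__(self, size: int, timeout: float) -> None:
        self._slots = threading.BoundedSemaphore(size)
        self._timeout = timeout

    def checkout(self) -> None:
        # Block until a slot is free, or give up once the timeout elapses.
        if not self._slots.acquire(timeout=self._timeout):
            raise TimeoutError("pool exhausted")

    def checkin(self) -> None:
        self._slots.release()


def simulate(requests: int, pool_size: int, hold: float, timeout: float) -> int:
    """Run `requests` concurrent handlers; return how many timed out at checkout."""
    pool = TinyPool(pool_size, timeout)
    failures = 0
    lock = threading.Lock()

    def handler() -> None:
        nonlocal failures
        try:
            pool.checkout()
        except TimeoutError:
            with lock:
                failures += 1
            return
        try:
            time.sleep(hold)  # stands in for the time the connection is held
        finally:
            pool.checkin()

    threads = [threading.Thread(target=handler) for _ in range(requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return failures
```

With more concurrent handlers than slots and a hold time longer than the checkout timeout, every handler beyond the pool size fails at checkout, which is the shape of the failure in the stack trace above.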
### What you think should happen instead?
A cache in front of `is_revoked` would hit ≈100% of the time in practice,
because revocation only happens on explicit logout. The check should not
require a DB roundtrip on every request. An in-process TTL cache (with bounded
staleness across uvicorn workers) would collapse the per-request DB roundtrip
into a near-free in-memory lookup.
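A minimal sketch of such a cache, using only the standard library. It is not the proposed patch: `RevocationCache`, its `lookup` callable, and the 30-second default TTL are assumptions made for illustration, and a real implementation would also need a size bound (for example via `cachetools.TTLCache`, which takes `maxsize` and `ttl`).

```python
import threading
import time
from typing import Callable


class RevocationCache:
    """Hypothetical per-process TTL cache in front of a revocation lookup."""

    def __init__(
        self,
        lookup: Callable[[str], bool],
        ttl_seconds: float = 30.0,  # assumed value, not taken from Airflow
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lookup = lookup
        self._ttl = ttl_seconds
        self._clock = clock
        # jti -> (revoked, expiry time). Unbounded here; bound it in real code.
        self._entries: dict[str, tuple[bool, float]] = {}
        self._lock = threading.Lock()

    def is_revoked(self, jti: str) -> bool:
        now = self._clock()
        with self._lock:
            entry = self._entries.get(jti)
            if entry is not None and entry[1] > now:
                return entry[0]  # fresh entry: no DB roundtrip
        # Miss or expired: query outside the lock so slow lookups do not
        # serialize other requests.
        revoked = self._lookup(jti)
        with self._lock:
            self._entries[jti] = (revoked, now + self._ttl)
        return revoked
```

The trade-off is the bounded staleness mentioned above: a token revoked after its "not revoked" result was cached stays accepted in that worker until the entry expires, so the TTL is the upper bound on how long a logout can take to propagate.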
### How to reproduce
Stock 3.2.0 with the default config, except for a deliberately small connection pool (3+2) to make the failure deterministic:
```bash
pip install apache-airflow==3.2.0
AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_SIZE=3 \
AIRFLOW__DATABASE__SQL_ALCHEMY_MAX_OVERFLOW=2 \
airflow standalone &
# Get a JWT (admin password is in ~/airflow/simple_auth_manager_passwords.json.generated):
JWT=$(curl -sf -X POST http://localhost:8080/auth/token \
-H 'Content-Type: application/json' \
-d '{"username":"admin","password":"<PWD>"}' \
| python3 -c 'import json,sys;print(json.load(sys.stdin)["access_token"])')
# 60 concurrent requests against an authenticated DB-backed endpoint:
seq 60 | xargs -P 30 -I {} curl -sS -o /dev/null \
-w "%{http_code} %{time_total}s\n" \
-H "Authorization: Bearer $JWT" \
http://localhost:8080/api/v2/dags
```
Observed: ~12% of requests return 500 with `QueuePool TimeoutError`, p50
latency ~15s, p99 ~30s.
### Operating System
macOS 24.4.0 (also reproduces on Linux per the standard SQLAlchemy/uvicorn
pool dynamics).
### Versions of Apache Airflow Providers
N/A — affects core auth path.
### Deployment
Standalone (issue is independent of deployment topology — affects any
uvicorn-driven api-server with a shared SQLAlchemy pool).
### Deployment details
- Default `[database]` pool config (5+10) reproduces with sufficient
concurrent request volume; deliberately small pool (3+2) makes it deterministic.
- Backed by SQLite (default standalone) but the failure is at the
pool-checkout level, not SQLite-specific.
### Anything else?
- Bisected to commit b3306f15cd (PR #47952 / #61339, "AIP-84: Add JWT token
revokation for logout invalidation"). Pre-3.2 the auth path did no DB work.
- Pull request with proposed fix coming next (in-process
`cachetools.TTLCache` mirroring the existing `DBDagBag` cache pattern in
`airflow/models/dagbag.py`). Revocation is rare, so the cache hit rate is
≈100% in practice. Local before/after with the fix at the same `pool_size=3,
max_overflow=2`, 60 requests at 30 concurrent: wall time 31s → 1.04s,
timeouts 12% → 0%, p99 30s → 0.6s.
### Are you willing to submit PR?
- [x] Yes I am willing to submit a PR!
### Code of Conduct
- [x] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)