pcorliss opened a new pull request, #68100:
URL: https://github.com/apache/airflow/pull/68100
Fix periodic 500 due to idle_in_transaction timeout in FAB auth manager
When we upgraded to Airflow 3.1.8 (apache-airflow-providers-fab 3.4.0)
we noticed periodic 500s being returned when connecting to certain API
endpoints or browsing the GUI. We traced this back to a SQL error on
the api-server.
```
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) server closed
the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
```
(Subsequent requests reusing the same `scoped_session` then cascade into
`sqlalchemy.exc.PendingRollbackError: Can't reconnect until invalid
transaction is rolled back` until the session is removed.)
On investigation we found that SQL queries were being left open in an
`idle_in_transaction` state and would time out due to our PostgreSQL
settings. In our environment we have our PostgreSQL idle in transaction
timeout set to 5 minutes (`idle_in_transaction_session_timeout = '5min'`
in `postgresql.conf`). The PostgreSQL default is `0`
("never time out"), so deployments using the default will not see this
bug; you can confirm the live value with `SHOW
idle_in_transaction_session_timeout;`.
We traced this back to this change:
- https://github.com/apache/airflow/pull/60274
We think the removal of create_session() is what removed the commit/rollback
handler.
We also noticed several follow-up fixes and tried upgrading to Airflow
3.2.2 (apache-airflow-providers-fab 3.6.4), but the errors continued
because none of them fix the root cause (the open transaction left
behind on a worker thread):
- https://github.com/apache/airflow/pull/61480
- https://github.com/apache/airflow/pull/62153
- https://github.com/apache/airflow/pull/61943
- https://github.com/apache/airflow/pull/62336
- https://github.com/apache/airflow/pull/62919
- closes: https://github.com/apache/airflow/issues/62402
### Repro Steps
#### Deployed Instance
0. Set PostgreSQL idle_in_transaction_session_timeout - `ALTER SYSTEM SET
idle_in_transaction_session_timeout = '10s'; SELECT pg_reload_conf();`
1. Clear all existing cookies or use an incognito window.
2. Login via `/auth/login`.
3. Navigate to `/api/v2/version` — expected 200 response. An
`idle_in_transaction` query is opened in the database (visible via
`SELECT pid, state, now()-state_change AS idle_for, LEFT(query, 80) FROM
pg_stat_activity WHERE state='idle in transaction';`).
4. Take no action for at least the duration of PostgreSQL's
`idle_in_transaction_session_timeout`.
5. Navigate to `/api/v2/version` again — expected 500 response with
`OperationalError` / `PendingRollbackError` in the api-server logs.
#### Breeze Script
Local breeze configuration in
`files/airflow-breeze-config/environment_variables.env`:
```
AIRFLOW__FAB__CACHE_TTL=0
AIRFLOW__API__WORKERS=1
```
`AIRFLOW__FAB__CACHE_TTL=0` disables the TTL cache so every request
exercises the cache-miss path. `AIRFLOW__API__WORKERS=1` keeps a single
gunicorn worker so the same anyio worker thread holds the leaked
connection across requests.
Start breeze with PostgreSQL and shorten the idle-in-transaction
timeout so the bug is observable in seconds:
```bash
breeze start-airflow --backend postgres \
--postgres-version 16 \
--integration none \
--load-default-connections \
--load-example-dags
# In a separate shell, on the host:
docker exec -u postgres breeze-postgres-1 \
psql -d airflow -c \
"ALTER SYSTEM SET idle_in_transaction_session_timeout = '10s'; \
SELECT pg_reload_conf();"
```
Then run the minimal repro shell script
```bash
#!/usr/bin/env bash
# Reproduce the FAB deserialize_user idle-in-transaction leak.
set -euo pipefail
API="${API:-http://localhost:28080}"
USERNAME="${USERNAME:-admin}"
PASSWORD="${PASSWORD:-admin}"
WAIT_SECONDS="${WAIT_SECONDS:-15}"
JAR=$(mktemp)
trap 'rm -f "$JAR"' EXIT
step() { printf "\n\033[1;34m== %s ==\033[0m\n" "$*"; }
step "1. Authenticate via /auth/login/"
LOGIN_HTML=$(curl -fsS -c "$JAR" -b "$JAR" "$API/auth/login/")
CSRF=$(printf '%s' "$LOGIN_HTML" \
| grep -oE 'name="csrf_token"[^>]*value="[^"]*"' \
| head -1 | sed -E 's/.*value="([^"]*)".*/\1/')
curl -fsS -o /dev/null -c "$JAR" -b "$JAR" -X POST "$API/auth/login/" \
-H 'Content-Type: application/x-www-form-urlencoded' \
--data-urlencode "username=$USERNAME" \
--data-urlencode "password=$PASSWORD" \
--data-urlencode "csrf_token=$CSRF"
step "2. Warm GET /api/v2/version (leaks idle-in-tx scoped_session on worker
thread)"
curl -fsS -o /dev/null -w 'HTTP %{http_code}\n' -b "$JAR"
"$API/api/v2/version"
step "3. Confirm idle-in-transaction connection holding the FAB User JOIN"
docker exec breeze-postgres-1 psql -U postgres -d airflow -c \
"SELECT pid, state, now()-state_change AS idle_for, LEFT(query, 80) \
FROM pg_stat_activity \
WHERE datname='airflow' AND state='idle in transaction';"
step "4. Sleep ${WAIT_SECONDS}s (> idle_in_transaction_session_timeout)"
sleep "$WAIT_SECONDS"
step "5. Probe GET /api/v2/version — pre-fix: 500; post-fix: 200"
HTTP=$(curl -s -o /tmp/repro-resp.txt -w '%{http_code}' -b "$JAR"
"$API/api/v2/version")
echo "HTTP $HTTP"
if [[ "$HTTP" == "500" ]]; then
echo -e "\n\033[1;31mLEAK REPRODUCED\033[0m"
exit 1
else
echo -e "\n\033[1;32mNO LEAK\033[0m"
exit 0
fi
```
Pre-fix the script exits non-zero with HTTP 500 on step 5. Post-fix it
exits 0 with HTTP 200 and no `idle in transaction` row visible at
step 3.
---
##### Was generative AI tooling used to co-author this PR?
- [X] Yes — Claude Code (Opus 4.7)
Generated-by: Claude Code (Opus 4.7) following [the
guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#gen-ai-assisted-contributions)
---
* Read the **[Pull Request
Guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#pull-request-guidelines)**
for more information. Note: commit author/co-author name and email in commits
become permanently public when merged.
* For fundamental code changes, an Airflow Improvement Proposal
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals))
is needed.
* When adding dependency, check compliance with the [ASF 3rd Party License
Policy](https://www.apache.org/legal/resolved.html#category-x).
* For significant user-facing changes create newsfragment:
`{pr_number}.significant.rst`, in
[airflow-core/newsfragments](https://github.com/apache/airflow/tree/main/airflow-core/newsfragments).
You can add this file in a follow-up commit after the PR is created so you
know the PR number.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]