anandsummer28 commented on issue #57470:
URL: https://github.com/apache/airflow/issues/57470#issuecomment-3909028849

   > Is there a way to reproduce the error instead of waiting to reappear after 
few days ?
   
   Posting this in case it helps others hit by the same error. We saw 
`PendingRollbackError` on login/UI refresh/logout with **Airflow 3.1.1** (API 
server). Below: how we reproduced it and what fixed it.
   
   **Environment:** Airflow 3.1.1, API server on Kubernetes (same behavior 
should apply to non-K8s if you can set pool recycle and terminate DB 
connections).
   
   ---
   
   ## How to reproduce
   
   1. **Shorten the DB pool recycle** so connections age quickly (replace 
`default` with your namespace if needed):
      ```bash
      kubectl set env deployment/airflow-api-server \
        AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_RECYCLE=30 \
        -n default
      ```
      Wait for the rollout to finish.
   
   2. **Generate load and "idle in transaction" connections** by calling the 
auth token API in a loop (replace `YOUR_AIRFLOW_URL` with your API base URL):
      ```bash
      AIRFLOW_URL="YOUR_AIRFLOW_URL"   # e.g. https://your-elb.amazonaws.com
      for i in $(seq 1 80); do
        CODE=$(curl -s -o /dev/null -w "%{http_code}" -X POST 
"$AIRFLOW_URL/auth/token" \
          -H "Content-Type: application/json" \
          -d '{"username": "admin", "password": "YOURPASSWORD"}')
        echo "$i $CODE"
        sleep 0.5
      done
      ```
      You’ll typically see `201` for most requests.
   
   3. **Connect to the Airflow metadata database** (Postgres) and list 
connections:
      ```sql
      SELECT pid, usename, application_name, state, state_change
      FROM pg_stat_activity
      WHERE datname = 'airflow';
      ```
      You should see some `idle` and/or `idle in transaction` rows from the app.
   
   4. **Terminate both idle and idle-in-transaction backends** (this leaves 
your current psql session alone):
      ```sql
      SELECT pg_terminate_backend(pid)
      FROM pg_stat_activity
      WHERE datname = 'airflow'
        AND state IN ('idle', 'idle in transaction')
        AND pid <> pg_backend_pid();
      ```
   
   5. **Trigger the error:**
      - Run the token loop again and/or  
      - In the Airflow UI (while already logged in), **refresh the page** or 
click **Logout**.
   
      You should see **500 Internal Server Error** and in the API server logs: 
`PendingRollbackError` and stack traces in `airflow.providers.fab.www.app` / 
`security_manager` / `get_user_by_id`. The UI can stay broken until the pod is 
restarted.
   
   **Note:** Terminating only `idle in transaction` was not enough for us; 
terminating **both** `idle` and `idle in transaction` made the failure 
reproducible.
   
   ---
   
   ## Fix that worked: FAB 3.0.2 + pool settings
   
   1. **Pin the FAB provider** to `apache-airflow-providers-fab==3.0.2` and 
ensure it is used at API server (and webserver, if used) startup.
   2. **Set DB pool options:**  
      `AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_PRE_PING=True` and  
      `AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_RECYCLE=1800`
   
   After applying this, the same reproduction steps either no longer hit the 
error or, when a 500 occurred, the UI recovered on the next request instead of 
staying stuck.
   
   **Example for the official Airflow Helm chart** (override `apiServer.args` / 
`apiServer.env` and same for `webserver` if you use it):
   
   ```yaml
   airflow:
     config:
       fab:
         enable_proxy_fix: "True"
         session_lifetime_minutes: "1440"
   
     apiServer:
       args:
         - bash
         - "-c"
         - |
           pip install --no-cache-dir "apache-airflow-providers-fab==3.0.2" \
           && exec airflow api-server
       env:
         - name: AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_PRE_PING
           value: "True"
         - name: AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_RECYCLE
           value: "1800"
   
     webserver:   # if you use the webserver
       args:
         - bash
         - "-c"
         - |
           pip install --no-cache-dir "apache-airflow-providers-fab==3.0.2" \
           && exec airflow webserver
       env:
         - name: AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_PRE_PING
           value: "True"
         - name: AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_RECYCLE
           value: "1800"
   ```
   
   The session-cleanup fixes for this issue are in 
**apache-airflow-providers-fab 3.1.2+**. If problems persist on 3.0.2, try 
`>=3.1.2` (e.g. `3.1.2` or `3.2.0`) with the same pool settings.
   
   ---
   
   ## Summary
   
   | Step | Action |
   |------|--------|
   | **Reproduce** | Set `POOL_RECYCLE=30`, hammer `/auth/token`, then in 
Postgres terminate **idle** and **idle in transaction** backends, then refresh 
UI or logout. |
   | **Fix** | Use `apache-airflow-providers-fab==3.0.2` (or `>=3.1.2`) at 
startup + `PRE_PING=True` and `POOL_RECYCLE=1800`. |
   
   Hope this helps anyone else hitting the same error.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to