rusackas opened a new pull request, #40234:
URL: https://github.com/apache/superset/pull/40234

   ### SUMMARY
   
   The E2E (Playwright + Cypress) workflows on master have a 12% flake rate (6 of the last 50 runs), and the failures share the same signature:
   
   ```
   [testAssets] Failed to cleanup dashboard 13: Error: Missing CSRF token
     (Error: apiRequestContext.get: socket hang up
   ✘  Dashboard Export › should download ZIP and show success toast when 
clicking Export YAML
     Error: page.goto: net::ERR_CONNECTION_REFUSED at http://localhost:8081/...
     TimeoutError: page.waitForFunction: Timeout 15000ms exceeded
     Error: page.goto: net::ERR_ABORTED; maybe frame was detached?
   ```
   
   The trigger is always one of the heavy dashboard ZIP import/export tests:
   - `playwright/tests/dashboard/export.spec.ts:61` — *Dashboard Export › 
should download ZIP and show success toast when clicking Export YAML*
   - `playwright/tests/dashboard/dashboard-list.spec.ts:266` — *import 
dashboard › should import a dashboard from a zip file*
   
   Both `cypress-run-all` and `playwright-run` in [`.github/workflows/bashlib.sh`](https://github.com/apache/superset/blob/master/.github/workflows/bashlib.sh) start the backend with `flask run --no-debugger -p $port`. The Flask development server runs as a single process with no crash recovery, so when these tests overwhelm it the backend stays dead for the rest of the suite, and every subsequent test cascades into `ECONNREFUSED`.
   
   This PR switches both runners to **gunicorn** with the same shape used in 
[`docker/entrypoints/run-server.sh`](https://github.com/apache/superset/blob/master/docker/entrypoints/run-server.sh):
   
   - `--workers 4 --worker-class gthread --threads 20` — real concurrency, 
matching what production runs.
   - `--timeout 120` — kill stuck workers instead of letting them hang the 
entire suite.
   - `--max-requests 500 --max-requests-jitter 50` — recycle workers 
periodically so memory accumulation doesn't OOM the process.
   - `--access-logfile - --error-logfile -` — keeps the existing per-run log 
capture pattern.
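
As a rough illustration of how the flags above fit together, the sketch below prints the resulting command line rather than running it. The helper name `gunicorn_cmd`, the bind address, and the app target `superset.app:create_app()` are assumptions made for the example; the actual change in `bashlib.sh` may differ.

```shell
#!/usr/bin/env bash
# Sketch only: prints a gunicorn invocation with the flags listed above.
# ASSUMPTIONS (not taken from the PR diff): the helper name, the bind address,
# and the app target "superset.app:create_app()".
gunicorn_cmd() {
  local port="${1:-8081}"
  echo gunicorn \
    --bind "127.0.0.1:${port}" \
    --workers 4 --worker-class gthread --threads 20 \
    --timeout 120 \
    --max-requests 500 --max-requests-jitter 50 \
    --access-logfile - --error-logfile - \
    "superset.app:create_app()"
}

# Show the command for the port seen in the failure logs (8081).
gunicorn_cmd 8081
```

Printing the command first makes it easy to review the flags in a CI log before swapping `echo` out for a real launch.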
   
   ### BEFORE/AFTER
   
   | | Before | After |
   |---|---|---|
   | Backend | `flask run --no-debugger` (single-process dev server) | `gunicorn` with 4 workers × 20 threads |
   | Crash recovery | None (one worker = the only worker) | Workers auto-recycle on crash + every 500 requests |
   | Timeout | None (hangs indefinitely) | 120s per worker |
   | Documented purpose | "Development only"; Flask docs explicitly say not to use it in production | Production-grade WSGI server |
   
   ### TESTING INSTRUCTIONS
   
   This PR is a CI-only change. To validate:
   
   1. Push a commit (this PR's CI run will exercise it on a real GitHub runner).
   2. Compare the failure rate of the `E2E / playwright-tests` and 
`cypress-matrix` jobs against the baseline (~12% on master right now).
   3. The cleanup `[testAssets] Failed to cleanup` warnings should largely 
disappear, since the backend stays alive throughout each test file.
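
For step 2, one way to compute a comparable failure rate is to count job conclusions over a window of runs. The sketch below is only an illustration: `flake_rate` and `sample` are hypothetical helpers, and the sample data simply reproduces the 6-of-50 baseline quoted above.

```shell
#!/usr/bin/env bash
# Sketch only: reads one run conclusion per line on stdin ("success",
# "failure", ...) and prints failed/total plus the failure percentage.
flake_rate() {
  awk '{ total++ }
       $1 == "failure" { failed++ }
       END { if (total) printf "%d/%d = %.0f%%\n", failed, total, 100 * failed / total }'
}

# Hypothetical sample matching the baseline: 6 failures out of 50 runs.
sample() {
  i=0; while [ "$i" -lt 6 ]; do echo failure; i=$((i + 1)); done
  i=0; while [ "$i" -lt 44 ]; do echo success; i=$((i + 1)); done
}

sample | flake_rate   # prints: 6/50 = 12%

# With the GitHub CLI installed, real data could be piped in instead
# (the workflow name is a placeholder, not taken from the PR):
#   gh run list --workflow <workflow> --branch master --limit 50 \
#     --json conclusion --jq '.[].conclusion' | flake_rate
```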
   
   ### ADDITIONAL INFORMATION
   
   - [ ] Has associated issue
   - [ ] Required feature flags
   - [ ] Changes UI
   - [ ] Includes DB Migration
   - [ ] Introduces new feature or API
   - [ ] Removes existing feature or API
   
   Only frontend (JS) coverage is captured in E2E (verified — `bashlib.sh` only 
instruments JS assets via `build-instrumented-assets`), so multi-worker 
gunicorn doesn't break the existing coverage path.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


