rusackas opened a new pull request, #40234:
URL: https://github.com/apache/superset/pull/40234
### SUMMARY
The E2E (Playwright + Cypress) workflows on master have a 12% flake rate (6
of the last 50 runs), and every failure looks the same:
```
[testAssets] Failed to cleanup dashboard 13: Error: Missing CSRF token (Error: apiRequestContext.get: socket hang up
✘ Dashboard Export › should download ZIP and show success toast when clicking Export YAML
Error: page.goto: net::ERR_CONNECTION_REFUSED at http://localhost:8081/...
TimeoutError: page.waitForFunction: Timeout 15000ms exceeded
Error: page.goto: net::ERR_ABORTED; maybe frame was detached?
```
The trigger is always one of the heavy dashboard ZIP import/export tests:
- `playwright/tests/dashboard/export.spec.ts:61` — *Dashboard Export ›
should download ZIP and show success toast when clicking Export YAML*
- `playwright/tests/dashboard/dashboard-list.spec.ts:266` — *import
dashboard › should import a dashboard from a zip file*
Both `cypress-run-all` and `playwright-run` in
[`.github/workflows/bashlib.sh`](https://github.com/apache/superset/blob/master/.github/workflows/bashlib.sh)
start the backend with `flask run --no-debugger -p $port`. The Flask
development server runs as a single process with nothing to restart it after
a crash, so when these tests overwhelm it, the backend stays dead for the rest
of the suite and every test afterward cascades into `ECONNREFUSED`.
This PR switches both runners to **gunicorn** with the same shape used in
[`docker/entrypoints/run-server.sh`](https://github.com/apache/superset/blob/master/docker/entrypoints/run-server.sh):
- `--workers 4 --worker-class gthread --threads 20` — real concurrency,
matching what production runs.
- `--timeout 120` — kill stuck workers instead of letting them hang the
entire suite.
- `--max-requests 500 --max-requests-jitter 50` — recycle workers
periodically so memory accumulation doesn't OOM the process.
- `--access-logfile - --error-logfile -` — keeps the existing per-run log
capture pattern.
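
As a rough illustration of how the flags above fit together, here is a minimal sketch of the launch line. It is not the actual `bashlib.sh` code: the helper name `gunicorn_cmd`, the `127.0.0.1` bind address, and the `superset.app:create_app()` app path are assumptions made for the example. The helper only prints the command, so the sketch runs without gunicorn installed.

```shell
#!/bin/sh
# Hypothetical helper (not the real bashlib.sh code): prints the gunicorn
# command line for the given port, using the flags listed in this PR.
# Assumptions: the bind address and the 'superset.app:create_app()' app path.
gunicorn_cmd() {
  port="$1"
  echo gunicorn \
    --bind "127.0.0.1:${port}" \
    --workers 4 \
    --worker-class gthread \
    --threads 20 \
    --timeout 120 \
    --max-requests 500 \
    --max-requests-jitter 50 \
    --access-logfile - \
    --error-logfile - \
    "'superset.app:create_app()'"
}

# Print the command for the port seen in the failure logs above.
gunicorn_cmd 8081
```

In the workflow itself the same arguments would presumably be passed to `gunicorn` directly, in the background, with stdout and stderr redirected to the per-run log file.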
### BEFORE/AFTER
| | Before | After |
|---|---|---|
| Backend | `flask run --no-debugger` (single-process dev server) | `gunicorn` with 4 workers × 20 threads |
| Crash recovery | None (one worker = the only worker) | Workers auto-recycle on crash + every 500 requests |
| Timeout | None (hangs indefinitely) | 120s per worker |
| Documented purpose | "development only": Flask docs explicitly say not to use it in production | Production-grade WSGI server |
### TESTING INSTRUCTIONS
This PR is a CI-only change. To validate:
1. Push a commit (this PR's CI run will exercise it on a real GitHub runner).
2. Compare the failure rate of the `E2E / playwright-tests` and
`cypress-matrix` jobs against the baseline (~12% on master right now).
3. The cleanup `[testAssets] Failed to cleanup` warnings should largely
disappear, since the backend stays alive throughout each test file.
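
For step 2, the comparison against the baseline can be scripted. The sketch below is one possible approach, not part of this PR: the helper name `failure_rate` is hypothetical, and the workflow name `E2E` in the comment is an assumption. It reads one run conclusion per line and prints the share that failed; the demo input reproduces the 6-of-50 baseline quoted above.

```shell
#!/bin/sh
# Hypothetical helper: reads one run conclusion per line on stdin
# (e.g. "success" or "failure") and prints the failure percentage.
failure_rate() {
  awk '
    { total++ }
    $0 == "failure" { failed++ }
    END {
      if (total == 0) { print "no runs"; exit 1 }
      printf "%d/%d failed (%.0f%%)\n", failed, total, 100 * failed / total
    }
  '
}

# Real data could come from the GitHub CLI (workflow name "E2E" is assumed):
#   gh run list --workflow E2E --branch master --limit 50 \
#     --json conclusion --jq '.[].conclusion' | failure_rate

# Demo with synthetic input matching the baseline: 44 passes, 6 failures.
{ yes success | head -n 44; yes failure | head -n 6; } | failure_rate
```

Running the same pipeline on this PR's branch after a few dozen runs gives a number directly comparable to the ~12% baseline.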
### ADDITIONAL INFORMATION
- [ ] Has associated issue
- [ ] Required feature flags
- [ ] Changes UI
- [ ] Includes DB Migration
- [ ] Introduces new feature or API
- [ ] Removes existing feature or API
Only frontend (JS) coverage is captured in E2E (verified — `bashlib.sh` only
instruments JS assets via `build-instrumented-assets`), so multi-worker
gunicorn doesn't break the existing coverage path.
🤖 Generated with [Claude Code](https://claude.com/claude-code)