mikebridge commented on PR #39859: URL: https://github.com/apache/superset/pull/39859#issuecomment-4400676572
## Migration runtime — empirical numbers
Following up on the maintenance-window concern: I set up a stress-test data
generator (`scripts/seed_junction_load.py` in this PR) to populate the four
largest association tables at varying scales, then timed the migration's
downgrade + upgrade at each scale. Numbers below are from MySQL 8 in a local
Docker container — production Postgres on dedicated hardware should be at
least as fast, likely faster.
### Data
| `dashboard_slices` | Total junction rows | Downgrade | Upgrade | Per-row
cost (upgrade) |
|---|---|---|---|---|
| 100K | ~111K | 2s | 1s | ~9 µs |
| 1M | ~1.11M | 11s | 8s | ~7.2 µs |
| 5M | ~5.55M | 53s | 45s | ~8.1 µs |
| 10M | ~11.1M | 1m 51s | 1m 37s | ~8.7 µs |
The "Total junction rows" column adds the other three seeded tables
(`slice_user`, `dashboard_user`, `dashboard_roles`) at proportionally smaller
targets.
### Scaling
| Row factor | Downgrade factor | Upgrade factor |
|---|---|---|
| 100K → 1M (10×) | 5.5× | 8× |
| 1M → 5M (5×) | 4.8× | 5.6× |
| 5M → 10M (2×) | 2.1× | 2.16× |
**Linear scaling at production-relevant sizes.** The 100K→1M ratio is
sublinear because Alembic/Flask startup cost dominates at small N; from 1M
onward, data-proportional work dominates and the time ratio matches the row
ratio almost exactly. No memory cliff observed up through 10M (no `work_mem` /
buffer-pool spill, no `O(N²)` regime).
Per-row cost stabilises around **~8–9 µs per junction row** in the upgrade
direction. The dominant cost component is the composite-PK index build;
downgrade is 10–15% slower than upgrade due to the MySQL-specific FK
drop/re-add overhead.
### Extrapolation to larger deployments
| `dashboard_slices` rows | Predicted upgrade time |
|---|---|
| 50M | ~7–8 min |
| 100M | ~14–15 min |
| 500M | ~70–75 min |
### Caveats
- Numbers are MySQL on Docker on macOS — real production Postgres on
dedicated hardware will likely be faster (better disk I/O, larger buffer pool,
no Docker overhead). Take these as a **pessimistic upper bound**.
- We tested proportional smaller sizes for the other three tables. If one
of those is much larger in a given deployment (e.g., `slice_user` at 50M
because of a multi-team ownership pattern), add ~9 µs per row of that table to
the estimate.
- No memory cliff observed through 10M, but very large N (100M+) on a
deployment with constrained `work_mem` could introduce a step-change. The
diagnostic queries in `UPDATING.md` (per-table size, lock-window estimate) help
size the window for an actual deployment.
### Reproducing locally
The seed script is in this PR and is backend-agnostic (works on Postgres /
MySQL / SQLite). To benchmark against your own deployment shape:
```bash
docker exec <superset-container> /app/.venv/bin/python
/app/scripts/seed_junction_load.py \
--dashboard-slices 5000000 --slice-user 500000 --dashboard-user 500000
--dashboard-roles 50000
time docker exec <superset-container> superset db downgrade 33d7e0e21daa
time docker exec <superset-container> superset db upgrade
```
Idempotent — re-running with a higher target adds rows up to the new total.
See the script header for full options.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
