Abdulrehman-PIAIC80387 opened a new pull request, #40839:
URL: https://github.com/apache/superset/pull/40839
### SUMMARY
Fixes #39342.
The `/api/v1/datasource/<type>/<id>/column/<column>/values/` endpoint had
**no caching**. When a dashboard has multiple native filters backed by the same
virtual dataset, each filter fires its own HTTP request to this endpoint, and
each request re-executes the wrapping query:
```sql
SELECT DISTINCT <column>
FROM (<virtual_dataset_SQL>) -- heavy CTE re-runs every time
```
If the virtual dataset has joins or aggregations that take a few seconds,
every additional filter multiplied the cost — even a simple page reload re-ran
every filter query from scratch.
This PR adds a result cache around the call to
`datasource.values_for_column(...)`, mirroring the exact pattern already in use
at `superset/datasource/api.py:417` for the `get_compatible_metrics` endpoint.
### What this delivers
- ✅ Repeated requests for the same column → cache hit (no DB query)
- ✅ Dashboard reload → all filter values cached → instant
- ✅ Two users on the same dashboard with same RLS → second user gets cached
values
- ✅ Editing the virtual dataset SQL auto-busts the cache (via `changed_on`
in the cache key)
- ✅ `?force=true` query param bypasses the cache
- ✅ `X-Cache-Status: HIT` / `MISS` response header for operator debugging
- ✅ Warning log when a single column produces a payload larger than
`FILTER_VALUES_CACHE_WARN_THRESHOLD` (defaults to 100k rows) — early signal for
cache-memory pressure
### Cache key
```python
cache_key = "col_values:" + sha256({
"uid": datasource.uid, # which dataset
"col": column_name, # which column
"limit": row_limit, # respects FILTER_SELECT_ROW_LIMIT
"denorm": denormalize_column, # respects per-dataset
normalize_columns
"user": get_user_id(), # RLS isolation — see notes
"changed_on": str(datasource.changed_on), # auto-bust on dataset edit
})
```
**Security note:** Row-Level Security filters are per-user, so the `user`
field in the cache key is required to prevent user A's RLS-restricted values
from leaking to user B.
**Correctness note:** `changed_on` busts the cache when someone edits the
underlying virtual dataset SQL — no manual invalidation needed.
### TTL
Uses the same logic as the existing `get_compatible_metrics` endpoint at
api.py:417:
```python
timeout = datasource.cache_timeout or
app.config.get("CACHE_DEFAULT_TIMEOUT", 300)
```
Per-dataset `cache_timeout` overrides; otherwise the configured default (5
minutes).
### Performance impact
The performance claim is **theoretical** — based on cache-hit avoidance of
the wrapping query, not benchmarked against a production-scale virtual dataset.
The functional correctness is validated by tests; the operational impact will
depend on the deployment's filter cardinality and query cost.
### What this does NOT deliver (and why)
- **A single dashboard load with N filters on N _different_ columns still
issues N database queries.** This PR helps repeat-hits across loads and
per-user dashboard reloads, but a fresh first-load with N unique columns still
hits the database N times. The architectural fix for the "N filters → 1 query"
case requires a new bulk endpoint that builds one CTE-based query for all
columns _plus_ frontend coordination to call the new endpoint. Lays the
groundwork for that follow-up.
- **No cache-stampede protection.** Concurrent first-load requests for the
same key can each miss the cache and hit the DB. Adding this correctly across
cache backends (Redis vs Memcached vs SimpleCache) is non-trivial; left as a
follow-up.
### TESTING INSTRUCTIONS
```bash
pytest tests/integration_tests/datasource/api_tests.py -v
```
Six tests cover the cache behavior:
| Test | Asserts |
|---|---|
| `test_get_column_values_cache_hit_skips_query` | Two identical requests →
`values_for_column` called exactly once |
| `test_get_column_values_force_bypasses_cache` | `?force=true` re-runs the
query even with a hot cache entry |
| `test_get_column_values_cache_isolated_per_column` | Two different columns
→ no cross-column leak |
| `test_get_column_values_cache_isolated_per_user` | Two users → each
populates own entry (RLS safety) |
| `test_get_column_values_cache_busts_on_changed_on` | Bumping `changed_on`
invalidates the cache key |
| `test_get_column_values_response_advertises_cache_status` |
`X-Cache-Status` header reports MISS then HIT |
All 14 pre-existing endpoint tests still pass — no regression.
Manual verification:
1. Configure a virtual dataset with a heavy SQL (joins/CTEs)
2. Add 3+ native filters on different columns of that dataset to a dashboard
3. Load the dashboard, then reload it
4. Inspect `/api/v1/datasource/.../values/` requests in browser devtools —
second-load responses should carry `X-Cache-Status: HIT`
5. Hit any endpoint with `?force=true` — response should carry
`X-Cache-Status: MISS`
### ADDITIONAL INFORMATION
- [x] Has associated issue: #39342
- [ ] Required feature flags:
- [ ] Changes UI
- [ ] Includes DB Migration (please mention any potential downtime)
- [ ] Migration is atomic, supports rollback & is backwards-compatible
- [x] Confirm DB migration upgrade and downgrade tested
- [x] Runtime introduces no new dependencies / changes existing dependencies
- [ ] Introduces new feature or API
- [x] Removes existing feature or API
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]