[PR] perf(filters): cache distinct column values to avoid re-running virtual dataset query per filter [superset]

via GitHub Sun, 07 Jun 2026 22:55:11 -0700


Abdulrehman-PIAIC80387 opened a new pull request, #40839:
URL: https://github.com/apache/superset/pull/40839


   ### SUMMARY
   
   Fixes #39342.
   
   The `/api/v1/datasource/<type>/<id>/column/<column>/values/` endpoint had 
**no caching**. When a dashboard has multiple native filters backed by the same 
virtual dataset, each filter fires its own HTTP request to this endpoint, and 
each request re-executes the wrapping query:
   
   ```sql
   SELECT DISTINCT <column>
   FROM (<virtual_dataset_SQL>)  -- heavy CTE re-runs every time
   ```
   
   If the virtual dataset has joins or aggregations that take a few seconds, 
every additional filter multiplied the cost — even a simple page reload re-ran 
every filter query from scratch.
   
   This PR adds a result cache around the call to 
`datasource.values_for_column(...)`, mirroring the exact pattern already in use 
at `superset/datasource/api.py:417` for the `get_compatible_metrics` endpoint.
   
   ### What this delivers
   
   - ✅ Repeated requests for the same column → cache hit (no DB query)
   - ✅ Dashboard reload → all filter values cached → instant
   - ✅ Two users on the same dashboard with same RLS → second user gets cached 
values
   - ✅ Editing the virtual dataset SQL auto-busts the cache (via `changed_on` 
in the cache key)
   - ✅ `?force=true` query param bypasses the cache
   - ✅ `X-Cache-Status: HIT` / `MISS` response header for operator debugging
   - ✅ Warning log when a single column produces a payload larger than 
`FILTER_VALUES_CACHE_WARN_THRESHOLD` (defaults to 100k rows) — early signal for 
cache-memory pressure
   
   ### Cache key
   
   ```python
   cache_key = "col_values:" + sha256({
       "uid": datasource.uid,             # which dataset
       "col": column_name,                # which column
       "limit": row_limit,                # respects FILTER_SELECT_ROW_LIMIT
       "denorm": denormalize_column,      # respects per-dataset 
normalize_columns
       "user": get_user_id(),             # RLS isolation — see notes
       "changed_on": str(datasource.changed_on),  # auto-bust on dataset edit
   })
   ```
   
   **Security note:** Row-Level Security filters are per-user, so the `user` 
field in the cache key is required to prevent user A's RLS-restricted values 
from leaking to user B.
   
   **Correctness note:** `changed_on` busts the cache when someone edits the 
underlying virtual dataset SQL — no manual invalidation needed.
   
   ### TTL
   
   Uses the same logic as the existing `get_compatible_metrics` endpoint at 
api.py:417:
   
   ```python
   timeout = datasource.cache_timeout or 
app.config.get("CACHE_DEFAULT_TIMEOUT", 300)
   ```
   
   Per-dataset `cache_timeout` overrides; otherwise the configured default (5 
minutes).
   
   ### Performance impact
   
   The performance claim is **theoretical** — based on cache-hit avoidance of 
the wrapping query, not benchmarked against a production-scale virtual dataset. 
The functional correctness is validated by tests; the operational impact will 
depend on the deployment's filter cardinality and query cost.
   
   ### What this does NOT deliver (and why)
   
   - **A single dashboard load with N filters on N _different_ columns still 
issues N database queries.** This PR helps repeat-hits across loads and 
per-user dashboard reloads, but a fresh first-load with N unique columns still 
hits the database N times. The architectural fix for the "N filters → 1 query" 
case requires a new bulk endpoint that builds one CTE-based query for all 
columns _plus_ frontend coordination to call the new endpoint. Lays the 
groundwork for that follow-up.
   - **No cache-stampede protection.** Concurrent first-load requests for the 
same key can each miss the cache and hit the DB. Adding this correctly across 
cache backends (Redis vs Memcached vs SimpleCache) is non-trivial; left as a 
follow-up.
   
   ### TESTING INSTRUCTIONS
   
   ```bash
   pytest tests/integration_tests/datasource/api_tests.py -v
   ```
   
   Six tests cover the cache behavior:
   
   | Test | Asserts |
   |---|---|
   | `test_get_column_values_cache_hit_skips_query` | Two identical requests → 
`values_for_column` called exactly once |
   | `test_get_column_values_force_bypasses_cache` | `?force=true` re-runs the 
query even with a hot cache entry |
   | `test_get_column_values_cache_isolated_per_column` | Two different columns 
→ no cross-column leak |
   | `test_get_column_values_cache_isolated_per_user` | Two users → each 
populates own entry (RLS safety) |
   | `test_get_column_values_cache_busts_on_changed_on` | Bumping `changed_on` 
invalidates the cache key |
   | `test_get_column_values_response_advertises_cache_status` | 
`X-Cache-Status` header reports MISS then HIT |
   
   All 14 pre-existing endpoint tests still pass — no regression.
   
   Manual verification:
   
   1. Configure a virtual dataset with a heavy SQL (joins/CTEs)
   2. Add 3+ native filters on different columns of that dataset to a dashboard
   3. Load the dashboard, then reload it
   4. Inspect `/api/v1/datasource/.../values/` requests in browser devtools — 
second-load responses should carry `X-Cache-Status: HIT`
   5. Hit any endpoint with `?force=true` — response should carry 
`X-Cache-Status: MISS`
   
   ### ADDITIONAL INFORMATION
   
   - [x] Has associated issue: #39342
   - [ ] Required feature flags:
   - [ ] Changes UI
   - [ ] Includes DB Migration (please mention any potential downtime)
   - [ ] Migration is atomic, supports rollback & is backwards-compatible
   - [x] Confirm DB migration upgrade and downgrade tested
   - [x] Runtime introduces no new dependencies / changes existing dependencies
   - [ ] Introduces new feature or API
   - [x] Removes existing feature or API


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] perf(filters): cache distinct column values to avoid re-running virtual dataset query per filter [superset]

Reply via email to