Re: [PR] perf(filters): cache distinct column values to avoid re-running virtual dataset query per filter [superset]

via GitHub Sun, 07 Jun 2026 23:08:49 -0700


codeant-ai-for-open-source[bot] commented on code in PR #40839:
URL: https://github.com/apache/superset/pull/40839#discussion_r3371118934



##########
superset/datasource/api.py:
##########
@@ -124,13 +131,48 @@ def get_column_values(
 
         row_limit = apply_max_row_limit(app.config["FILTER_SELECT_ROW_LIMIT"])
         denormalize_column = not datasource.normalize_columns
+
+        # Cache distinct column-value results so a dashboard with many filters
+        # backed by the same (often heavy) virtual dataset doesn't re-execute
+        # the wrapping query per filter (#39342). The cache key includes the
+        # user id so RLS-filtered datasources can't leak values across users,
+        # and the dataset's ``changed_on`` so an edit to the underlying SQL
+        # busts cached entries on the next request.
+        force = parse_boolean_string(request.args.get("force"))
+        cache_key = (
+            "col_values:"
+            + hashlib.sha256(
+                json.dumps(
+                    {
+                        "uid": datasource.uid,
+                        "col": column_name,
+                        "limit": row_limit,
+                        "denorm": denormalize_column,
+                        "user": get_user_id(),
+                        "changed_on": str(getattr(datasource, "changed_on", 
"")),

Review Comment:
   **Suggestion:** The cache key only varies by dataset, column, limit, 
denormalization flag, user id, and dataset `changed_on`, but it does not vary 
when row-level-security policies themselves change. If an RLS rule is 
tightened/changed for the same user, stale cached values can still be returned 
until TTL expiry, leaking values that should now be filtered out. Include an 
RLS policy fingerprint (or latest RLS update timestamp/role signature) in the 
key, or bypass this cache when RLS filters are active. [security]
   
   <details>
   <summary><b>Severity Level:</b> Major ⚠️</summary>
   
   ```mdx
   - ❌ Column-values endpoint serves data with outdated RLS filters.
   - ❌ Users see values no longer allowed by tightened RLS.
   - ⚠️ Dashboard native filters leak stale distinct dimension members.
   ```
   </details>
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Configure a dataset with row-level security filters using the RLS REST 
API at
   `superset/row_level_security/api.py:64-76`, which persists 
`RowLevelSecurityFilter`
   entities defined in `superset/connectors/sqla/models.py:2234-2313` without 
modifying any
   `SqlaTable` `changed_on` field.
   
   2. Log in as a user subject to those RLS filters and issue `GET
   /api/v1/datasource/table/<id>/column/<column>/values/`, which hits
   `DatasourceRestApi.get_column_values` in `superset/datasource/api.py:52-67` 
and calls
   `datasource.values_for_column(...)` (implemented in
   `superset/models/helpers.py:2784-2849`), where 
`get_sqla_row_level_filters()` is applied
   at `helpers.py:51-53` to enforce RLS in the generated SQL.
   
   3. Observe that on this first call the code at 
`superset/datasource/api.py:141-157`
   computes a cache key from `{"uid": datasource.uid, "col": column_name, 
"limit": row_limit,
   "denorm": denormalize_column, "user": get_user_id(), "changed_on": 
str(getattr(datasource,
   "changed_on", ""))}` and, after executing the query, stores the RLS-filtered 
payload via
   `cache_manager.data_cache.set(cache_key, payload, timeout=timeout)` at 
`api.py:203-207`.
   
   4. While the cache TTL (defaulting to `CACHE_DEFAULT_TIMEOUT` 300s) is still 
active,
   tighten or otherwise change the RLS rule for this dataset via `RLSRestApi` 
(which updates
   `RowLevelSecurityFilter.changed_on` but not `datasource.changed_on`), then 
call the same
   column-values endpoint again as the same user; because `get_column_values` 
only keys on
   `uid`, `col`, `limit`, `denorm`, `user`, and `datasource.changed_on` (and 
does not include
   any RLS state), the code at `superset/datasource/api.py:159-168` returns the 
cached
   payload and never re-runs `values_for_column()`, so the response still 
contains values
   allowed under the old RLS policy but now disallowed, until the cache entry 
expires.
   ```
   </details>
   
   [Fix in 
Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=710b77e2d7574c21b84778eebe960c6f&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
 | [Fix in VSCode 
Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=710b77e2d7574c21b84778eebe960c6f&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
   
   *(Use Cmd/Ctrl + Click for best experience)*
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/datasource/api.py
   **Line:** 151:152
   **Comment:**
        *Security: The cache key only varies by dataset, column, limit, 
denormalization flag, user id, and dataset `changed_on`, but it does not vary 
when row-level-security policies themselves change. If an RLS rule is 
tightened/changed for the same user, stale cached values can still be returned 
until TTL expiry, leaking values that should now be filtered out. Include an 
RLS policy fingerprint (or latest RLS update timestamp/role signature) in the 
key, or bypass this cache when RLS filters are active.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   Once fix is implemented, also check other comments on the same PR, and ask 
user if the user wants to fix the rest of the comments as well. if said yes, 
then fetch all the comments validate the correctness and implement a minimal fix
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40839&comment_hash=0ff7b4987462cb957ecedd8485ddbf827e6f7b92a46cf5f05ef28227899cba71&reaction=like'>👍</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40839&comment_hash=0ff7b4987462cb957ecedd8485ddbf827e6f7b92a46cf5f05ef28227899cba71&reaction=dislike'>👎</a>



##########
superset/datasource/api.py:
##########
@@ -124,13 +131,48 @@ def get_column_values(
 
         row_limit = apply_max_row_limit(app.config["FILTER_SELECT_ROW_LIMIT"])
         denormalize_column = not datasource.normalize_columns
+
+        # Cache distinct column-value results so a dashboard with many filters
+        # backed by the same (often heavy) virtual dataset doesn't re-execute
+        # the wrapping query per filter (#39342). The cache key includes the
+        # user id so RLS-filtered datasources can't leak values across users,
+        # and the dataset's ``changed_on`` so an edit to the underlying SQL
+        # busts cached entries on the next request.
+        force = parse_boolean_string(request.args.get("force"))
+        cache_key = (
+            "col_values:"
+            + hashlib.sha256(
+                json.dumps(
+                    {
+                        "uid": datasource.uid,
+                        "col": column_name,
+                        "limit": row_limit,
+                        "denorm": denormalize_column,
+                        "user": get_user_id(),
+                        "changed_on": str(getattr(datasource, "changed_on", 
"")),
+                    },
+                    sort_keys=True,
+                ).encode()
+            ).hexdigest()
+        )
+
+        if (
+            not force
+            and (cached := cache_manager.data_cache.get(cache_key)) is not None
+        ):
+            logger.debug(
+                "column-values cache HIT: uid=%s col=%s", datasource.uid, 
column_name
+            )
+            response = self.response(200, result=cached)
+            response.headers["X-Cache-Status"] = "HIT"
+            return response
+
         try:
             payload = datasource.values_for_column(
                 column_name=column_name,
                 limit=row_limit,
                 denormalize_column=denormalize_column,
             )

Review Comment:
   **Suggestion:** This is a check-then-act cache pattern without 
synchronization, so concurrent identical requests can all miss before the first 
write and each execute the expensive query. On dashboard load (multiple filters 
firing in parallel), this causes a thundering-herd race and defeats the 
intended performance improvement. Add a per-key lock/single-flight mechanism 
around the miss path so only one request computes and others wait/read the 
populated value. [race condition]
   
   <details>
   <summary><b>Severity Level:</b> Major ⚠️</summary>
   
   ```mdx
   - ⚠️ Heavy virtual datasets re-execute filter queries on cache cold-start.
   - ⚠️ Dashboard filter loads generate avoidable duplicate database queries.
   - ⚠️ Parallel filter requests increase load and response latency.
   ```
   </details>
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Deploy Superset with a virtual dataset whose SQL is expensive 
(joins/aggregations) and
   expose it via a dashboard; native filter controls for this dataset issue 
HTTP requests to
   `GET /api/v1/datasource/<type>/<id>/column/<column>/values/` as exercised in
   `tests/integration_tests/datasource/api_tests.py:64-75`.
   
   2. Start Superset with multiple worker processes/threads so that concurrent 
browser filter
   requests are handled in parallel; ensure the column-values cache is cold for 
a given
   `{datasource, column, user}` combination (no existing `col_values:` entry).
   
   3. From multiple browser sessions (or a load test), trigger simultaneous 
requests for the
   same dataset and column (same user), causing multiple `get_column_values` 
invocations at
   `superset/datasource/api.py:52-67` with identical `cache_key` computed at
   `api.py:141-157`.
   
   4. Because the cache logic at `superset/datasource/api.py:159-168` simply 
performs a
   `cache_manager.data_cache.get(cache_key)` and checks for `None` without any 
locking or
   single-flight mechanism, all concurrent requests see a cache miss, fall 
through to the
   `try:` block at `api.py:170-175`, and each calls 
`datasource.values_for_column(...)`
   (implemented in `superset/models/helpers.py:2784-2849`), resulting in the 
expensive SQL
   query being executed once per concurrent request instead of once, defeating 
the intended
   performance improvement for the first miss.
   ```
   </details>
   
   [Fix in 
Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=2fc51c483f9d4149ae17a7628eb97d39&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
 | [Fix in VSCode 
Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=2fc51c483f9d4149ae17a7628eb97d39&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
   
   *(Use Cmd/Ctrl + Click for best experience)*
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/datasource/api.py
   **Line:** 159:175
   **Comment:**
        *Race Condition: This is a check-then-act cache pattern without 
synchronization, so concurrent identical requests can all miss before the first 
write and each execute the expensive query. On dashboard load (multiple filters 
firing in parallel), this causes a thundering-herd race and defeats the 
intended performance improvement. Add a per-key lock/single-flight mechanism 
around the miss path so only one request computes and others wait/read the 
populated value.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   Once fix is implemented, also check other comments on the same PR, and ask 
user if the user wants to fix the rest of the comments as well. if said yes, 
then fetch all the comments validate the correctness and implement a minimal fix
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40839&comment_hash=a700916e844b86a821c28c0c4a1089b45e98eaf7e4b0db508cf4fa24f05cee14&reaction=like'>👍</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40839&comment_hash=a700916e844b86a821c28c0c4a1089b45e98eaf7e4b0db508cf4fa24f05cee14&reaction=dislike'>👎</a>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] perf(filters): cache distinct column values to avoid re-running virtual dataset query per filter [superset]

Reply via email to