Re: [PR] perf(filters): cache distinct column values to avoid re-running virtual dataset query per filter [superset]

via GitHub Sun, 07 Jun 2026 23:00:55 -0700


codeant-ai-for-open-source[bot] commented on code in PR #40839:
URL: https://github.com/apache/superset/pull/40839#discussion_r3371090985



##########
superset/datasource/api.py:
##########
@@ -124,13 +131,48 @@ def get_column_values(
 
         row_limit = apply_max_row_limit(app.config["FILTER_SELECT_ROW_LIMIT"])
         denormalize_column = not datasource.normalize_columns
+
+        # Cache distinct column-value results so a dashboard with many filters
+        # backed by the same (often heavy) virtual dataset doesn't re-execute
+        # the wrapping query per filter (#39342). The cache key includes the
+        # user id so RLS-filtered datasources can't leak values across users,
+        # and the dataset's ``changed_on`` so an edit to the underlying SQL
+        # busts cached entries on the next request.
+        force = parse_boolean_string(request.args.get("force"))
+        cache_key = (
+            "col_values:"
+            + hashlib.sha256(
+                json.dumps(
+                    {
+                        "uid": datasource.uid,
+                        "col": column_name,
+                        "limit": row_limit,
+                        "denorm": denormalize_column,
+                        "user": get_user_id(),
+                        "changed_on": str(getattr(datasource, "changed_on", "")),
+                    },

Review Comment:
   **🔴 Architect Review — CRITICAL**
   
   The column-values cache key relies on get_user_id() for RLS isolation, but 
guest users have no numeric id. Every guest session therefore serializes to 
"user": null in the key, even when the guest-token RLS rules differ, so values 
cached for one embedded session can be served to another session with 
different RLS.
   
   **Suggestion:** Incorporate the effective RLS context into the cache key 
(for example via security_manager.get_rls_cache_key(datasource) and/or 
guest-token-specific components) instead of relying solely on get_user_id(), so 
guest users with different rls_rules do not share cache entries.
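
A minimal sketch of the suggested key, assuming the effective RLS context is available as a list of strings (for example, whatever security_manager.get_rls_cache_key(datasource) returns). The helper name `column_values_cache_key` and its parameters are hypothetical and are not part of the PR; the payload fields mirror the ones in the diff above, with one extra "rls" entry.

```python
import hashlib
import json
from typing import Any, Optional, Sequence


def column_values_cache_key(
    datasource_uid: str,
    column_name: str,
    row_limit: int,
    denormalize_column: bool,
    user_id: Optional[int],
    changed_on: Any,
    rls_key: Sequence[str],
) -> str:
    """Build a cache key that also varies with the effective RLS context.

    ``rls_key`` is an assumption: a list of strings describing the RLS rules
    that apply to the current request (regular or guest user).
    """
    payload = {
        "uid": datasource_uid,
        "col": column_name,
        "limit": row_limit,
        "denorm": denormalize_column,
        # None for guest users, which is why "user" alone is not enough.
        "user": user_id,
        "changed_on": str(changed_on),
        # Sorted so the same set of rules always hashes to the same key.
        "rls": sorted(rls_key),
    }
    digest = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode("utf-8")
    ).hexdigest()
    return "col_values:" + digest


# Two guest sessions: same dataset and column, no user id, different RLS.
guest_a = column_values_cache_key(
    "12__table", "region", 1000, False, None, "2026-06-01", ["region = 'EU'"]
)
guest_b = column_values_cache_key(
    "12__table", "region", 1000, False, None, "2026-06-01", ["region = 'US'"]
)
```

With the "rls" entry in the payload, `guest_a` and `guest_b` differ even though both sessions have a user id of None, so one session's cached values are not returned to the other.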
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

