codeant-ai-for-open-source[bot] commented on code in PR #40839:
URL: https://github.com/apache/superset/pull/40839#discussion_r3371118934
##########
superset/datasource/api.py:
##########
@@ -124,13 +131,48 @@ def get_column_values(
row_limit = apply_max_row_limit(app.config["FILTER_SELECT_ROW_LIMIT"])
denormalize_column = not datasource.normalize_columns
+
+ # Cache distinct column-value results so a dashboard with many filters
+ # backed by the same (often heavy) virtual dataset doesn't re-execute
+ # the wrapping query per filter (#39342). The cache key includes the
+ # user id so RLS-filtered datasources can't leak values across users,
+ # and the dataset's ``changed_on`` so an edit to the underlying SQL
+ # busts cached entries on the next request.
+ force = parse_boolean_string(request.args.get("force"))
+ cache_key = (
+ "col_values:"
+ + hashlib.sha256(
+ json.dumps(
+ {
+ "uid": datasource.uid,
+ "col": column_name,
+ "limit": row_limit,
+ "denorm": denormalize_column,
+ "user": get_user_id(),
+ "changed_on": str(getattr(datasource, "changed_on",
"")),
Review Comment:
**Suggestion:** The cache key only varies by dataset, column, limit,
denormalization flag, user id, and dataset `changed_on`, but it does not vary
when row-level-security policies themselves change. If an RLS rule is
tightened/changed for the same user, stale cached values can still be returned
until TTL expiry, leaking values that should now be filtered out. Include an
RLS policy fingerprint (or latest RLS update timestamp/role signature) in the
key, or bypass this cache when RLS filters are active. [security]
<details>
<summary><b>Severity Level:</b> Major ⚠️</summary>
```mdx
- ❌ Column-values endpoint serves data with outdated RLS filters.
- ❌ Users see values no longer allowed by tightened RLS.
- ⚠️ Dashboard native filters leak stale distinct dimension members.
```
</details>
<details>
<summary><b>Steps of Reproduction ✅ </b></summary>
```mdx
1. Configure a dataset with row-level security filters using the RLS REST
API at
`superset/row_level_security/api.py:64-76`, which persists
`RowLevelSecurityFilter`
entities defined in `superset/connectors/sqla/models.py:2234-2313` without
modifying any
`SqlaTable` `changed_on` field.
2. Log in as a user subject to those RLS filters and issue `GET
/api/v1/datasource/table/<id>/column/<column>/values/`, which hits
`DatasourceRestApi.get_column_values` in `superset/datasource/api.py:52-67`
and calls
`datasource.values_for_column(...)` (implemented in
`superset/models/helpers.py:2784-2849`), where
`get_sqla_row_level_filters()` is applied
at `helpers.py:51-53` to enforce RLS in the generated SQL.
3. Observe that on this first call the code at
`superset/datasource/api.py:141-157`
computes a cache key from `{"uid": datasource.uid, "col": column_name,
"limit": row_limit,
"denorm": denormalize_column, "user": get_user_id(), "changed_on":
str(getattr(datasource,
"changed_on", ""))}` and, after executing the query, stores the RLS-filtered
payload via
`cache_manager.data_cache.set(cache_key, payload, timeout=timeout)` at
`api.py:203-207`.
4. While the cache TTL (defaulting to `CACHE_DEFAULT_TIMEOUT` 300s) is still
active,
tighten or otherwise change the RLS rule for this dataset via `RLSRestApi`
(which updates
`RowLevelSecurityFilter.changed_on` but not `datasource.changed_on`), then
call the same
column-values endpoint again as the same user; because `get_column_values`
only keys on
`uid`, `col`, `limit`, `denorm`, `user`, and `datasource.changed_on` (and
does not include
any RLS state), the code at `superset/datasource/api.py:159-168` returns the
cached
payload and never re-runs `values_for_column()`, so the response still
contains values
allowed under the old RLS policy but now disallowed, until the cache entry
expires.
```
</details>
[Fix in
Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=710b77e2d7574c21b84778eebe960c6f&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
| [Fix in VSCode
Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=710b77e2d7574c21b84778eebe960c6f&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
*(Use Cmd/Ctrl + Click for best experience)*
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/datasource/api.py
**Line:** 151:152
**Comment:**
*Security: The cache key only varies by dataset, column, limit,
denormalization flag, user id, and dataset `changed_on`, but it does not vary
when row-level-security policies themselves change. If an RLS rule is
tightened/changed for the same user, stale cached values can still be returned
until TTL expiry, leaking values that should now be filtered out. Include an
RLS policy fingerprint (or latest RLS update timestamp/role signature) in the
key, or bypass this cache when RLS filters are active.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask
user if the user wants to fix the rest of the comments as well. if said yes,
then fetch all the comments validate the correctness and implement a minimal fix
```
</details>
<a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40839&comment_hash=0ff7b4987462cb957ecedd8485ddbf827e6f7b92a46cf5f05ef28227899cba71&reaction=like'>👍</a>
| <a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40839&comment_hash=0ff7b4987462cb957ecedd8485ddbf827e6f7b92a46cf5f05ef28227899cba71&reaction=dislike'>👎</a>
##########
superset/datasource/api.py:
##########
@@ -124,13 +131,48 @@ def get_column_values(
row_limit = apply_max_row_limit(app.config["FILTER_SELECT_ROW_LIMIT"])
denormalize_column = not datasource.normalize_columns
+
+ # Cache distinct column-value results so a dashboard with many filters
+ # backed by the same (often heavy) virtual dataset doesn't re-execute
+ # the wrapping query per filter (#39342). The cache key includes the
+ # user id so RLS-filtered datasources can't leak values across users,
+ # and the dataset's ``changed_on`` so an edit to the underlying SQL
+ # busts cached entries on the next request.
+ force = parse_boolean_string(request.args.get("force"))
+ cache_key = (
+ "col_values:"
+ + hashlib.sha256(
+ json.dumps(
+ {
+ "uid": datasource.uid,
+ "col": column_name,
+ "limit": row_limit,
+ "denorm": denormalize_column,
+ "user": get_user_id(),
+ "changed_on": str(getattr(datasource, "changed_on",
"")),
+ },
+ sort_keys=True,
+ ).encode()
+ ).hexdigest()
+ )
+
+ if (
+ not force
+ and (cached := cache_manager.data_cache.get(cache_key)) is not None
+ ):
+ logger.debug(
+ "column-values cache HIT: uid=%s col=%s", datasource.uid,
column_name
+ )
+ response = self.response(200, result=cached)
+ response.headers["X-Cache-Status"] = "HIT"
+ return response
+
try:
payload = datasource.values_for_column(
column_name=column_name,
limit=row_limit,
denormalize_column=denormalize_column,
)
Review Comment:
**Suggestion:** This is a check-then-act cache pattern without
synchronization, so concurrent identical requests can all miss before the first
write and each execute the expensive query. On dashboard load (multiple filters
firing in parallel), this causes a thundering-herd race and defeats the
intended performance improvement. Add a per-key lock/single-flight mechanism
around the miss path so only one request computes and others wait/read the
populated value. [race condition]
<details>
<summary><b>Severity Level:</b> Major ⚠️</summary>
```mdx
- ⚠️ Heavy virtual datasets re-execute filter queries on cache cold-start.
- ⚠️ Dashboard filter loads generate avoidable duplicate database queries.
- ⚠️ Parallel filter requests increase load and response latency.
```
</details>
<details>
<summary><b>Steps of Reproduction ✅ </b></summary>
```mdx
1. Deploy Superset with a virtual dataset whose SQL is expensive
(joins/aggregations) and
expose it via a dashboard; native filter controls for this dataset issue
HTTP requests to
`GET /api/v1/datasource/<type>/<id>/column/<column>/values/` as exercised in
`tests/integration_tests/datasource/api_tests.py:64-75`.
2. Start Superset with multiple worker processes/threads so that concurrent
browser filter
requests are handled in parallel; ensure the column-values cache is cold for
a given
`{datasource, column, user}` combination (no existing `col_values:` entry).
3. From multiple browser sessions (or a load test), trigger simultaneous
requests for the
same dataset and column (same user), causing multiple `get_column_values`
invocations at
`superset/datasource/api.py:52-67` with identical `cache_key` computed at
`api.py:141-157`.
4. Because the cache logic at `superset/datasource/api.py:159-168` simply
performs a
`cache_manager.data_cache.get(cache_key)` and checks for `None` without any
locking or
single-flight mechanism, all concurrent requests see a cache miss, fall
through to the
`try:` block at `api.py:170-175`, and each calls
`datasource.values_for_column(...)`
(implemented in `superset/models/helpers.py:2784-2849`), resulting in the
expensive SQL
query being executed once per concurrent request instead of once, defeating
the intended
performance improvement for the first miss.
```
</details>
[Fix in
Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=2fc51c483f9d4149ae17a7628eb97d39&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
| [Fix in VSCode
Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=2fc51c483f9d4149ae17a7628eb97d39&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
*(Use Cmd/Ctrl + Click for best experience)*
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/datasource/api.py
**Line:** 159:175
**Comment:**
*Race Condition: This is a check-then-act cache pattern without
synchronization, so concurrent identical requests can all miss before the first
write and each execute the expensive query. On dashboard load (multiple filters
firing in parallel), this causes a thundering-herd race and defeats the
intended performance improvement. Add a per-key lock/single-flight mechanism
around the miss path so only one request computes and others wait/read the
populated value.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask
user if the user wants to fix the rest of the comments as well. if said yes,
then fetch all the comments validate the correctness and implement a minimal fix
```
</details>
<a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40839&comment_hash=a700916e844b86a821c28c0c4a1089b45e98eaf7e4b0db508cf4fa24f05cee14&reaction=like'>👍</a>
| <a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40839&comment_hash=a700916e844b86a821c28c0c4a1089b45e98eaf7e4b0db508cf4fa24f05cee14&reaction=dislike'>👎</a>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]