codeant-ai-for-open-source[bot] commented on code in PR #41125:
URL: https://github.com/apache/superset/pull/41125#discussion_r3463923230
##########
superset/sql/parse.py:
##########
@@ -1840,16 +1840,39 @@ def process_jinja_sql(
def sanitize_clause(clause: str, engine: str) -> str:
"""
- Make sure the SQL clause is valid.
+ Validate a SQL clause and return it unchanged.
+
+ The clause is parsed to ensure it is a single, well-formed statement. We
+ intentionally return the *original* text rather than a re-rendered version:
+ round-tripping user SQL through SQLGlot's dialect generator can silently
+ alter semantics. For example, the Postgres dialect (borrowed by several
+ engines) rewrites ``ROUND(AVG(x), n)`` to ``ROUND(CAST(AVG(x) AS DECIMAL),
+ n)``, which rounds the value to an integer before the explicit ``ROUND`` on
+ engines whose unqualified ``DECIMAL`` defaults to scale 0 (see #36113).
+
+ Comments are the one exception: a trailing line comment can comment out
+ surrounding SQL once the clause is embedded into a larger query (e.g.
+ wrapped in parentheses), so any clause that contains comments is
re-rendered
+ to normalize them into a safe form. That re-rendering uses the *base*
dialect
+ rather than the engine dialect, so it normalizes comments without
re-applying
+ the engine-specific rewrites (e.g. the Postgres ``ROUND``/``CAST`` rewrite
+ from #36113) that we deliberately avoid above. A trailing statement
+ terminator is likewise stripped, since callers embed the clause inside a
+ larger fragment (``WHERE (...)``) where a stray ``;`` would produce invalid
+ SQL.
"""
try:
statement = SQLStatement(clause, engine)
+ parsed = statement._parsed # pylint: disable=protected-access
+ if not any(node.comments for node in parsed.walk()):
+ return clause.rstrip().rstrip(";").rstrip()
+
Review Comment:
**Suggestion:** Returning the original clause text (minus trailing
semicolon) removes canonical normalization for most clauses, so semantically
equivalent filters with different whitespace/casing now hash to different query
cache keys and miss cache reuse. This is an observable performance regression
in the query caching path that currently assumes `sanitize_clause` normalizes
extras before key generation. Preserve a canonical form for cache-key
computation (even if execution uses verbatim SQL) to avoid cache fragmentation.
[cache]
<details>
<summary><b>Severity Level:</b> Major ⚠️</summary>
```mdx
- ⚠️ Query cache misses for equivalent extras WHERE/HAVING clauses.
- ⚠️ Increased database load when filters reused across charts.
```
</details>
<details>
<summary><b>Steps of Reproduction ✅ </b></summary>
```mdx
1. Create one chart with a custom SQL extras filter `WHERE col = 1` and
another chart with
`WHERE col = 1 ` (different whitespace/casing) so that `form_data["where"]`
differs
textually but is semantically identical in `BaseViz.query_obj()` at
`superset/viz.py:418-433`.
2. For both charts, `BaseViz.query_obj()` calls `sanitize_clause(clause,
engine)` for the
`where`/`having` fields at `superset/viz.py:418-425`; since these clauses
contain no
comments, `sanitize_clause()` takes the fast path at
`superset/sql/parse.py:1865-69` and
returns `clause.rstrip().rstrip(";").rstrip()`, preserving internal
whitespace/casing and
only trimming trailing semicolons and surrounding spaces.
3. The sanitized (but essentially non-normalized) clauses are embedded into
the `extras`
dict (`extras = {"having": ..., "time_grain_sqla": ..., "where": ...}`)
returned from
`BaseViz.query_obj()` at `superset/viz.py:426-33`, and then included
verbatim in the
`QueryObjectDict` passed into `BaseViz.get_df_payload()` and
`BaseViz.cache_key()` at
`superset/viz.py:475-85`.
4. In `BaseViz.cache_key()` (`superset/viz.py:475-35`), the query object
(including
`extras["where"]`/`extras["having"]`) is copied into `cache_dict`, combined
with other
fields (datasource UID, `extra_cache_keys`, row-level security, etc.),
serialized via
`self.json_dumps(cache_dict, sort_keys=True)`, and hashed with
`hash_from_str`;
semantically equivalent filters that differ only in whitespace/casing now
produce
different JSON representations and therefore different cache keys, leading
to fragmented
query caching and missed reuse for repeated executions of logically
identical filtered
queries.
```
</details>
[](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=67165b59157144deb0ee957f70a13cf5&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
[](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=67165b59157144deb0ee957f70a13cf5&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
*(Use Cmd/Ctrl + Click for best experience)*
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/sql/parse.py
**Line:** 1868:1869
**Comment:**
*Cache: Returning the original clause text (minus trailing semicolon)
removes canonical normalization for most clauses, so semantically equivalent
filters with different whitespace/casing now hash to different query cache keys
and miss cache reuse. This is an observable performance regression in the query
caching path that currently assumes `sanitize_clause` normalizes extras before
key generation. Preserve a canonical form for cache-key computation (even if
execution uses verbatim SQL) to avoid cache fragmentation.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask
user if the user wants to fix the rest of the comments as well. if said yes,
then fetch all the comments validate the correctness and implement a minimal fix
```
</details>
<a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F41125&comment_hash=e1f4f2896fcf413a11fb411dcb559b6adb0f491715c5d8804f6b876c24605c0c&reaction=like'>👍</a>
| <a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F41125&comment_hash=e1f4f2896fcf413a11fb411dcb559b6adb0f491715c5d8804f6b876c24605c0c&reaction=dislike'>👎</a>
##########
superset/sql/parse.py:
##########
@@ -1840,16 +1840,39 @@ def process_jinja_sql(
def sanitize_clause(clause: str, engine: str) -> str:
"""
- Make sure the SQL clause is valid.
+ Validate a SQL clause and return it unchanged.
+
+ The clause is parsed to ensure it is a single, well-formed statement. We
+ intentionally return the *original* text rather than a re-rendered version:
+ round-tripping user SQL through SQLGlot's dialect generator can silently
+ alter semantics. For example, the Postgres dialect (borrowed by several
+ engines) rewrites ``ROUND(AVG(x), n)`` to ``ROUND(CAST(AVG(x) AS DECIMAL),
+ n)``, which rounds the value to an integer before the explicit ``ROUND`` on
+ engines whose unqualified ``DECIMAL`` defaults to scale 0 (see #36113).
+
+ Comments are the one exception: a trailing line comment can comment out
+ surrounding SQL once the clause is embedded into a larger query (e.g.
+ wrapped in parentheses), so any clause that contains comments is
re-rendered
+ to normalize them into a safe form. That re-rendering uses the *base*
dialect
+ rather than the engine dialect, so it normalizes comments without
re-applying
+ the engine-specific rewrites (e.g. the Postgres ``ROUND``/``CAST`` rewrite
+ from #36113) that we deliberately avoid above. A trailing statement
+ terminator is likewise stripped, since callers embed the clause inside a
+ larger fragment (``WHERE (...)``) where a stray ``;`` would produce invalid
+ SQL.
"""
try:
statement = SQLStatement(clause, engine)
+ parsed = statement._parsed # pylint: disable=protected-access
+ if not any(node.comments for node in parsed.walk()):
+ return clause.rstrip().rstrip(";").rstrip()
+
return _normalized_generator(
- SQLGLOT_DIALECTS.get(engine),
+ None,
pretty=False,
comments=True,
).generate(
- statement._parsed, # pylint: disable=protected-access
+ parsed,
copy=True,
)
Review Comment:
**Suggestion:** Re-rendering comment-bearing clauses with the base SQLGlot
dialect can break engine-specific syntax after successful parsing (for example
identifier quoting or dialect-only expressions), because the AST is generated
with a different dialect than it was parsed for. This can make valid user
clauses fail at execution time only when a comment is present. Keep generation
in the same engine dialect (or only normalize comments without cross-dialect
generation) so syntax stays executable for that backend. [api mismatch]
<details>
<summary><b>Severity Level:</b> Major ⚠️</summary>
```mdx
- ❌ Charts using extras WHERE/HAVING with comments may fail execution.
- ⚠️ Dialect-specific SQL expressions normalized incorrectly for commented
clauses.
```
</details>
<details>
<summary><b>Steps of Reproduction ✅ </b></summary>
```mdx
1. Configure a chart that uses SQL extras WHERE/HAVING filters with
engine-specific syntax
and a trailing comment, stored in `form_data["where"]` or
`form_data["having"]` inside
`BaseViz.query_obj()` at `superset/viz.py:418-425` (from `superset/viz.py`
lines 400-479).
2. When `BaseViz.query_obj()` builds the query object, it calls
`sanitize_clause(clause,
engine)` for each non-empty `where`/`having` clause (same block at
`superset/viz.py:418-425`), passing the datasource database engine string.
3. Inside `sanitize_clause()` at `superset/sql/parse.py:1841-80`, the clause
is parsed as
a `SQLStatement(clause, engine)` using the engine-specific dialect; for
clauses containing
comments the branch at `superset/sql/parse.py:1865-78` re-renders the parsed
AST via
`_normalized_generator(None, pretty=False, comments=True).generate(parsed,
copy=True)`,
i.e., using the base SQLGlot dialect instead of the engine dialect.
4. The re-rendered string, now potentially stripped of engine-specific
syntax (identifier
quoting or dialect-only constructs), is then propagated as
`extras["where"]`/`extras["having"]` into the SQL building path
(`superset/models/helpers.py:3648-38` and
`superset/semantic_layers/mapper.py:413-33`),
where `_process_select_expression(..., engine=self.database.backend, ...)`
validates and
embeds it for execution; if the base-dialect rendering no longer matches
valid syntax for
that backend, the clause fails to parse or execute when a comment is
present, breaking the
chart query.
```
</details>
[](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=1706dc916b724b339cac14948f03a8b1&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
[](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=1706dc916b724b339cac14948f03a8b1&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
*(Use Cmd/Ctrl + Click for best experience)*
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/sql/parse.py
**Line:** 1870:1877
**Comment:**
*Api Mismatch: Re-rendering comment-bearing clauses with the base
SQLGlot dialect can break engine-specific syntax after successful parsing (for
example identifier quoting or dialect-only expressions), because the AST is
generated with a different dialect than it was parsed for. This can make valid
user clauses fail at execution time only when a comment is present. Keep
generation in the same engine dialect (or only normalize comments without
cross-dialect generation) so syntax stays executable for that backend.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask
user if the user wants to fix the rest of the comments as well. if said yes,
then fetch all the comments validate the correctness and implement a minimal fix
```
</details>
<a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F41125&comment_hash=dadbbdfac6eadffc753c8326bc4e6d13979a4064d8885c3c513c6dc08bb49292&reaction=like'>👍</a>
| <a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F41125&comment_hash=dadbbdfac6eadffc753c8326bc4e6d13979a4064d8885c3c513c6dc08bb49292&reaction=dislike'>👎</a>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]