codeant-ai-for-open-source[bot] commented on code in PR #41125:
URL: https://github.com/apache/superset/pull/41125#discussion_r3463923230


##########
superset/sql/parse.py:
##########
@@ -1840,16 +1840,39 @@ def process_jinja_sql(
 
 def sanitize_clause(clause: str, engine: str) -> str:
     """
-    Make sure the SQL clause is valid.
+    Validate a SQL clause and return it unchanged.
+
+    The clause is parsed to ensure it is a single, well-formed statement. We
+    intentionally return the *original* text rather than a re-rendered version:
+    round-tripping user SQL through SQLGlot's dialect generator can silently
+    alter semantics. For example, the Postgres dialect (borrowed by several
+    engines) rewrites ``ROUND(AVG(x), n)`` to ``ROUND(CAST(AVG(x) AS DECIMAL),
+    n)``, which rounds the value to an integer before the explicit ``ROUND`` on
+    engines whose unqualified ``DECIMAL`` defaults to scale 0 (see #36113).
+
+    Comments are the one exception: a trailing line comment can comment out
+    surrounding SQL once the clause is embedded into a larger query (e.g.
+    wrapped in parentheses), so any clause that contains comments is 
re-rendered
+    to normalize them into a safe form. That re-rendering uses the *base* 
dialect
+    rather than the engine dialect, so it normalizes comments without 
re-applying
+    the engine-specific rewrites (e.g. the Postgres ``ROUND``/``CAST`` rewrite
+    from #36113) that we deliberately avoid above. A trailing statement
+    terminator is likewise stripped, since callers embed the clause inside a
+    larger fragment (``WHERE (...)``) where a stray ``;`` would produce invalid
+    SQL.
     """
     try:
         statement = SQLStatement(clause, engine)
+        parsed = statement._parsed  # pylint: disable=protected-access
+        if not any(node.comments for node in parsed.walk()):
+            return clause.rstrip().rstrip(";").rstrip()
+

Review Comment:
   **Suggestion:** Returning the original clause text (minus trailing 
semicolon) removes canonical normalization for most clauses, so semantically 
equivalent filters with different whitespace/casing now hash to different query 
cache keys and miss cache reuse. This is an observable performance regression 
in the query caching path that currently assumes `sanitize_clause` normalizes 
extras before key generation. Preserve a canonical form for cache-key 
computation (even if execution uses verbatim SQL) to avoid cache fragmentation. 
[cache]
   
   <details>
   <summary><b>Severity Level:</b> Major ⚠️</summary>
   
   ```mdx
   - ⚠️ Query cache misses for equivalent extras WHERE/HAVING clauses.
   - ⚠️ Increased database load when filters reused across charts.
   ```
   </details>
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Create one chart with a custom SQL extras filter `WHERE col = 1` and 
another chart with
   `WHERE col = 1 ` (different whitespace/casing) so that `form_data["where"]` 
differs
   textually but is semantically identical in `BaseViz.query_obj()` at
   `superset/viz.py:418-433`.
   
   2. For both charts, `BaseViz.query_obj()` calls `sanitize_clause(clause, 
engine)` for the
   `where`/`having` fields at `superset/viz.py:418-425`; since these clauses 
contain no
   comments, `sanitize_clause()` takes the fast path at 
`superset/sql/parse.py:1865-69` and
   returns `clause.rstrip().rstrip(";").rstrip()`, preserving internal 
whitespace/casing and
   only trimming trailing semicolons and surrounding spaces.
   
   3. The sanitized (but essentially non-normalized) clauses are embedded into 
the `extras`
   dict (`extras = {"having": ..., "time_grain_sqla": ..., "where": ...}`) 
returned from
   `BaseViz.query_obj()` at `superset/viz.py:426-33`, and then included 
verbatim in the
   `QueryObjectDict` passed into `BaseViz.get_df_payload()` and 
`BaseViz.cache_key()` at
   `superset/viz.py:475-85`.
   
   4. In `BaseViz.cache_key()` (`superset/viz.py:475-35`), the query object 
(including
   `extras["where"]`/`extras["having"]`) is copied into `cache_dict`, combined 
with other
   fields (datasource UID, `extra_cache_keys`, row-level security, etc.), 
serialized via
   `self.json_dumps(cache_dict, sort_keys=True)`, and hashed with 
`hash_from_str`;
   semantically equivalent filters that differ only in whitespace/casing now 
produce
   different JSON representations and therefore different cache keys, leading 
to fragmented
   query caching and missed reuse for repeated executions of logically 
identical filtered
   queries.
   ```
   </details>
   
   [![Fix in 
Cursor](https://new-codeant-butcket.s3.us-west-1.amazonaws.com/badges/fix-in-cursor-flat.svg)](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=67165b59157144deb0ee957f70a13cf5&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
 [![Fix in VSCode 
Claude](https://new-codeant-butcket.s3.us-west-1.amazonaws.com/badges/fix-in-vscode-claude-flat.svg)](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=67165b59157144deb0ee957f70a13cf5&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
   
   *(Use Cmd/Ctrl + Click for best experience)*
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/sql/parse.py
   **Line:** 1868:1869
   **Comment:**
        *Cache: Returning the original clause text (minus trailing semicolon) 
removes canonical normalization for most clauses, so semantically equivalent 
filters with different whitespace/casing now hash to different query cache keys 
and miss cache reuse. This is an observable performance regression in the query 
caching path that currently assumes `sanitize_clause` normalizes extras before 
key generation. Preserve a canonical form for cache-key computation (even if 
execution uses verbatim SQL) to avoid cache fragmentation.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   Once fix is implemented, also check other comments on the same PR, and ask 
user if the user wants to fix the rest of the comments as well. if said yes, 
then fetch all the comments validate the correctness and implement a minimal fix
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F41125&comment_hash=e1f4f2896fcf413a11fb411dcb559b6adb0f491715c5d8804f6b876c24605c0c&reaction=like'>👍</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F41125&comment_hash=e1f4f2896fcf413a11fb411dcb559b6adb0f491715c5d8804f6b876c24605c0c&reaction=dislike'>👎</a>



##########
superset/sql/parse.py:
##########
@@ -1840,16 +1840,39 @@ def process_jinja_sql(
 
 def sanitize_clause(clause: str, engine: str) -> str:
     """
-    Make sure the SQL clause is valid.
+    Validate a SQL clause and return it unchanged.
+
+    The clause is parsed to ensure it is a single, well-formed statement. We
+    intentionally return the *original* text rather than a re-rendered version:
+    round-tripping user SQL through SQLGlot's dialect generator can silently
+    alter semantics. For example, the Postgres dialect (borrowed by several
+    engines) rewrites ``ROUND(AVG(x), n)`` to ``ROUND(CAST(AVG(x) AS DECIMAL),
+    n)``, which rounds the value to an integer before the explicit ``ROUND`` on
+    engines whose unqualified ``DECIMAL`` defaults to scale 0 (see #36113).
+
+    Comments are the one exception: a trailing line comment can comment out
+    surrounding SQL once the clause is embedded into a larger query (e.g.
+    wrapped in parentheses), so any clause that contains comments is 
re-rendered
+    to normalize them into a safe form. That re-rendering uses the *base* 
dialect
+    rather than the engine dialect, so it normalizes comments without 
re-applying
+    the engine-specific rewrites (e.g. the Postgres ``ROUND``/``CAST`` rewrite
+    from #36113) that we deliberately avoid above. A trailing statement
+    terminator is likewise stripped, since callers embed the clause inside a
+    larger fragment (``WHERE (...)``) where a stray ``;`` would produce invalid
+    SQL.
     """
     try:
         statement = SQLStatement(clause, engine)
+        parsed = statement._parsed  # pylint: disable=protected-access
+        if not any(node.comments for node in parsed.walk()):
+            return clause.rstrip().rstrip(";").rstrip()
+
         return _normalized_generator(
-            SQLGLOT_DIALECTS.get(engine),
+            None,
             pretty=False,
             comments=True,
         ).generate(
-            statement._parsed,  # pylint: disable=protected-access
+            parsed,
             copy=True,
         )

Review Comment:
   **Suggestion:** Re-rendering comment-bearing clauses with the base SQLGlot 
dialect can break engine-specific syntax after successful parsing (for example 
identifier quoting or dialect-only expressions), because the AST is generated 
with a different dialect than it was parsed for. This can make valid user 
clauses fail at execution time only when a comment is present. Keep generation 
in the same engine dialect (or only normalize comments without cross-dialect 
generation) so syntax stays executable for that backend. [api mismatch]
   
   <details>
   <summary><b>Severity Level:</b> Major ⚠️</summary>
   
   ```mdx
   - ❌ Charts using extras WHERE/HAVING with comments may fail execution.
   - ⚠️ Dialect-specific SQL expressions normalized incorrectly for commented 
clauses.
   ```
   </details>
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Configure a chart that uses SQL extras WHERE/HAVING filters with 
engine-specific syntax
   and a trailing comment, stored in `form_data["where"]` or 
`form_data["having"]` inside
   `BaseViz.query_obj()` at `superset/viz.py:418-425` (from `superset/viz.py` 
lines 400-479).
   
   2. When `BaseViz.query_obj()` builds the query object, it calls 
`sanitize_clause(clause,
   engine)` for each non-empty `where`/`having` clause (same block at
   `superset/viz.py:418-425`), passing the datasource database engine string.
   
   3. Inside `sanitize_clause()` at `superset/sql/parse.py:1841-80`, the clause 
is parsed as
   a `SQLStatement(clause, engine)` using the engine-specific dialect; for 
clauses containing
   comments the branch at `superset/sql/parse.py:1865-78` re-renders the parsed 
AST via
   `_normalized_generator(None, pretty=False, comments=True).generate(parsed, 
copy=True)`,
   i.e., using the base SQLGlot dialect instead of the engine dialect.
   
   4. The re-rendered string, now potentially stripped of engine-specific 
syntax (identifier
   quoting or dialect-only constructs), is then propagated as
   `extras["where"]`/`extras["having"]` into the SQL building path
   (`superset/models/helpers.py:3648-38` and 
`superset/semantic_layers/mapper.py:413-33`),
   where `_process_select_expression(..., engine=self.database.backend, ...)` 
validates and
   embeds it for execution; if the base-dialect rendering no longer matches 
valid syntax for
   that backend, the clause fails to parse or execute when a comment is 
present, breaking the
   chart query.
   ```
   </details>
   
   [![Fix in 
Cursor](https://new-codeant-butcket.s3.us-west-1.amazonaws.com/badges/fix-in-cursor-flat.svg)](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=1706dc916b724b339cac14948f03a8b1&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
 [![Fix in VSCode 
Claude](https://new-codeant-butcket.s3.us-west-1.amazonaws.com/badges/fix-in-vscode-claude-flat.svg)](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=1706dc916b724b339cac14948f03a8b1&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
   
   *(Use Cmd/Ctrl + Click for best experience)*
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/sql/parse.py
   **Line:** 1870:1877
   **Comment:**
        *Api Mismatch: Re-rendering comment-bearing clauses with the base 
SQLGlot dialect can break engine-specific syntax after successful parsing (for 
example identifier quoting or dialect-only expressions), because the AST is 
generated with a different dialect than it was parsed for. This can make valid 
user clauses fail at execution time only when a comment is present. Keep 
generation in the same engine dialect (or only normalize comments without 
cross-dialect generation) so syntax stays executable for that backend.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   Once fix is implemented, also check other comments on the same PR, and ask 
user if the user wants to fix the rest of the comments as well. if said yes, 
then fetch all the comments validate the correctness and implement a minimal fix
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F41125&comment_hash=dadbbdfac6eadffc753c8326bc4e6d13979a4064d8885c3c513c6dc08bb49292&reaction=like'>👍</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F41125&comment_hash=dadbbdfac6eadffc753c8326bc4e6d13979a4064d8885c3c513c6dc08bb49292&reaction=dislike'>👎</a>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to