codeant-ai-for-open-source[bot] commented on code in PR #39976:
URL: https://github.com/apache/superset/pull/39976#discussion_r3226831510


##########
superset/sql/rls_splice.py:
##########
@@ -0,0 +1,482 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""
+RLS predicate injection via text splicing.
+
+Instead of round-tripping through sqlglot's generator (which transpiles
+dialect-specific functions like ``LAST_DAY`` into something else), this 
approach:
+
+  1. Parses the SQL with sqlglot — only to understand structure (scope tree).
+  2. Uses sqlglot's tokenizer to get byte-accurate positions for every token
+     in the original SQL string.
+  3. For each ``SELECT`` scope that references a table with an RLS predicate,
+     finds the exact byte offset to inject at — either the end of an existing
+     ``WHERE`` clause, or just before ``GROUP BY`` / ``ORDER BY`` / ``HAVING``
+     / ``LIMIT`` / the closing paren of a subquery.
+  4. Splices the predicate text directly into the original string at that
+     offset — never calling ``.sql()``, so the generator never runs.
+
+Result: everything outside the splice points is the original SQL, byte for
+byte. Dialect-specific functions, comments, and formatting are all preserved
+exactly.
+
+Known limitations:
+  - SQL that fails to parse under the chosen dialect raises a ``ParseError``.
+    A thin dialect subclass is still required for parsing — but only for
+    parsing, not generation.
+  - Predicate strings are spliced in as raw SQL. They must come from a trusted
+    source (the RLS config), not user input.
+"""
+
+from __future__ import annotations
+
+from typing import TYPE_CHECKING
+
+import sqlglot
+from sqlglot import exp
+from sqlglot.optimizer.scope import traverse_scope
+from sqlglot.tokens import Token, TokenType
+
+if TYPE_CHECKING:
+    from superset.sql.parse import Table
+
+
+# Token types that end a WHERE clause / FROM section at the current paren 
depth,
+# indicating where a new predicate must be inserted just before.
+_CLAUSE_ENDS = {
+    TokenType.GROUP_BY,
+    TokenType.HAVING,
+    TokenType.ORDER_BY,
+    TokenType.WINDOW,
+    TokenType.QUALIFY,
+    TokenType.LIMIT,
+    TokenType.FETCH,
+    TokenType.CLUSTER_BY,
+    TokenType.DISTRIBUTE_BY,
+    TokenType.SORT_BY,
+    TokenType.CONNECT_BY,
+    TokenType.START_WITH,
+    TokenType.UNION,
+    TokenType.INTERSECT,
+    TokenType.EXCEPT,
+}
+
+_JOIN_STARTS = {
+    TokenType.JOIN,
+    TokenType.STRAIGHT_JOIN,
+    TokenType.JOIN_MARKER,
+}
+
+
+def _splice_priority(text: str) -> int:
+    """
+    Priority for applying splices at the same offset.
+
+    Insert full SQL fragments (WHERE/ON/predicates) before closing parens so
+    wrapping splices like ``pred AND (existing)`` compose correctly.
+    """
+    return 1 if text != ")" else 0
+
+
+def _before_whitespace(sql: str, offset: int) -> int:
+    """Back up past any whitespace immediately before *offset*."""
+    while offset > 0 and sql[offset - 1] in (" ", "\t", "\n", "\r"):
+        offset -= 1
+    return offset
+
+
+def _before_trivia(sql: str, offset: int) -> int:
+    """
+    Back up past whitespace and adjacent comments immediately before *offset*.
+
+    This ensures insertion points land before inline/block comments that appear
+    between `FROM`/`WHERE` and the next clause keyword.
+    """
+    while True:
+        offset = _before_whitespace(sql, offset)
+
+        # Inline comment ending at offset, eg: "... -- comment\nGROUP BY".
+        line_start = sql.rfind("\n", 0, offset) + 1
+        inline_comment_start = sql.rfind("--", line_start, offset)
+        if inline_comment_start != -1:
+            offset = inline_comment_start
+            continue

Review Comment:
   **Suggestion:** The inline-comment detection treats any `--` between the 
start of the line and the clause boundary as a comment, even when it appears 
inside a string literal. This can move the insertion point into quoted text 
(for example in `WHERE note = '--x' GROUP BY ...`), causing the injected 
`)`/`WHERE` splice to corrupt SQL syntax. Restrict comment detection to real 
comment tokens (or use tokenizer metadata) instead of raw `rfind("--", ...)` on 
source text. [incorrect condition logic]
   
   <details>
   <summary><b>Severity Level:</b> Critical 🚨</summary>
   
   ```mdx
   - CRITICAL Splice-mode RLS can generate syntactically invalid SQL.
   - CRITICAL RLS predicates may not constrain rows as intended.
   - WARN Queries with `--` in strings break on affected engines.
   - WARN SQL Lab execution fails for specific RLS-filtered queries.
   ```
   </details>
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Configure an environment where splice-mode RLS is used: set a database 
engine spec's
   `rls_method` to `RLSMethod.AS_PREDICATE_SPLICE` (see comment in
   `superset/db_engine_specs/base.py:560-564` explaining when splice mode 
should be used).
   
   2. Run a SQL query through SQL Lab (or any path using the SQL executor) 
against that
   database with a table that has RLS predicates (so `apply_rls` is actually 
invoked). The
   query should include a string literal containing `--` before the next clause 
boundary, for
   example:
   
      `SELECT * FROM some_table t WHERE t.note = '--x' GROUP BY t.id`.
   
      RLS application for SQL Lab goes through 
`superset/sql/execution/executor.py:440-444`
      which calls `_apply_rls_to_script`, then `superset/utils/rls.py:32-67` 
(`apply_rls`)
      which calls `parsed_statement.apply_rls(...)`.
   
   3. With `method=RLSMethod.AS_PREDICATE_SPLICE`, `SQLStatement.apply_rls` in
   `superset/sql/parse.py:972-995` dispatches to `_apply_rls_splice`, which in 
turn calls
   `apply_rls_splice` in `superset/sql/rls_splice.py:149-198`. For each 
affected SELECT
   scope, `_splices_for_scope` (lines 201-255) calls `_find_where_splice` 
(lines 386-405) to
   inject predicates into the WHERE clause.
   
   4. Inside `_find_where_splice`, `_find_condition_end` (lines 357-383) is 
used to locate
   the end of the existing WHERE condition, and it calls `_before_trivia(sql, 
tok.start)` at
   lines 373-381 to back up over comments/whitespace before the `GROUP BY` 
token.
   `_before_trivia` at `superset/sql/rls_splice.py:103-127` computes 
`line_start` and then
   uses `sql.rfind("--", line_start, offset)` (lines 114-116). For the sample 
query, this
   `rfind` matches the `--` inside the string literal `'--x'`, so 
`inline_comment_start`
   points into the quoted text. The function then sets `offset = 
inline_comment_start` (line
   118), causing `_find_condition_end` to return a `body_end` index inside the 
string
   literal. When `_find_where_splice` inserts `")"` at `body_end` (line 17-20 
in the
   `_find_where_splice` snippet), the closing parenthesis is spliced into the 
middle of the
   string literal instead of before `GROUP BY`, corrupting the SQL (typically 
resulting in a
   syntax error or malformed condition where the RLS predicate is not actually 
applied as
   intended).
   ```
   </details>
   
   [Fix in 
Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt=This%20is%20a%20comment%20left%20during%20a%20code%20review.%0A%0A%2A%2APath%3A%2A%2A%20superset%2Fsql%2Frls_splice.py%0A%2A%2ALine%3A%2A%2A%20114%3A118%0A%2A%2AComment%3A%2A%2A%0A%09%2AIncorrect%20Condition%20Logic%3A%20The%20inline-comment%20detection%20treats%20any%20%60--%60%20between%20the%20start%20of%20the%20line%20and%20the%20clause%20boundary%20as%20a%20comment%2C%20even%20when%20it%20appears%20inside%20a%20string%20literal.%20This%20can%20move%20the%20insertion%20point%20into%20quoted%20text%20%28for%20example%20in%20%60WHERE%20note%20%3D%20%27--x%27%20GROUP%20BY%20...%60%29%2C%20causing%20the%20injected%20%60%29%60%2F%60WHERE%60%20splice%20to%20corrupt%20SQL%20syntax.%20Restrict%20comment%20detection%20to%20real%20comment%20tokens%20%28or%20use%20tokenizer%20metadata%29%20instead%20of%20raw%20%60rfind%28%22--%22%2C%20...%29%60%20on%20source%20text.%0A%0AValidate%20the%20correctness%20of%20the%20flagged
 
%20issue.%20If%20correct%2C%20How%20can%20I%20resolve%20this%3F%20If%20you%20propose%20a%20fix%2C%20implement%20it%20and%20please%20make%20it%20concise.%0AOnce%20fix%20is%20implemented%2C%20also%20check%20other%20comments%20on%20the%20same%20PR%2C%20and%20ask%20user%20if%20the%20user%20wants%20to%20fix%20the%20rest%20of%20the%20comments%20as%20well.%20if%20said%20yes%2C%20then%20fetch%20all%20the%20comments%20validate%20the%20correctness%20and%20implement%20a%20minimal%20fix%0A)
 | [Fix in VSCode 
Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt=This%20is%20a%20comment%20left%20during%20a%20code%20review.%0A%0A%2A%2APath%3A%2A%2A%20superset%2Fsql%2Frls_splice.py%0A%2A%2ALine%3A%2A%2A%20114%3A118%0A%2A%2AComment%3A%2A%2A%0A%09%2AIncorrect%20Condition%20Logic%3A%20The%20inline-comment%20detection%20treats%20any%20%60--%60%20between%20the%20start%20of%20the%20line%20and%20the%20clause%20boundary%20as%20a%20comment%2C%20even%20when%20it%20appears%20inside%20a%20string%2
 
0literal.%20This%20can%20move%20the%20insertion%20point%20into%20quoted%20text%20%28for%20example%20in%20%60WHERE%20note%20%3D%20%27--x%27%20GROUP%20BY%20...%60%29%2C%20causing%20the%20injected%20%60%29%60%2F%60WHERE%60%20splice%20to%20corrupt%20SQL%20syntax.%20Restrict%20comment%20detection%20to%20real%20comment%20tokens%20%28or%20use%20tokenizer%20metadata%29%20instead%20of%20raw%20%60rfind%28%22--%22%2C%20...%29%60%20on%20source%20text.%0A%0AValidate%20the%20correctness%20of%20the%20flagged%20issue.%20If%20correct%2C%20How%20can%20I%20resolve%20this%3F%20If%20you%20propose%20a%20fix%2C%20implement%20it%20and%20please%20make%20it%20concise.%0AOnce%20fix%20is%20implemented%2C%20also%20check%20other%20comments%20on%20the%20same%20PR%2C%20and%20ask%20user%20if%20the%20user%20wants%20to%20fix%20the%20rest%20of%20the%20comments%20as%20well.%20if%20said%20yes%2C%20then%20fetch%20all%20the%20comments%20validate%20the%20correctness%20and%20implement%20a%20minimal%20fix%0A)
   
   *(Use Cmd/Ctrl + Click for best experience)*
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/sql/rls_splice.py
   **Line:** 114:118
   **Comment:**
        *Incorrect Condition Logic: The inline-comment detection treats any 
`--` between the start of the line and the clause boundary as a comment, even 
when it appears inside a string literal. This can move the insertion point into 
quoted text (for example in `WHERE note = '--x' GROUP BY ...`), causing the 
injected `)`/`WHERE` splice to corrupt SQL syntax. Restrict comment detection 
to real comment tokens (or use tokenizer metadata) instead of raw `rfind("--", 
...)` on source text.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   Once fix is implemented, also check other comments on the same PR, and ask 
user if the user wants to fix the rest of the comments as well. if said yes, 
then fetch all the comments validate the correctness and implement a minimal fix
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F39976&comment_hash=f8a70d7ebd52e3d9a5a7f666cd3f7a10b41af08daa075dd2ab9ce1b15c3db216&reaction=like'>👍</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F39976&comment_hash=f8a70d7ebd52e3d9a5a7f666cd3f7a10b41af08daa075dd2ab9ce1b15c3db216&reaction=dislike'>👎</a>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to