codeant-ai-for-open-source[bot] commented on code in PR #39976: URL: https://github.com/apache/superset/pull/39976#discussion_r3226831510
########## superset/sql/rls_splice.py: ########## @@ -0,0 +1,482 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +""" +RLS predicate injection via text splicing. + +Instead of round-tripping through sqlglot's generator (which transpiles +dialect-specific functions like ``LAST_DAY`` into something else), this approach: + + 1. Parses the SQL with sqlglot — only to understand structure (scope tree). + 2. Uses sqlglot's tokenizer to get byte-accurate positions for every token + in the original SQL string. + 3. For each ``SELECT`` scope that references a table with an RLS predicate, + finds the exact byte offset to inject at — either the end of an existing + ``WHERE`` clause, or just before ``GROUP BY`` / ``ORDER BY`` / ``HAVING`` + / ``LIMIT`` / the closing paren of a subquery. + 4. Splices the predicate text directly into the original string at that + offset — never calling ``.sql()``, so the generator never runs. + +Result: everything outside the splice points is the original SQL, byte for +byte. Dialect-specific functions, comments, and formatting are all preserved +exactly. + +Known limitations: + - SQL that fails to parse under the chosen dialect raises a ``ParseError``. + A thin dialect subclass is still required for parsing — but only for + parsing, not generation. + - Predicate strings are spliced in as raw SQL. They must come from a trusted + source (the RLS config), not user input. +""" + +from __future__ import annotations + +from typing import TYPE_CHECKING + +import sqlglot +from sqlglot import exp +from sqlglot.optimizer.scope import traverse_scope +from sqlglot.tokens import Token, TokenType + +if TYPE_CHECKING: + from superset.sql.parse import Table + + +# Token types that end a WHERE clause / FROM section at the current paren depth, +# indicating where a new predicate must be inserted just before. +_CLAUSE_ENDS = { + TokenType.GROUP_BY, + TokenType.HAVING, + TokenType.ORDER_BY, + TokenType.WINDOW, + TokenType.QUALIFY, + TokenType.LIMIT, + TokenType.FETCH, + TokenType.CLUSTER_BY, + TokenType.DISTRIBUTE_BY, + TokenType.SORT_BY, + TokenType.CONNECT_BY, + TokenType.START_WITH, + TokenType.UNION, + TokenType.INTERSECT, + TokenType.EXCEPT, +} + +_JOIN_STARTS = { + TokenType.JOIN, + TokenType.STRAIGHT_JOIN, + TokenType.JOIN_MARKER, +} + + +def _splice_priority(text: str) -> int: + """ + Priority for applying splices at the same offset. + + Insert full SQL fragments (WHERE/ON/predicates) before closing parens so + wrapping splices like ``pred AND (existing)`` compose correctly. + """ + return 1 if text != ")" else 0 + + +def _before_whitespace(sql: str, offset: int) -> int: + """Back up past any whitespace immediately before *offset*.""" + while offset > 0 and sql[offset - 1] in (" ", "\t", "\n", "\r"): + offset -= 1 + return offset + + +def _before_trivia(sql: str, offset: int) -> int: + """ + Back up past whitespace and adjacent comments immediately before *offset*. + + This ensures insertion points land before inline/block comments that appear + between `FROM`/`WHERE` and the next clause keyword. + """ + while True: + offset = _before_whitespace(sql, offset) + + # Inline comment ending at offset, eg: "... -- comment\nGROUP BY". + line_start = sql.rfind("\n", 0, offset) + 1 + inline_comment_start = sql.rfind("--", line_start, offset) + if inline_comment_start != -1: + offset = inline_comment_start + continue Review Comment: **Suggestion:** The inline-comment detection treats any `--` between the start of the line and the clause boundary as a comment, even when it appears inside a string literal. This can move the insertion point into quoted text (for example in `WHERE note = '--x' GROUP BY ...`), causing the injected `)`/`WHERE` splice to corrupt SQL syntax. Restrict comment detection to real comment tokens (or use tokenizer metadata) instead of raw `rfind("--", ...)` on source text. [incorrect condition logic] <details> <summary><b>Severity Level:</b> Critical 🚨</summary> ```mdx - CRITICAL Splice-mode RLS can generate syntactically invalid SQL. - CRITICAL RLS predicates may not constrain rows as intended. - WARN Queries with `--` in strings break on affected engines. - WARN SQL Lab execution fails for specific RLS-filtered queries. ``` </details> <details> <summary><b>Steps of Reproduction ✅ </b></summary> ```mdx 1. Configure an environment where splice-mode RLS is used: set a database engine spec's `rls_method` to `RLSMethod.AS_PREDICATE_SPLICE` (see comment in `superset/db_engine_specs/base.py:560-564` explaining when splice mode should be used). 2. Run a SQL query through SQL Lab (or any path using the SQL executor) against that database with a table that has RLS predicates (so `apply_rls` is actually invoked). The query should include a string literal containing `--` before the next clause boundary, for example: `SELECT * FROM some_table t WHERE t.note = '--x' GROUP BY t.id`. RLS application for SQL Lab goes through `superset/sql/execution/executor.py:440-444` which calls `_apply_rls_to_script`, then `superset/utils/rls.py:32-67` (`apply_rls`) which calls `parsed_statement.apply_rls(...)`. 3. With `method=RLSMethod.AS_PREDICATE_SPLICE`, `SQLStatement.apply_rls` in `superset/sql/parse.py:972-995` dispatches to `_apply_rls_splice`, which in turn calls `apply_rls_splice` in `superset/sql/rls_splice.py:149-198`. For each affected SELECT scope, `_splices_for_scope` (lines 201-255) calls `_find_where_splice` (lines 386-405) to inject predicates into the WHERE clause. 4. Inside `_find_where_splice`, `_find_condition_end` (lines 357-383) is used to locate the end of the existing WHERE condition, and it calls `_before_trivia(sql, tok.start)` at lines 373-381 to back up over comments/whitespace before the `GROUP BY` token. `_before_trivia` at `superset/sql/rls_splice.py:103-127` computes `line_start` and then uses `sql.rfind("--", line_start, offset)` (lines 114-116). For the sample query, this `rfind` matches the `--` inside the string literal `'--x'`, so `inline_comment_start` points into the quoted text. The function then sets `offset = inline_comment_start` (line 118), causing `_find_condition_end` to return a `body_end` index inside the string literal. When `_find_where_splice` inserts `")"` at `body_end` (line 17-20 in the `_find_where_splice` snippet), the closing parenthesis is spliced into the middle of the string literal instead of before `GROUP BY`, corrupting the SQL (typically resulting in a syntax error or malformed condition where the RLS predicate is not actually applied as intended). ``` </details> [Fix in Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt=This%20is%20a%20comment%20left%20during%20a%20code%20review.%0A%0A%2A%2APath%3A%2A%2A%20superset%2Fsql%2Frls_splice.py%0A%2A%2ALine%3A%2A%2A%20114%3A118%0A%2A%2AComment%3A%2A%2A%0A%09%2AIncorrect%20Condition%20Logic%3A%20The%20inline-comment%20detection%20treats%20any%20%60--%60%20between%20the%20start%20of%20the%20line%20and%20the%20clause%20boundary%20as%20a%20comment%2C%20even%20when%20it%20appears%20inside%20a%20string%20literal.%20This%20can%20move%20the%20insertion%20point%20into%20quoted%20text%20%28for%20example%20in%20%60WHERE%20note%20%3D%20%27--x%27%20GROUP%20BY%20...%60%29%2C%20causing%20the%20injected%20%60%29%60%2F%60WHERE%60%20splice%20to%20corrupt%20SQL%20syntax.%20Restrict%20comment%20detection%20to%20real%20comment%20tokens%20%28or%20use%20tokenizer%20metadata%29%20instead%20of%20raw%20%60rfind%28%22--%22%2C%20...%29%60%20on%20source%20text.%0A%0AValidate%20the%20correctness%20of%20the%20flagged %20issue.%20If%20correct%2C%20How%20can%20I%20resolve%20this%3F%20If%20you%20propose%20a%20fix%2C%20implement%20it%20and%20please%20make%20it%20concise.%0AOnce%20fix%20is%20implemented%2C%20also%20check%20other%20comments%20on%20the%20same%20PR%2C%20and%20ask%20user%20if%20the%20user%20wants%20to%20fix%20the%20rest%20of%20the%20comments%20as%20well.%20if%20said%20yes%2C%20then%20fetch%20all%20the%20comments%20validate%20the%20correctness%20and%20implement%20a%20minimal%20fix%0A) | [Fix in VSCode Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt=This%20is%20a%20comment%20left%20during%20a%20code%20review.%0A%0A%2A%2APath%3A%2A%2A%20superset%2Fsql%2Frls_splice.py%0A%2A%2ALine%3A%2A%2A%20114%3A118%0A%2A%2AComment%3A%2A%2A%0A%09%2AIncorrect%20Condition%20Logic%3A%20The%20inline-comment%20detection%20treats%20any%20%60--%60%20between%20the%20start%20of%20the%20line%20and%20the%20clause%20boundary%20as%20a%20comment%2C%20even%20when%20it%20appears%20inside%20a%20string%2 0literal.%20This%20can%20move%20the%20insertion%20point%20into%20quoted%20text%20%28for%20example%20in%20%60WHERE%20note%20%3D%20%27--x%27%20GROUP%20BY%20...%60%29%2C%20causing%20the%20injected%20%60%29%60%2F%60WHERE%60%20splice%20to%20corrupt%20SQL%20syntax.%20Restrict%20comment%20detection%20to%20real%20comment%20tokens%20%28or%20use%20tokenizer%20metadata%29%20instead%20of%20raw%20%60rfind%28%22--%22%2C%20...%29%60%20on%20source%20text.%0A%0AValidate%20the%20correctness%20of%20the%20flagged%20issue.%20If%20correct%2C%20How%20can%20I%20resolve%20this%3F%20If%20you%20propose%20a%20fix%2C%20implement%20it%20and%20please%20make%20it%20concise.%0AOnce%20fix%20is%20implemented%2C%20also%20check%20other%20comments%20on%20the%20same%20PR%2C%20and%20ask%20user%20if%20the%20user%20wants%20to%20fix%20the%20rest%20of%20the%20comments%20as%20well.%20if%20said%20yes%2C%20then%20fetch%20all%20the%20comments%20validate%20the%20correctness%20and%20implement%20a%20minimal%20fix%0A) *(Use Cmd/Ctrl + Click for best experience)* <details> <summary><b>Prompt for AI Agent 🤖 </b></summary> ```mdx This is a comment left during a code review. **Path:** superset/sql/rls_splice.py **Line:** 114:118 **Comment:** *Incorrect Condition Logic: The inline-comment detection treats any `--` between the start of the line and the clause boundary as a comment, even when it appears inside a string literal. This can move the insertion point into quoted text (for example in `WHERE note = '--x' GROUP BY ...`), causing the injected `)`/`WHERE` splice to corrupt SQL syntax. Restrict comment detection to real comment tokens (or use tokenizer metadata) instead of raw `rfind("--", ...)` on source text. Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise. Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix ``` </details> <a href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F39976&comment_hash=f8a70d7ebd52e3d9a5a7f666cd3f7a10b41af08daa075dd2ab9ce1b15c3db216&reaction=like'>👍</a> | <a href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F39976&comment_hash=f8a70d7ebd52e3d9a5a7f666cd3f7a10b41af08daa075dd2ab9ce1b15c3db216&reaction=dislike'>👎</a> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
