codeant-ai-for-open-source[bot] commented on code in PR #38488:
URL: https://github.com/apache/superset/pull/38488#discussion_r2899453713


##########
superset/sql/parse.py:
##########
@@ -553,13 +553,18 @@ class SQLStatement(BaseSQLStatement[exp.Expression]):
     This class is used for all engines with dialects that can be parsed using 
sqlglot.
     """
 
+    # Optimizer hints have the form /*+ ... */
+    _OPTIMIZER_HINT_PATTERN = re.compile(r"/\*\+")
+
     def __init__(
         self,
         statement: str | None = None,
         engine: str = "base",
         ast: exp.Expression | None = None,
     ):
         self._dialect = SQLGLOT_DIALECTS.get(engine)
+        # Store original statement to detect optimizer hints
+        self._original_statement: str | None = statement

Review Comment:
   **Suggestion:** Optimizer hint detection in `format()` relies on 
`_original_statement`, but when `SQLStatement` is instantiated from an AST 
(e.g., via `split_script` or `optimize`), `_original_statement` is left as 
`None`, so queries that were parsed into ASTs lose their original text and any 
optimizer hints they contained, causing comment preservation to remain enabled 
and the original hint/comment corruption bug to persist for those code paths. 
Initializing `_original_statement` from the AST when no raw statement is 
provided ensures consistent hint detection. [logic error]
   
   <details>
   <summary><b>Severity Level:</b> Critical 🚨</summary>
   
   ```mdx
   - ❌ SQLScript.format outputs invalid SQL with optimizer hints.
   - ⚠️ Optimized statements lose hint-aware comment handling entirely.
   ```
   </details>
   
   ```suggestion
           self._original_statement: str | None = (
               statement or (ast.sql() if ast is not None else None)
           )
   ```
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. In `superset/sql/parse.py:~1190`, construct a `SQLScript` with a 
StarRocks/MySQL query
   containing an optimizer hint, e.g. `SQLScript("SELECT /*+ 
SET_VAR(query_timeout=5) */ col
   FROM t -- inline", engine="starrocks")`.
   
   2. Inside `SQLScript.__init__` (`superset/sql/parse.py:~1180`), 
`statement_class` is
   `SQLStatement`, and `statement_class.split_script(script, engine)` is called.
   
   3. In `SQLStatement.split_script` (`superset/sql/parse.py:~600`), 
`_parse(script, engine)`
   returns `exp.Expression` ASTs; `SQLStatement(ast=ast, engine=engine)` is 
then invoked for
   each AST with `statement=None`, so in `SQLStatement.__init__`
   (`superset/sql/parse.py:567`) `self._original_statement` is set to `None`.
   
   4. When the script is pretty-printed (e.g. `SQLScript.format(comments=True)` 
in
   `superset/sql/parse.py:~1200`), it calls `SQLStatement.format(comments)` for 
each
   statement; in `SQLStatement.format` (`superset/sql/parse.py:726-744`)
   `has_optimizer_hints` is computed from `self._original_statement` which is 
`None`, so
   `effective_comments` stays `True` and `Dialect.generate(..., comments=True, 
pretty=True)`
   lets sqlglot reposition comments inside the `/*+ ... */` hint block, 
reproducing the
   original syntax-corruption bug for this code path.
   ```
   </details>
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/sql/parse.py
   **Line:** 567:567
   **Comment:**
        *Logic Error: Optimizer hint detection in `format()` relies on 
`_original_statement`, but when `SQLStatement` is instantiated from an AST 
(e.g., via `split_script` or `optimize`), `_original_statement` is left as 
`None`, so queries that were parsed into ASTs lose their original text and any 
optimizer hints they contained, causing comment preservation to remain enabled 
and the original hint/comment corruption bug to persist for those code paths. 
Initializing `_original_statement` from the AST when no raw statement is 
provided ensures consistent hint detection.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F38488&comment_hash=e864f9fd3d85ef2b8dd58162b9daf1e0fe3d9732d6ed071a0804dd9778a40de0&reaction=like'>👍</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F38488&comment_hash=e864f9fd3d85ef2b8dd58162b9daf1e0fe3d9732d6ed071a0804dd9778a40de0&reaction=dislike'>👎</a>



##########
tests/unit_tests/sql/parse_tests.py:
##########
@@ -3134,3 +3134,66 @@ def test_backtick_invalid_sql_still_fails() -> None:
     sql = "SELECT * FROM `table` WHERE"
     with pytest.raises(SupersetParseError):
         SQLScript(sql, "base")
+
+
+def test_optimizer_hint_with_comment() -> None:
+    """
+    Test that optimizer hints are not corrupted by inline comments.
+
+    This is a regression test for issue #38189 where sqlglot incorrectly
+    injects converted inline comments (--) inside optimizer hint blocks
+    (/*+ ... */), breaking StarRocks/MySQL syntax.
+
+    The fix detects optimizer hints and disables comment preservation
+    to prevent sqlglot from corrupting the hint syntax.
+    """
+    # SQL with optimizer hint and inline comment
+    sql = """SELECT /*+ SET_VAR(query_timeout = 3000) */ col1, col2
+FROM my_table
+LIMIT 100
+
+-- increase timeout for large scans"""
+
+    script = SQLScript(sql, "mysql")
+    formatted = script.format(comments=True)
+
+    # The optimizer hint should remain intact, not have comments injected 
inside
+    assert "/*+ SET_VAR(query_timeout = 3000) */" in formatted
+    # The inline comment should be stripped (not converted to /* */ inside 
hint)
+    assert "query_timeout /* increase timeout" not in formatted
+    assert "3000) */" in formatted  # Hint closing should be intact
+
+
+def test_optimizer_hint_disabled_comments() -> None:
+    """
+    Test that comments are disabled when optimizer hints are present.
+    """
+    sql = "SELECT /*+ SET_VAR(query_timeout = 3000) */ col FROM t"
+
+    script = SQLScript(sql, "mysql")
+    statement = script.statements[0]
+
+    # Check that the optimizer hint pattern is detected
+    assert statement._original_statement is not None
+    assert "/*+ SET_VAR(query_timeout = 3000) */" in 
statement._original_statement
+
+    # When formatting with comments=True, it should still disable comments
+    # due to optimizer hint detection
+    formatted = statement.format(comments=True)
+    assert "/*+ SET_VAR(query_timeout = 3000) */" in formatted
+

Review Comment:
   **Suggestion:** The test that claims to verify that comments are disabled 
when optimizer hints are present never includes any inline comments in the SQL, 
so it doesn't actually exercise or assert the behavior it describes; as a 
result, regressions in comment handling around optimizer hints could pass this 
test undetected. You should include an inline comment in the SQL and assert 
that it is not preserved in the formatted output to truly validate the intended 
behavior. [logic error]
   
   <details>
   <summary><b>Severity Level:</b> Major ⚠️</summary>
   
   ```mdx
   - ⚠️ Optimizer hint comment regression may pass unit tests.
   - ⚠️ StarRocks/MySQL hint syntax corruption risk resurfaces silently.
   - ⚠️ CI coverage around sqlglot hint handling weakened.
   ```
   </details>
   
   ```suggestion
       def test_optimizer_hint_disabled_comments() -> None:
           """
           Test that comments are disabled when optimizer hints are present.
           """
           sql = """SELECT /*+ SET_VAR(query_timeout = 3000) */ col FROM t
       -- some inline comment that should not be preserved"""
   
           script = SQLScript(sql, "mysql")
           statement = script.statements[0]
   
           # Check that the optimizer hint pattern is detected
           assert statement._original_statement is not None
           assert "/*+ SET_VAR(query_timeout = 3000) */" in 
statement._original_statement
   
           # When formatting with comments=True, it should still disable 
comments
           # due to optimizer hint detection
           formatted = statement.format(comments=True)
           assert "/*+ SET_VAR(query_timeout = 3000) */" in formatted
           # The inline comment should not be preserved when optimizer hints 
are present
           assert "some inline comment" not in formatted
   ```
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Open `tests/unit_tests/sql/parse_tests.py` at lines 3168–3172 and read
   `test_optimizer_hint_disabled_comments`, whose docstring says "Test that 
comments are
   disabled when optimizer hints are present."
   
   2. Inspect the `sql` definition at line 3172: `sql = "SELECT /*+ 
SET_VAR(query_timeout =
   3000) */ col FROM t"`, which contains an optimizer hint but no `--` inline 
comment or
   other non-hint comments.
   
   3. Follow the test flow at lines 3174–3184: `SQLScript(sql, "mysql")` is 
created,
   `statement.format(comments=True)` is called, and the only behavior asserted 
is that the
   optimizer hint substring `"/*+ SET_VAR(query_timeout = 3000) */"` remains 
present in
   `formatted`.
   
   4. Note that nowhere in this test is any comment text asserted to be removed 
or not
   preserved, so the "comments are disabled" behavior is never exercised; a 
future regression
   in comment handling around optimizer hints in `SQLScript`/`SQLStatement` 
(implemented in
   `superset/sql/parse.py`, imported at the top of this file) could occur while 
this specific
   test still passes.
   ```
   </details>
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** tests/unit_tests/sql/parse_tests.py
   **Line:** 3168:3184
   **Comment:**
        *Logic Error: The test that claims to verify that comments are disabled 
when optimizer hints are present never includes any inline comments in the SQL, 
so it doesn't actually exercise or assert the behavior it describes; as a 
result, regressions in comment handling around optimizer hints could pass this 
test undetected. You should include an inline comment in the SQL and assert 
that it is not preserved in the formatted output to truly validate the intended 
behavior.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F38488&comment_hash=e895aa207be9f82bf70bdecbf7635fc75b8be93bfa8230454b1d9176b66d0626&reaction=like'>👍</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F38488&comment_hash=e895aa207be9f82bf70bdecbf7635fc75b8be93bfa8230454b1d9176b66d0626&reaction=dislike'>👎</a>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to