codeant-ai-for-open-source[bot] commented on code in PR #41120:
URL: https://github.com/apache/superset/pull/41120#discussion_r3425030894


##########
superset/config.py:
##########
@@ -1872,6 +1872,21 @@ def engine_context_manager(  # pylint: 
disable=unused-argument
         "pg_read_file",
         "pg_ls_dir",
         "pg_read_binary_file",
+        # PostgreSQL large-object functions: writers can plant arbitrary
+        # bytes on the server filesystem (lo_export, lo_from_bytea, lowrite,
+        # lo_put, lo_create, lo_creat, lo_import), readers can pull bytes back
+        # out (lo_get, loread), and lo_unlink deletes large objects outright.
+        # Defense-in-depth on top of is_mutating()'s function-name check.
+        "lo_from_bytea",
+        "lo_export",
+        "lo_import",
+        "lo_put",
+        "lo_create",
+        "lo_creat",
+        "lowrite",
+        "lo_get",
+        "loread",
+        "lo_unlink",

Review Comment:
   **Suggestion:** The new PostgreSQL large-object denylist is incomplete: 
mutating functions like `lo_truncate`/`lo_truncate64` are not included, so a 
user can still alter large objects via `SELECT lo_truncate(...)`. Because these 
calls are plain function invocations and are not covered by the current 
mutating-function detector, this creates a read-only/DML bypass path unless 
they are explicitly denylisted too. [incomplete implementation]
   
   <details>
   <summary><b>Severity Level:</b> Critical 🚨</summary>
   
   ```mdx
   - ❌ Read-only SQL Lab queries can still truncate large objects.
   - ❌ DML guard `database.allow_dml` bypassable via `lo_truncate` calls.
   - ⚠️ Mutation detection misclassifies large-object truncation as read-only.
   ```
   </details>
   <details>
   <summary><b>Steps of Reproduction βœ… </b></summary>
   
   ```mdx
   1. Configure a PostgreSQL database in Superset with engine name 
`"postgresql"` and default
   `DISALLOWED_SQL_FUNCTIONS` from `superset/config.py:1840-1919`, then set
   `database.allow_dml=False` for that connection (read-only).
   
   2. In SQL Lab, submit `SELECT lo_truncate(12345, 0);` against that database; 
the request
   flows through `superset/sql_lab.py:400-459`, which builds `parsed_script =
   SQLScript(rendered_query, engine=db_engine_spec.engine)` and loads 
`disallowed_functions =
   app.config["DISALLOWED_SQL_FUNCTIONS"].get(db_engine_spec.engine, set())` 
followed by
   `parsed_script.check_functions_present(disallowed_functions)`.
   
   3. Because `lo_truncate`/`lo_truncate64` are not listed in the PostgreSQL 
block at
   `superset/config.py:1875-1890`, `parsed_script.check_functions_present` 
returns False and
   the disallowed-function gate in `sql_lab.py:424-431` does not fire; the code 
then calls
   `parsed_script.has_mutation()` (implemented in 
`superset/sql/parse.py:1748-1754`) to
   enforce `if parsed_script.has_mutation() and not database.allow_dml: raise
   SupersetDMLNotAllowedException()`.
   
   4. Inside `SQLStatement.is_mutating` in `superset/sql/parse.py:620-939`, 
only function
   names in `_MUTATING_FUNCTION_NAMES` (containing `LO_FROM_BYTEA`, 
`LO_EXPORT`, `LO_IMPORT`,
   `LO_PUT`, `LO_CREATE`, `LOWRITE`, `LO_UNLINK`, `SETVAL`, `NEXTVAL`) are 
treated as
   mutating exp.Anonymous calls; `LO_TRUNCATE`/`LO_TRUNCATE64` are missing, so
   `is_mutating()` and thus `has_mutation()` both return False, allowing `SELECT
   lo_truncate(...)` to execute and truncate a large object even though
   `database.allow_dml=False`, demonstrating a read-only/DML bypass via an 
un-denylisted
   mutating large-object function.
   ```
   </details>
   
   [![Fix in 
Cursor](https://new-codeant-butcket.s3.us-west-1.amazonaws.com/badges/fix-in-cursor-flat.svg)](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=bd8f926f341543459dfab70d7d17739b&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
 [![Fix in VSCode 
Claude](https://new-codeant-butcket.s3.us-west-1.amazonaws.com/badges/fix-in-vscode-claude-flat.svg)](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=bd8f926f341543459dfab70d7d17739b&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
   
   *(Use Cmd/Ctrl + Click for best experience)*
   <details>
   <summary><b>Prompt for AI Agent πŸ€– </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/config.py
   **Line:** 1875:1889
   **Comment:**
        *Incomplete Implementation: The new PostgreSQL large-object denylist is 
incomplete: mutating functions like `lo_truncate`/`lo_truncate64` are not 
included, so a user can still alter large objects via `SELECT 
lo_truncate(...)`. Because these calls are plain function invocations and are 
not covered by the current mutating-function detector, this creates a 
read-only/DML bypass path unless they are explicitly denylisted too.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   Once fix is implemented, also check other comments on the same PR, and ask 
user if the user wants to fix the rest of the comments as well. if said yes, 
then fetch all the comments validate the correctness and implement a minimal fix
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F41120&comment_hash=2940d9d0ce6a1d8ae5b45b9113de291f2d4cd4ede64fd66ca1e64bd50949c6d4&reaction=like'>πŸ‘</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F41120&comment_hash=2940d9d0ce6a1d8ae5b45b9113de291f2d4cd4ede64fd66ca1e64bd50949c6d4&reaction=dislike'>πŸ‘Ž</a>



##########
superset/models/helpers.py:
##########
@@ -1221,24 +1221,25 @@ def _process_sql_expression(  # pylint: 
disable=too-many-arguments
                     disallowed_functions
                 ):
                     raise 
SupersetDisallowedSQLFunctionException(disallowed_functions)
-                if disallowed_tables and 
parsed.check_tables_present(disallowed_tables):
+                if disallowed_tables:
                     # Report only the tables actually found in the expression,
                     # mirroring the canonical execution-time gate in
                     # `superset.sql_lab._validate_query` so the user-facing
-                    # error doesn't echo the operator's full denylist.
-                    present_tables = {
-                        table.table.lower()
-                        for statement in parsed.statements
-                        for table in statement.tables
-                    }
-                    found_tables = {
-                        table
-                        for table in disallowed_tables
-                        if table.lower() in present_tables
-                    }
-                    raise SupersetDisallowedSQLTableException(
-                        found_tables or disallowed_tables
+                    # error doesn't echo the operator's full denylist. Honors
+                    # schema-qualified denylist entries (e.g.
+                    # ``information_schema.tables``) and resolves unqualified
+                    # references against the effective schema, falling back to
+                    # the database default when none was supplied (e.g. a
+                    # virtual dataset with no schema) so a qualified entry
+                    # cannot be bypassed by an unqualified reference.
+                    effective_schema = schema or 
self.database.get_default_schema(
+                        self.catalog
                     )

Review Comment:
   **Suggestion:** This new fallback performs a database-inspector lookup for 
every adhoc expression when `schema` is empty, and `_process_sql_expression` is 
called inside loops for selected columns/metrics/order-bys. That creates an N+1 
pattern of extra DB round-trips in chart queries (and can be especially 
expensive on remote/OAuth-backed databases). Resolve the effective schema once 
per query build and reuse it across expression validations. [performance]
   
   <details>
   <summary><b>Severity Level:</b> Major ⚠️</summary>
   
   ```mdx
   - ⚠️ Explore queries on schema-less datasets incur extra inspector calls.
   - ⚠️ SQL Lab adhoc columns on schema-less queries add metadata trips.
   - ⚠️ Latency increases for remote or OAuth-backed databases during chart 
build.
   ```
   </details>
   <details>
   <summary><b>Steps of Reproduction βœ… </b></summary>
   
   ```mdx
   1. Configure a database engine with DISALLOWED_SQL_TABLES set for its engine 
name so the
   table denylist branch in `_process_sql_expression` executes (see
   `superset/models/helpers.py:1199-1208` where `disallowed_tables` is read 
from `app.config`
   and gated).
   
   2. Use a SQLAlchemy dataset (`SqlaTable`) with `schema` left NULL (schema is 
explicitly
   optional per the comment and nullable column at
   `superset/connectors/sqla/models.py:1322-1327`), e.g. a virtual dataset where
   `SqlaTable.sql` is populated and `SqlaTable.schema` is None.
   
   3. Build a chart on this dataset with multiple adhoc SQL expressions 
(metrics, groupby,
   and/or ORDER BY), triggering per-expression calls to 
`_process_select_expression` and
   `_process_orderby_expression` that forward to `_process_sql_expression` (see
   `superset/models/helpers.py:1245-1268`, 
`superset/models/helpers.py:1279-1293`, and call
   sites in `superset/connectors/sqla/models.py:1648-1687`,
   `superset/models/helpers.py:2440-2479`, 
`superset/models/helpers.py:3220-3233`, and
   `superset/models/helpers.py:3140-3179`).
   
   4. For each such expression where `schema` is empty, 
`_process_sql_expression` computes
   `effective_schema = schema or self.database.get_default_schema(self.catalog)`
   (`superset/models/helpers.py:1235-1237`), which delegates to 
`Database.get_default_schema`
   and then `BaseEngineSpec.get_default_schema` 
(`superset/models/core.py:657-661`,
   `superset/db_engine_specs/base.py:926-931`); the latter opens a fresh 
inspector via
   `database.get_inspector(...)` on every call, creating an N+1 pattern of 
metadata
   round-trips proportional to the number of adhoc expressions in the chart.
   ```
   </details>
   
   [![Fix in 
Cursor](https://new-codeant-butcket.s3.us-west-1.amazonaws.com/badges/fix-in-cursor-flat.svg)](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=922deeefa3e149f2bb2c93a163300f0a&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
 [![Fix in VSCode 
Claude](https://new-codeant-butcket.s3.us-west-1.amazonaws.com/badges/fix-in-vscode-claude-flat.svg)](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=922deeefa3e149f2bb2c93a163300f0a&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
   
   *(Use Cmd/Ctrl + Click for best experience)*
   <details>
   <summary><b>Prompt for AI Agent πŸ€– </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/models/helpers.py
   **Line:** 1235:1237
   **Comment:**
        *Performance: This new fallback performs a database-inspector lookup 
for every adhoc expression when `schema` is empty, and 
`_process_sql_expression` is called inside loops for selected 
columns/metrics/order-bys. That creates an N+1 pattern of extra DB round-trips 
in chart queries (and can be especially expensive on remote/OAuth-backed 
databases). Resolve the effective schema once per query build and reuse it 
across expression validations.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   Once fix is implemented, also check other comments on the same PR, and ask 
user if the user wants to fix the rest of the comments as well. if said yes, 
then fetch all the comments validate the correctness and implement a minimal fix
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F41120&comment_hash=cca2dff418b6af7b3a5ab2b6d5d8b967fba609608d7b97ff20bff13fed678488&reaction=like'>πŸ‘</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F41120&comment_hash=cca2dff418b6af7b3a5ab2b6d5d8b967fba609608d7b97ff20bff13fed678488&reaction=dislike'>πŸ‘Ž</a>



##########
superset/models/helpers.py:
##########
@@ -1465,19 +1466,23 @@ def _raise_for_disallowed_sql(self, sql: str) -> None:
             disallowed_functions
         ):
             raise SupersetDisallowedSQLFunctionException(disallowed_functions)
-        if disallowed_tables and 
parsed_script.check_tables_present(disallowed_tables):
+        if disallowed_tables:
             # Report only the tables actually found in the query, mirroring the
             # canonical execution-time gate so the user-facing error doesn't
-            # echo the operator's full denylist.
-            present_tables = {
-                table.table.lower()
-                for statement in parsed_script.statements
-                for table in statement.tables
-            }
-            found_tables = {
-                table for table in disallowed_tables if table.lower() in 
present_tables
-            }
-            raise SupersetDisallowedSQLTableException(found_tables or 
disallowed_tables)
+            # echo the operator's full denylist. Honors schema-qualified
+            # denylist entries (e.g. ``information_schema.tables``) and 
resolves
+            # unqualified references against the effective schema, falling back
+            # to the database default when the datasource has none (e.g. a
+            # virtual dataset) so a qualified entry cannot be bypassed by an
+            # unqualified reference.
+            effective_schema = self.schema or self.database.get_default_schema(
+                self.catalog
+            )

Review Comment:
   **Suggestion:** This uses `get_default_schema`, but the canonical SQL 
execution gate resolves schema with query-aware logic 
(`get_default_schema_for_query`) to account for dynamic-schema engines and 
URI/connect-arg schema rules. Using the static resolver here can produce a 
different effective schema than runtime, causing denylist checks to be 
inconsistent (false allow or false block) for unqualified table references. 
Align this path with the query-aware schema resolution used by SQL 
Lab/executor. [api mismatch]
   
   <details>
   <summary><b>Severity Level:</b> Critical 🚨</summary>
   
   ```mdx
   - ❌ Explore queries may bypass DISALLOWED_SQL_TABLES under dynamic schemas.
   - ❌ SQL Lab and Explore enforce table denylist inconsistently.
   - ⚠️ Operators cannot rely on uniform denylist behavior across surfaces.
   ```
   </details>
   <details>
   <summary><b>Steps of Reproduction βœ… </b></summary>
   
   ```mdx
   1. Note that SQL Lab and the SQL executor resolve the effective schema for 
denylist checks
   through the query-aware path: `sql_lab.execute_sql_statements` and the 
estimate command
   call `database.get_default_schema_for_query(query)` once and reuse it (see
   `superset/sql_lab.py:420-479` and 
`superset/commands/sql_lab/estimate.py:90-129`), while
   the executor resolves it via `_resolve_query_schema` which builds a 
transient `Query` and
   calls `Database.get_default_schema_for_query` (see
   `superset/sql/execution/executor.py:714-753`).
   
   2. Observe that `Database.get_default_schema_for_query` ultimately uses
   `BaseEngineSpec.get_default_schema_for_query` 
(`superset/models/core.py:663-681`,
   `superset/db_engine_specs/base.py:945-1024`), which is explicitly designed 
to honor
   dynamic-schema engines and URI/connect-arg configured schemas (e.g. Presto, 
Snowflake,
   StarRocks override `get_schema_from_engine_params` at
   `superset/db_engine_specs/presto.py:356-395`,
   `superset/db_engine_specs/snowflake.py:235-243`,
   `superset/db_engine_specs/starrocks.py:296-304`) and only falls back to
   `get_default_schema` as a last resort.
   
   3. In contrast, the chart/Explore path's gate `_raise_for_disallowed_sql` 
(called for
   every non–SQL Lab query via `SqlaTable.query` at 
`superset/models/helpers.py:189-201`)
   computes `effective_schema = self.schema or
   self.database.get_default_schema(self.catalog)` 
(`superset/models/helpers.py:1478-1480`),
   which bypasses `get_default_schema_for_query` and directly invokes
   `BaseEngineSpec.get_default_schema` (inspector-based,
   `superset/db_engine_specs/base.py:926-931`), ignoring per-query schema 
resolution and any
   engine-specific URI/connect-arg rules used in the canonical execution gate.
   
   4. On an engine where `get_default_schema_for_query` can differ from
   `get_default_schema`β€”for example, a Presto connection whose schema is 
derived from the
   SQLAlchemy URI via `PrestoEngineSpec.get_schema_from_engine_params`
   (`superset/db_engine_specs/presto.py:356-395`)β€”run a query containing an 
unqualified
   reference to a denylisted table (DISALLOWED_SQL_TABLES read in 
`_raise_for_disallowed_sql`
   at `superset/models/helpers.py:1458-1462` and enforced via
   `parsed_script.get_disallowed_tables(..., effective_schema)` at
   `superset/models/helpers.py:1481-1483`). The SQL Lab path will resolve the 
effective
   schema via `get_default_schema_for_query` and may correctly block it, while 
the Explore
   path uses `get_default_schema` and can compute a different 
`effective_schema`, causing the
   same unqualified reference to be either falsely allowed or falsely blocked, 
violating the
   "mirror the DISALLOWED_SQL_* gate" contract documented in 
`_raise_for_disallowed_sql`'s
   docstring (`superset/models/helpers.py:1453-1457`).
   ```
   </details>
   
   [![Fix in 
Cursor](https://new-codeant-butcket.s3.us-west-1.amazonaws.com/badges/fix-in-cursor-flat.svg)](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=600da8db224345159a17bd9acd84ed37&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
 [![Fix in VSCode 
Claude](https://new-codeant-butcket.s3.us-west-1.amazonaws.com/badges/fix-in-vscode-claude-flat.svg)](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=600da8db224345159a17bd9acd84ed37&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
   
   *(Use Cmd/Ctrl + Click for best experience)*
   <details>
   <summary><b>Prompt for AI Agent πŸ€– </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/models/helpers.py
   **Line:** 1478:1480
   **Comment:**
        *Api Mismatch: This uses `get_default_schema`, but the canonical SQL 
execution gate resolves schema with query-aware logic 
(`get_default_schema_for_query`) to account for dynamic-schema engines and 
URI/connect-arg schema rules. Using the static resolver here can produce a 
different effective schema than runtime, causing denylist checks to be 
inconsistent (false allow or false block) for unqualified table references. 
Align this path with the query-aware schema resolution used by SQL Lab/executor.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   Once fix is implemented, also check other comments on the same PR, and ask 
user if the user wants to fix the rest of the comments as well. if said yes, 
then fetch all the comments validate the correctness and implement a minimal fix
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F41120&comment_hash=4c6e68d68eee5bdcd2af67805e31f5676738be4da6ceb863458fa7d5ac15f869&reaction=like'>πŸ‘</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F41120&comment_hash=4c6e68d68eee5bdcd2af67805e31f5676738be4da6ceb863458fa7d5ac15f869&reaction=dislike'>πŸ‘Ž</a>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to