codeant-ai-for-open-source[bot] commented on code in PR #41120:
URL: https://github.com/apache/superset/pull/41120#discussion_r3425030894
##########
superset/config.py:
##########
@@ -1872,6 +1872,21 @@ def engine_context_manager( # pylint:
disable=unused-argument
"pg_read_file",
"pg_ls_dir",
"pg_read_binary_file",
+ # PostgreSQL large-object functions: writers can plant arbitrary
+ # bytes on the server filesystem (lo_export, lo_from_bytea, lowrite,
+ # lo_put, lo_create, lo_creat, lo_import), readers can pull bytes back
+ # out (lo_get, loread), and lo_unlink deletes large objects outright.
+ # Defense-in-depth on top of is_mutating()'s function-name check.
+ "lo_from_bytea",
+ "lo_export",
+ "lo_import",
+ "lo_put",
+ "lo_create",
+ "lo_creat",
+ "lowrite",
+ "lo_get",
+ "loread",
+ "lo_unlink",
Review Comment:
**Suggestion:** The new PostgreSQL large-object denylist is incomplete:
mutating functions like `lo_truncate`/`lo_truncate64` are not included, so a
user can still alter large objects via `SELECT lo_truncate(...)`. Because these
calls are plain function invocations and are not covered by the current
mutating-function detector, this creates a read-only/DML bypass path unless
they are explicitly denylisted too. [incomplete implementation]
<details>
<summary><b>Severity Level:</b> Critical π¨</summary>
```mdx
- β Read-only SQL Lab queries can still truncate large objects.
- β DML guard `database.allow_dml` bypassable via `lo_truncate` calls.
- β οΈ Mutation detection misclassifies large-object truncation as read-only.
```
</details>
<details>
<summary><b>Steps of Reproduction β
</b></summary>
```mdx
1. Configure a PostgreSQL database in Superset with engine name
`"postgresql"` and default
`DISALLOWED_SQL_FUNCTIONS` from `superset/config.py:1840-1919`, then set
`database.allow_dml=False` for that connection (read-only).
2. In SQL Lab, submit `SELECT lo_truncate(12345, 0);` against that database;
the request
flows through `superset/sql_lab.py:400-459`, which builds `parsed_script =
SQLScript(rendered_query, engine=db_engine_spec.engine)` and loads
`disallowed_functions =
app.config["DISALLOWED_SQL_FUNCTIONS"].get(db_engine_spec.engine, set())`
followed by
`parsed_script.check_functions_present(disallowed_functions)`.
3. Because `lo_truncate`/`lo_truncate64` are not listed in the PostgreSQL
block at
`superset/config.py:1875-1890`, `parsed_script.check_functions_present`
returns False and
the disallowed-function gate in `sql_lab.py:424-431` does not fire; the code
then calls
`parsed_script.has_mutation()` (implemented in
`superset/sql/parse.py:1748-1754`) to
enforce `if parsed_script.has_mutation() and not database.allow_dml: raise
SupersetDMLNotAllowedException()`.
4. Inside `SQLStatement.is_mutating` in `superset/sql/parse.py:620-939`,
only function
names in `_MUTATING_FUNCTION_NAMES` (containing `LO_FROM_BYTEA`,
`LO_EXPORT`, `LO_IMPORT`,
`LO_PUT`, `LO_CREATE`, `LOWRITE`, `LO_UNLINK`, `SETVAL`, `NEXTVAL`) are
treated as
mutating exp.Anonymous calls; `LO_TRUNCATE`/`LO_TRUNCATE64` are missing, so
`is_mutating()` and thus `has_mutation()` both return False, allowing `SELECT
lo_truncate(...)` to execute and truncate a large object even though
`database.allow_dml=False`, demonstrating a read-only/DML bypass via an
un-denylisted
mutating large-object function.
```
</details>
[](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=bd8f926f341543459dfab70d7d17739b&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
[](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=bd8f926f341543459dfab70d7d17739b&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
*(Use Cmd/Ctrl + Click for best experience)*
<details>
<summary><b>Prompt for AI Agent π€ </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/config.py
**Line:** 1875:1889
**Comment:**
*Incomplete Implementation: The new PostgreSQL large-object denylist is
incomplete: mutating functions like `lo_truncate`/`lo_truncate64` are not
included, so a user can still alter large objects via `SELECT
lo_truncate(...)`. Because these calls are plain function invocations and are
not covered by the current mutating-function detector, this creates a
read-only/DML bypass path unless they are explicitly denylisted too.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask
user if the user wants to fix the rest of the comments as well. if said yes,
then fetch all the comments validate the correctness and implement a minimal fix
```
</details>
<a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F41120&comment_hash=2940d9d0ce6a1d8ae5b45b9113de291f2d4cd4ede64fd66ca1e64bd50949c6d4&reaction=like'>π</a>
| <a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F41120&comment_hash=2940d9d0ce6a1d8ae5b45b9113de291f2d4cd4ede64fd66ca1e64bd50949c6d4&reaction=dislike'>π</a>
##########
superset/models/helpers.py:
##########
@@ -1221,24 +1221,25 @@ def _process_sql_expression( # pylint:
disable=too-many-arguments
disallowed_functions
):
raise
SupersetDisallowedSQLFunctionException(disallowed_functions)
- if disallowed_tables and
parsed.check_tables_present(disallowed_tables):
+ if disallowed_tables:
# Report only the tables actually found in the expression,
# mirroring the canonical execution-time gate in
# `superset.sql_lab._validate_query` so the user-facing
- # error doesn't echo the operator's full denylist.
- present_tables = {
- table.table.lower()
- for statement in parsed.statements
- for table in statement.tables
- }
- found_tables = {
- table
- for table in disallowed_tables
- if table.lower() in present_tables
- }
- raise SupersetDisallowedSQLTableException(
- found_tables or disallowed_tables
+ # error doesn't echo the operator's full denylist. Honors
+ # schema-qualified denylist entries (e.g.
+ # ``information_schema.tables``) and resolves unqualified
+ # references against the effective schema, falling back to
+ # the database default when none was supplied (e.g. a
+ # virtual dataset with no schema) so a qualified entry
+ # cannot be bypassed by an unqualified reference.
+ effective_schema = schema or
self.database.get_default_schema(
+ self.catalog
)
Review Comment:
**Suggestion:** This new fallback performs a database-inspector lookup for
every adhoc expression when `schema` is empty, and `_process_sql_expression` is
called inside loops for selected columns/metrics/order-bys. That creates an N+1
pattern of extra DB round-trips in chart queries (and can be especially
expensive on remote/OAuth-backed databases). Resolve the effective schema once
per query build and reuse it across expression validations. [performance]
<details>
<summary><b>Severity Level:</b> Major β οΈ</summary>
```mdx
- β οΈ Explore queries on schema-less datasets incur extra inspector calls.
- β οΈ SQL Lab adhoc columns on schema-less queries add metadata trips.
- β οΈ Latency increases for remote or OAuth-backed databases during chart
build.
```
</details>
<details>
<summary><b>Steps of Reproduction β
</b></summary>
```mdx
1. Configure a database engine with DISALLOWED_SQL_TABLES set for its engine
name so the
table denylist branch in `_process_sql_expression` executes (see
`superset/models/helpers.py:1199-1208` where `disallowed_tables` is read
from `app.config`
and gated).
2. Use a SQLAlchemy dataset (`SqlaTable`) with `schema` left NULL (schema is
explicitly
optional per the comment and nullable column at
`superset/connectors/sqla/models.py:1322-1327`), e.g. a virtual dataset where
`SqlaTable.sql` is populated and `SqlaTable.schema` is None.
3. Build a chart on this dataset with multiple adhoc SQL expressions
(metrics, groupby,
and/or ORDER BY), triggering per-expression calls to
`_process_select_expression` and
`_process_orderby_expression` that forward to `_process_sql_expression` (see
`superset/models/helpers.py:1245-1268`,
`superset/models/helpers.py:1279-1293`, and call
sites in `superset/connectors/sqla/models.py:1648-1687`,
`superset/models/helpers.py:2440-2479`,
`superset/models/helpers.py:3220-3233`, and
`superset/models/helpers.py:3140-3179`).
4. For each such expression where `schema` is empty,
`_process_sql_expression` computes
`effective_schema = schema or self.database.get_default_schema(self.catalog)`
(`superset/models/helpers.py:1235-1237`), which delegates to
`Database.get_default_schema`
and then `BaseEngineSpec.get_default_schema`
(`superset/models/core.py:657-661`,
`superset/db_engine_specs/base.py:926-931`); the latter opens a fresh
inspector via
`database.get_inspector(...)` on every call, creating an N+1 pattern of
metadata
round-trips proportional to the number of adhoc expressions in the chart.
```
</details>
[](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=922deeefa3e149f2bb2c93a163300f0a&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
[](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=922deeefa3e149f2bb2c93a163300f0a&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
*(Use Cmd/Ctrl + Click for best experience)*
<details>
<summary><b>Prompt for AI Agent π€ </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/models/helpers.py
**Line:** 1235:1237
**Comment:**
*Performance: This new fallback performs a database-inspector lookup
for every adhoc expression when `schema` is empty, and
`_process_sql_expression` is called inside loops for selected
columns/metrics/order-bys. That creates an N+1 pattern of extra DB round-trips
in chart queries (and can be especially expensive on remote/OAuth-backed
databases). Resolve the effective schema once per query build and reuse it
across expression validations.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask
user if the user wants to fix the rest of the comments as well. if said yes,
then fetch all the comments validate the correctness and implement a minimal fix
```
</details>
<a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F41120&comment_hash=cca2dff418b6af7b3a5ab2b6d5d8b967fba609608d7b97ff20bff13fed678488&reaction=like'>π</a>
| <a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F41120&comment_hash=cca2dff418b6af7b3a5ab2b6d5d8b967fba609608d7b97ff20bff13fed678488&reaction=dislike'>π</a>
##########
superset/models/helpers.py:
##########
@@ -1465,19 +1466,23 @@ def _raise_for_disallowed_sql(self, sql: str) -> None:
disallowed_functions
):
raise SupersetDisallowedSQLFunctionException(disallowed_functions)
- if disallowed_tables and
parsed_script.check_tables_present(disallowed_tables):
+ if disallowed_tables:
# Report only the tables actually found in the query, mirroring the
# canonical execution-time gate so the user-facing error doesn't
- # echo the operator's full denylist.
- present_tables = {
- table.table.lower()
- for statement in parsed_script.statements
- for table in statement.tables
- }
- found_tables = {
- table for table in disallowed_tables if table.lower() in
present_tables
- }
- raise SupersetDisallowedSQLTableException(found_tables or
disallowed_tables)
+ # echo the operator's full denylist. Honors schema-qualified
+ # denylist entries (e.g. ``information_schema.tables``) and
resolves
+ # unqualified references against the effective schema, falling back
+ # to the database default when the datasource has none (e.g. a
+ # virtual dataset) so a qualified entry cannot be bypassed by an
+ # unqualified reference.
+ effective_schema = self.schema or self.database.get_default_schema(
+ self.catalog
+ )
Review Comment:
**Suggestion:** This uses `get_default_schema`, but the canonical SQL
execution gate resolves schema with query-aware logic
(`get_default_schema_for_query`) to account for dynamic-schema engines and
URI/connect-arg schema rules. Using the static resolver here can produce a
different effective schema than runtime, causing denylist checks to be
inconsistent (false allow or false block) for unqualified table references.
Align this path with the query-aware schema resolution used by SQL
Lab/executor. [api mismatch]
<details>
<summary><b>Severity Level:</b> Critical π¨</summary>
```mdx
- β Explore queries may bypass DISALLOWED_SQL_TABLES under dynamic schemas.
- β SQL Lab and Explore enforce table denylist inconsistently.
- β οΈ Operators cannot rely on uniform denylist behavior across surfaces.
```
</details>
<details>
<summary><b>Steps of Reproduction β
</b></summary>
```mdx
1. Note that SQL Lab and the SQL executor resolve the effective schema for
denylist checks
through the query-aware path: `sql_lab.execute_sql_statements` and the
estimate command
call `database.get_default_schema_for_query(query)` once and reuse it (see
`superset/sql_lab.py:420-479` and
`superset/commands/sql_lab/estimate.py:90-129`), while
the executor resolves it via `_resolve_query_schema` which builds a
transient `Query` and
calls `Database.get_default_schema_for_query` (see
`superset/sql/execution/executor.py:714-753`).
2. Observe that `Database.get_default_schema_for_query` ultimately uses
`BaseEngineSpec.get_default_schema_for_query`
(`superset/models/core.py:663-681`,
`superset/db_engine_specs/base.py:945-1024`), which is explicitly designed
to honor
dynamic-schema engines and URI/connect-arg configured schemas (e.g. Presto,
Snowflake,
StarRocks override `get_schema_from_engine_params` at
`superset/db_engine_specs/presto.py:356-395`,
`superset/db_engine_specs/snowflake.py:235-243`,
`superset/db_engine_specs/starrocks.py:296-304`) and only falls back to
`get_default_schema` as a last resort.
3. In contrast, the chart/Explore path's gate `_raise_for_disallowed_sql`
(called for
every nonβSQL Lab query via `SqlaTable.query` at
`superset/models/helpers.py:189-201`)
computes `effective_schema = self.schema or
self.database.get_default_schema(self.catalog)`
(`superset/models/helpers.py:1478-1480`),
which bypasses `get_default_schema_for_query` and directly invokes
`BaseEngineSpec.get_default_schema` (inspector-based,
`superset/db_engine_specs/base.py:926-931`), ignoring per-query schema
resolution and any
engine-specific URI/connect-arg rules used in the canonical execution gate.
4. On an engine where `get_default_schema_for_query` can differ from
`get_default_schema`βfor example, a Presto connection whose schema is
derived from the
SQLAlchemy URI via `PrestoEngineSpec.get_schema_from_engine_params`
(`superset/db_engine_specs/presto.py:356-395`)βrun a query containing an
unqualified
reference to a denylisted table (DISALLOWED_SQL_TABLES read in
`_raise_for_disallowed_sql`
at `superset/models/helpers.py:1458-1462` and enforced via
`parsed_script.get_disallowed_tables(..., effective_schema)` at
`superset/models/helpers.py:1481-1483`). The SQL Lab path will resolve the
effective
schema via `get_default_schema_for_query` and may correctly block it, while
the Explore
path uses `get_default_schema` and can compute a different
`effective_schema`, causing the
same unqualified reference to be either falsely allowed or falsely blocked,
violating the
"mirror the DISALLOWED_SQL_* gate" contract documented in
`_raise_for_disallowed_sql`'s
docstring (`superset/models/helpers.py:1453-1457`).
```
</details>
[](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=600da8db224345159a17bd9acd84ed37&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
[](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=600da8db224345159a17bd9acd84ed37&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
*(Use Cmd/Ctrl + Click for best experience)*
<details>
<summary><b>Prompt for AI Agent π€ </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/models/helpers.py
**Line:** 1478:1480
**Comment:**
*Api Mismatch: This uses `get_default_schema`, but the canonical SQL
execution gate resolves schema with query-aware logic
(`get_default_schema_for_query`) to account for dynamic-schema engines and
URI/connect-arg schema rules. Using the static resolver here can produce a
different effective schema than runtime, causing denylist checks to be
inconsistent (false allow or false block) for unqualified table references.
Align this path with the query-aware schema resolution used by SQL Lab/executor.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask
user if the user wants to fix the rest of the comments as well. if said yes,
then fetch all the comments validate the correctness and implement a minimal fix
```
</details>
<a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F41120&comment_hash=4c6e68d68eee5bdcd2af67805e31f5676738be4da6ceb863458fa7d5ac15f869&reaction=like'>π</a>
| <a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F41120&comment_hash=4c6e68d68eee5bdcd2af67805e31f5676738be4da6ceb863458fa7d5ac15f869&reaction=dislike'>π</a>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]