kaxil commented on code in PR #64173:
URL: https://github.com/apache/airflow/pull/64173#discussion_r2983533616
##########
providers/common/ai/tests/unit/common/ai/utils/test_sql_validation.py:
##########
@@ -159,3 +159,46 @@ def test_select_with_trailing_semicolon(self):
"""Trailing semicolon should not cause multi-statement error."""
result = validate_sql("SELECT 1;")
assert len(result) == 1
+
+
+class TestDataModifyingNodeDetection:
+ """Data-modifying operations hidden inside allowed statement types should
be blocked."""
Review Comment:
All the new tests call `validate_sql()` without `allowed_types`, which means
`allowed_types is None` and the deep scan runs. But `LLMSQLQueryOperator` calls
it with `allowed_types=DEFAULT_ALLOWED_TYPES` — adding a test for that call
pattern would have caught the bypass:
```python
def test_deep_scan_with_explicit_default_types(self):
with pytest.raises(SQLSafetyError, match="Data-modifying operation
'Delete'"):
validate_sql(
"WITH del AS (DELETE FROM users RETURNING *) SELECT * FROM del",
allowed_types=DEFAULT_ALLOWED_TYPES,
dialect="postgres",
)
```
##########
providers/common/ai/src/airflow/providers/common/ai/utils/sql_validation.py:
##########
@@ -97,4 +109,31 @@ def validate_sql(
f"Statement type '{type(stmt).__name__}' is not allowed.
Allowed types: {allowed_names}"
)
+ # Deep scan: reject data-modifying nodes hidden inside otherwise-allowed
statements
+ # (e.g. data-modifying CTEs, SELECT INTO). Only applies when using the
default
+ # read-only allowlist — callers who provide custom allowed_types have
explicitly
+ # opted into non-read-only operations.
+ if allowed_types is None:
Review Comment:
The deep scan only runs when `allowed_types is None`, but
`LLMSQLQueryOperator` always passes `allowed_types=self.allowed_sql_types`
which defaults to `DEFAULT_ALLOWED_TYPES`. Since that's not `None`, the deep
scan is skipped entirely for the operator path — the primary consumer of
LLM-generated SQL.
A fix would be to check whether the caller is using read-only defaults
rather than checking for `None`:
```python
if allowed_types is None or types is DEFAULT_ALLOWED_TYPES:
_check_for_data_modifying_nodes(parsed)
```
This way the deep scan runs both when no `allowed_types` is passed
(toolsets) and when `DEFAULT_ALLOWED_TYPES` is passed explicitly (operator).
##########
providers/common/ai/src/airflow/providers/common/ai/utils/sql_validation.py:
##########
@@ -39,6 +39,18 @@
exp.Except,
)
+# Denylist: expression types that mutate data or schema when found anywhere in
the AST.
+# This catches data-modifying CTEs (e.g. WITH del AS (DELETE …) SELECT …),
+# SELECT INTO, and other constructs that bypass top-level type checks.
+_DATA_MODIFYING_NODES: tuple[type[exp.Expr], ...] = (
+ exp.Insert,
+ exp.Update,
+ exp.Delete,
+ exp.Merge,
+ exp.Into,
Review Comment:
`exp.Command` is sqlglot's fallback for any syntax it doesn't recognize.
This makes the denylist fail-closed, which is good for security, but worth
calling out in the comment since it could block legitimate vendor-specific SQL
that sqlglot can't parse.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]