codeant-ai-for-open-source[bot] commented on code in PR #39301:
URL: https://github.com/apache/superset/pull/39301#discussion_r3408325824


##########
superset/commands/dataset/importers/v1/utils.py:
##########
@@ -88,12 +117,33 @@ def get_dtype(df: pd.DataFrame, dataset: SqlaTable) -> 
dict[str, VisitableType]:
 
 def validate_data_uri(data_uri: str) -> None:
     """
-    Validate that the data URI is configured on DATASET_IMPORT_ALLOWED_URLS
-    has a valid URL.
+    Validate that the data URI is permitted for dataset import.
+
+    Local ``file://`` URIs are allowed only when the path is confined to the
+    bundled examples folder.  All other URIs must match a pattern in
+    ``DATASET_IMPORT_ALLOWED_DATA_URLS`` *and* resolve to a publicly-routable 
host.
 
-    :param data_uri:
-    :return:
+    :param data_uri: the URI to validate
+    :raises DatasetForbiddenDataURI: if the URI is not permitted
     """
+    if data_uri.startswith("file://"):

Review Comment:
   **Suggestion:** The file-scheme gate is case-sensitive, so mixed-case 
variants like `FiLe://...` bypass the `file://` sandbox check. When 
`DATASET_IMPORT_ALLOW_INTERNAL_DATA_URLS` is enabled and the allowlist regex 
matches broadly, this can allow arbitrary local file reads outside the examples 
directory. Parse the URI scheme via `urlparse(data_uri).scheme.lower()` and 
apply the examples-folder restriction to any `file` scheme regardless of input 
casing. [security]
   
   <details>
   <summary><b>Severity Level:</b> Critical 🚨</summary>
   
   ```mdx
   - ❌ Dataset import can read arbitrary local files via redirects.
   - ❌ Alerts/reports using imported datasets may expose host files.
   - ⚠️ Air-gapped deployments enabling internal URLs become exploitable.
   - ⚠️ Validation logic contradicts file-URI safety comment in config.
   ```
   </details>
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. In configuration, enable internal data URLs by setting
   `DATASET_IMPORT_ALLOW_INTERNAL_DATA_URLS = True` in 
`superset/config.py:10-14` and keep
   the default broad allowlist `DATASET_IMPORT_ALLOWED_DATA_URLS = [r".*"]` as 
defined in
   `superset/config.py:1-8`.
   
   2. Trigger a v1 dataset import that carries a `data` URL, e.g. via the 
dashboard/dataset
   import flow where `import_dataset(config, overwrite=overwrite)` is invoked in
   `superset/commands/dataset/importers/v1/__init__.py:10-18`; 
`import_dataset()` then
   assigns `data_uri = config.get("data")` and calls `load_data(data_uri, 
dataset,
   dataset.database)` when `data_uri` is present (see
   `superset/commands/dataset/importers/v1/utils.py:248-286`).
   
   3. In `load_data()` 
(`superset/commands/dataset/importers/v1/utils.py:327-343`), an
   HTTP(S) source is opened with an `urllib` opener that installs
   `_ValidatingRedirectHandler` (`utils.py:51-72`); when the remote server 
responds with a
   redirect whose `Location` is a mixed-case file URI such as 
`FiLe:///etc/passwd`,
   `_ValidatingRedirectHandler.redirect_request()` builds `absolute_url` and 
calls
   `validate_data_uri(absolute_url)` at `utils.py:70-71`.
   
   4. Inside `validate_data_uri()` (`utils.py:118-164`), the mixed-case 
`FiLe://` URI does
   not satisfy the case-sensitive `if data_uri.startswith("file://"):` check at 
line 129, so
   the examples-folder restriction and `DatasetForbiddenDataURI` checks for 
file URIs are
   skipped; the URI instead matches the broad 
`DATASET_IMPORT_ALLOWED_DATA_URLS` regex at
   `utils.py:147-156`, and because `DATASET_IMPORT_ALLOW_INTERNAL_DATA_URLS` is 
`True` at
   `utils.py:157-163`, no hostname validation is applied to the file URI, 
causing the
   redirect target `FiLe:///etc/passwd` to be treated as allowed and 
subsequently opened by
   `opener.open(data_uri)` in `load_data()` (line 342), enabling reads of 
arbitrary local
   files outside the examples directory.
   ```
   </details>
   
   [Fix in 
Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=81358197fbcd4b63ae3d0031c65a8f71&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
 | [Fix in VSCode 
Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=81358197fbcd4b63ae3d0031c65a8f71&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
   
   *(Use Cmd/Ctrl + Click for best experience)*
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/commands/dataset/importers/v1/utils.py
   **Line:** 129:129
   **Comment:**
        *Security: The file-scheme gate is case-sensitive, so mixed-case 
variants like `FiLe://...` bypass the `file://` sandbox check. When 
`DATASET_IMPORT_ALLOW_INTERNAL_DATA_URLS` is enabled and the allowlist regex 
matches broadly, this can allow arbitrary local file reads outside the examples 
directory. Parse the URI scheme via `urlparse(data_uri).scheme.lower()` and 
apply the examples-folder restriction to any `file` scheme regardless of input 
casing.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   Once fix is implemented, also check other comments on the same PR, and ask 
user if the user wants to fix the rest of the comments as well. if said yes, 
then fetch all the comments validate the correctness and implement a minimal fix
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F39301&comment_hash=2133958eaf2579ada421bd7e5bda4ea72e024c7c0745ef7865b1fdbed10b99da&reaction=like'>👍</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F39301&comment_hash=2133958eaf2579ada421bd7e5bda4ea72e024c7c0745ef7865b1fdbed10b99da&reaction=dislike'>👎</a>



##########
superset/reports/notifications/webhook.py:
##########
@@ -121,9 +151,17 @@ def send(self) -> None:
                     else:
                         data[key] = value
 
-                response = requests.post(wh_url, data=data, files=files, 
timeout=60)
+                response = requests.post(
+                    wh_url,
+                    data=data,
+                    files=files,
+                    timeout=60,
+                    allow_redirects=False,
+                )

Review Comment:
   **Suggestion:** Redirects are now disabled, but the response handling still 
treats 3xx statuses as success because only `>=400` is considered failure. This 
can silently drop notifications when webhook endpoints return 301/302/307, 
since no request is sent to the final target while `send()` logs success. 
Either add explicit 3xx failure handling (and validate `Location` safely) or 
restore safe redirect-follow behavior with host revalidation. [logic error]
   
   <details>
   <summary><b>Severity Level:</b> Major ⚠️</summary>
   
   ```mdx
   - ❌ Webhook-based alerts can be silently dropped on HTTP 3xx.
   - ⚠️ Report scheduler logs success for undelivered webhook notifications.
   - ⚠️ Operators cannot reliably detect misconfigured webhook endpoints.
   - ⚠️ Backoff/retry logic never triggers for persistent 3xx responses.
   ```
   </details>
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Configure and enable alerts/reports webhooks so that
   `feature_flag_manager.is_feature_enabled("ALERT_REPORT_WEBHOOK")` returns 
True (checked in
   `WebhookNotification.send()` at 
`superset/reports/notifications/webhook.py:135-139`), and
   create a report schedule with a webhook recipient (type 
`ReportRecipientType.WEBHOOK`)
   targeting an endpoint that responds with HTTP 302 or 307.
   
   2. When the report executes, `send_notification_to_recipients()` in
   `superset/commands/report/execute.py:1-23` iterates over `recipients`, calls
   `create_notification(recipient, notification_content)` from
   `superset/reports/notifications/__init__.py:25-34`, and for a webhook 
recipient
   instantiates `WebhookNotification` and invokes `notification.send()` at
   `execute.py:22-23`.
   
   3. Inside `WebhookNotification.send()`
   (`superset/reports/notifications/webhook.py:134-179`), after 
`_validate_webhook_url()`
   passes, the method issues an HTTP request using `requests.post(wh_url, 
data=data,
   files=files, timeout=60, allow_redirects=False)` for payloads with files 
(lines 154-160)
   or `requests.post(wh_url, json=payload, timeout=60, allow_redirects=False)` 
otherwise
   (lines 161-164); if the endpoint returns a redirect (e.g. 302), `requests` 
returns a
   response with `status_code=302` and does not follow the redirect.
   
   4. The subsequent response handling at `webhook.py:166-179` logs `"Webhook 
sent to %s,
   status code: %s"` regardless of 3xx, then raises only on 
`response.status_code >= 500 or
   == 429` (transient) and `response.status_code >= 400` (parameter error); 
because 3xx codes
   are `< 400`, no exception is thrown, `send()` returns successfully to
   `send_notification_to_recipients()`, and the scheduler treats the 
notification as
   delivered even though no request was sent to the ultimate target URL and the 
webhook may
   never be processed.
   ```
   </details>
   
   [Fix in 
Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=96b0f3b71bd1488c9c81b9c98e577d0a&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
 | [Fix in VSCode 
Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=96b0f3b71bd1488c9c81b9c98e577d0a&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
   
   *(Use Cmd/Ctrl + Click for best experience)*
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/reports/notifications/webhook.py
   **Line:** 154:160
   **Comment:**
        *Logic Error: Redirects are now disabled, but the response handling 
still treats 3xx statuses as success because only `>=400` is considered 
failure. This can silently drop notifications when webhook endpoints return 
301/302/307, since no request is sent to the final target while `send()` logs 
success. Either add explicit 3xx failure handling (and validate `Location` 
safely) or restore safe redirect-follow behavior with host revalidation.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   Once fix is implemented, also check other comments on the same PR, and ask 
user if the user wants to fix the rest of the comments as well. if said yes, 
then fetch all the comments validate the correctness and implement a minimal fix
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F39301&comment_hash=9ff1953c37e85e73416b9d8e96df89df115b2b1f8d465d63a1275f5980614e1a&reaction=like'>👍</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F39301&comment_hash=9ff1953c37e85e73416b9d8e96df89df115b2b1f8d465d63a1275f5980614e1a&reaction=dislike'>👎</a>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to