codeant-ai-for-open-source[bot] commented on code in PR #40448:
URL: https://github.com/apache/superset/pull/40448#discussion_r3311141544


##########
superset/mcp_service/utils/sanitization.py:
##########
@@ -212,6 +212,24 @@ def _strip_html_tags(value: str) -> str:
     return cleaned.replace("&amp;", "&")
 
 
+_DANGEROUS_URL_SCHEME_RE = re.compile(r"\b(javascript|vbscript|data):", re.IGNORECASE)
+
+
+def _check_dangerous_url_scheme(value: str, field_name: str) -> None:
+    """Raise if ``value`` contains a ``javascript:`` / ``vbscript:`` / ``data:``
+    URL scheme."""
+    if _DANGEROUS_URL_SCHEME_RE.search(value):
+        raise ValueError(f"{field_name} contains potentially malicious URL scheme")

Review Comment:
   **Suggestion:** URL-scheme detection runs on the raw string without first 
stripping zero-width/control Unicode, so payloads like `java\u200Bscript:` can 
evade the regex and then be normalized into `javascript:` later in the 
sanitization pipeline. Normalize with `_remove_dangerous_unicode` before 
checking `_DANGEROUS_URL_SCHEME_RE` (or do the normalization inside 
`_check_dangerous_url_scheme`) so obfuscated schemes are blocked reliably. 
[security]
   
   <details>
   <summary><b>Severity Level:</b> Critical 🚨</summary>
   
   ```mdx
   - ❌ Sanitized chart labels can contain `javascript:` URL schemes.
   - ❌ Dangerous URL text may be echoed in MCP/LLM responses.
   - ⚠️ Obfuscated schemes in filter values bypass URL checks.
   ```
   </details>
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Call the MCP chart creation tool `generate_chart` at
   `superset/mcp_service/chart/tool/generate_chart.py:95` with a config whose metric label
   contains an obfuscated URL scheme, e.g. `"label": "java\u200Bscript:alert(1)"` (zero-width
   space between `a` and `s`).
   
   2. The request JSON is validated into a `GenerateChartRequest` instance by
   `SchemaValidator.validate_request()` at
   `superset/mcp_service/chart/validation/schema_validator.py:41-56`, which triggers Pydantic
   validators on `ColumnRef` fields in `superset/mcp_service/chart/schemas.py`.
   
   3. During validation of the metric `label`, `ColumnRef.sanitize_label()` at
   `superset/mcp_service/chart/schemas.py:811-815` calls `sanitize_user_input(v, "Label",
   max_length=500, allow_empty=True)` from
   `superset/mcp_service/utils/sanitization.py:327-332`. That function first strips HTML tags
   via `_strip_html_tags()` at `174-212` and then calls `_check_dangerous_patterns(value,
   field_name)` at `233-252`.
   
   4. `_check_dangerous_patterns()` calls `_check_dangerous_url_scheme(value, field_name)` at
   `247`, which uses `_DANGEROUS_URL_SCHEME_RE = re.compile(r"\b(javascript|vbscript|data):",
   re.IGNORECASE)` at `215` to scan the raw string. Because the input is
   `java\u200Bscript:alert(1)`, the regex check at `218-222` does not match and no `ValueError`
   is raised. `sanitize_user_input` then removes the zero-width character via
   `_remove_dangerous_unicode()` at `283-37` and its call site `389`, returning the
   normalized string `javascript:alert(1)` as a "sanitized" label. That label can be echoed
   back in chart metadata and MCP/LLM responses with the dangerous scheme now visible.
   ```
   </details>
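
   The ordering problem in step 4 can be illustrated in isolation. This is a sketch under assumptions: the regex is copied from the diff, while `str.replace` stands in for the later `_remove_dangerous_unicode()` pass, whose real implementation is not shown here.

```python
import re

# Regex copied from the diff under review.
SCHEME_RE = re.compile(r"\b(javascript|vbscript|data):", re.IGNORECASE)

# Payload from the reproduction steps: zero-width space inside "javascript".
payload = "java\u200bscript:alert(1)"

# Check on the raw string, as the current pipeline does.
raw_match = SCHEME_RE.search(payload)

# Assumed stand-in for the later normalization step.
normalized = payload.replace("\u200b", "")

# The same check on the normalized string.
normalized_match = SCHEME_RE.search(normalized)

print("raw string flagged:       ", raw_match is not None)
print("normalized string flagged:", normalized_match is not None)
```

   The raw string passes the check, yet the string that leaves the pipeline would have failed it, which is why the check needs to run on the normalized value.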
   
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/mcp_service/utils/sanitization.py
   **Line:** 218:222
   **Comment:**
        *Security: URL-scheme detection runs on the raw string without first 
stripping zero-width/control Unicode, so payloads like `java\u200Bscript:` can 
evade the regex and then be normalized into `javascript:` later in the 
sanitization pipeline. Normalize with `_remove_dangerous_unicode` before 
checking `_DANGEROUS_URL_SCHEME_RE` (or do the normalization inside 
`_check_dangerous_url_scheme`) so obfuscated schemes are blocked reliably.
   
   Validate the correctness of the flagged issue. If it is correct, how can I resolve 
this? If you propose a fix, implement it and keep it concise.
   Once the fix is implemented, also check the other comments on the same PR and ask 
the user whether they want the rest of the comments fixed as well. If they say yes, 
fetch all the comments, validate their correctness, and implement a minimal fix.
   ```
   </details>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

