codeant-ai-for-open-source[bot] commented on code in PR #40448:
URL: https://github.com/apache/superset/pull/40448#discussion_r3311141544
##########
superset/mcp_service/utils/sanitization.py:
##########
@@ -212,6 +212,24 @@ def _strip_html_tags(value: str) -> str:
     return cleaned.replace("&amp;", "&")
 
 
+_DANGEROUS_URL_SCHEME_RE = re.compile(r"\b(javascript|vbscript|data):", re.IGNORECASE)
+
+
+def _check_dangerous_url_scheme(value: str, field_name: str) -> None:
+    """Raise if ``value`` contains a ``javascript:`` / ``vbscript:`` / ``data:``
+    URL scheme."""
+    if _DANGEROUS_URL_SCHEME_RE.search(value):
+        raise ValueError(f"{field_name} contains potentially malicious URL scheme")
Review Comment:
**Suggestion:** URL-scheme detection runs on the raw string without first
stripping zero-width/control Unicode, so payloads like `java\u200Bscript:` can
evade the regex and then be normalized into `javascript:` later in the
sanitization pipeline. Normalize with `_remove_dangerous_unicode` before
checking `_DANGEROUS_URL_SCHEME_RE` (or do the normalization inside
`_check_dangerous_url_scheme`) so obfuscated schemes are blocked reliably.
[security]
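A minimal sketch of the suggested ordering, normalizing before the regex check. `strip_invisible_chars` is a hypothetical stand-in for `_remove_dangerous_unicode` (whose actual implementation is not shown in this diff), and the un-prefixed function name is likewise illustrative:

```python
import re
import unicodedata

# Same pattern as in the diff under review.
_DANGEROUS_URL_SCHEME_RE = re.compile(
    r"\b(javascript|vbscript|data):", re.IGNORECASE
)


def strip_invisible_chars(value: str) -> str:
    """Hypothetical stand-in for ``_remove_dangerous_unicode``.

    Assumption: drops Unicode format (Cf) and control (Cc) characters,
    keeping ordinary tab, newline, and carriage return.
    """
    return "".join(
        ch
        for ch in value
        if ch in "\t\n\r" or unicodedata.category(ch) not in ("Cf", "Cc")
    )


def check_dangerous_url_scheme(value: str, field_name: str) -> None:
    """Raise if ``value`` contains a dangerous URL scheme after normalization."""
    # Normalize first so that zero-width characters cannot split the scheme.
    normalized = strip_invisible_chars(value)
    if _DANGEROUS_URL_SCHEME_RE.search(normalized):
        raise ValueError(f"{field_name} contains potentially malicious URL scheme")
```

With this ordering, `check_dangerous_url_scheme("java\u200Bscript:alert(1)", "Label")` raises `ValueError`, while a plain label such as `"Total revenue"` passes.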
<details>
<summary><b>Severity Level:</b> Critical 🚨</summary>
```mdx
- ❌ Sanitized chart labels can contain `javascript:` URL schemes.
- ❌ Dangerous URL text may be echoed in MCP/LLM responses.
- ⚠️ Obfuscated schemes in filter values bypass URL checks.
```
</details>
<details>
<summary><b>Steps of Reproduction ✅ </b></summary>
```mdx
1. Call the MCP chart creation tool `generate_chart` at `superset/mcp_service/chart/tool/generate_chart.py:95` with a config whose metric label contains an obfuscated URL scheme, e.g. `"label": "java\u200Bscript:alert(1)"` (zero-width space between `a` and `s`).
2. The request JSON is validated into a `GenerateChartRequest` instance by `SchemaValidator.validate_request()` at `superset/mcp_service/chart/validation/schema_validator.py:41-56`, which triggers Pydantic validators on `ColumnRef` fields in `superset/mcp_service/chart/schemas.py`.
3. During validation of the metric `label`, `ColumnRef.sanitize_label()` at `superset/mcp_service/chart/schemas.py:811-815` calls `sanitize_user_input(v, "Label", max_length=500, allow_empty=True)` from `superset/mcp_service/utils/sanitization.py:327-332`, which first strips HTML tags via `_strip_html_tags()` at `174-212` and then calls `_check_dangerous_patterns(value, field_name)` at `233-252`.
4. `_check_dangerous_patterns()` calls `_check_dangerous_url_scheme(value, field_name)` at `247`, which uses `_DANGEROUS_URL_SCHEME_RE = re.compile(r"\b(javascript|vbscript|data):", re.IGNORECASE)` at `215` to scan the raw string. Because the input is `java\u200Bscript:alert(1)`, the regex check at `218-222` does not match and no `ValueError` is raised. `sanitize_user_input` then removes the zero-width character via `_remove_dangerous_unicode()` at `283-37` and its call site `389`, returning the normalized string `javascript:alert(1)` as a "sanitized" label that can be echoed back in chart metadata and MCP/LLM responses with the dangerous scheme now visible.
```
</details>
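The ordering problem in step 4 can be reproduced in isolation. This is a simplified sketch, not the actual `sanitize_user_input` pipeline: `check_then_strip` and the `ZERO_WIDTH` table are hypothetical names that only mimic "check first, strip zero-width characters afterwards".

```python
import re

SCHEME_RE = re.compile(r"\b(javascript|vbscript|data):", re.IGNORECASE)

# Assumption: a small set of zero-width code points, mapped to None so that
# str.translate() deletes them.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))


def check_then_strip(value: str) -> str:
    """Mimic the flagged order: scheme check on raw input, then normalization."""
    if SCHEME_RE.search(value):
        raise ValueError("contains potentially malicious URL scheme")
    return value.translate(ZERO_WIDTH)


payload = "java\u200bscript:alert(1)"
result = check_then_strip(payload)  # no ValueError: the raw string has no match
print(result)  # javascript:alert(1)
```

The check passes on the raw payload, and the later normalization step produces exactly the string the check was meant to reject.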
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/mcp_service/utils/sanitization.py
**Line:** 218:222
**Comment:**
*Security: URL-scheme detection runs on the raw string without first
stripping zero-width/control Unicode, so payloads like `java\u200Bscript:` can
evade the regex and then be normalized into `javascript:` later in the
sanitization pipeline. Normalize with `_remove_dangerous_unicode` before
checking `_DANGEROUS_URL_SCHEME_RE` (or do the normalization inside
`_check_dangerous_url_scheme`) so obfuscated schemes are blocked reliably.
Validate the correctness of the flagged issue. If it is correct, how can I resolve
it? If you propose a fix, implement it and keep it concise.
Once the fix is implemented, also check the other comments on the same PR and ask
the user whether they want the rest fixed as well. If the user says yes, fetch all
the comments, validate their correctness, and implement a minimal fix for each.
```
</details>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]