codeant-ai-for-open-source[bot] commented on code in PR #40746:
URL: https://github.com/apache/superset/pull/40746#discussion_r3395911788
##########
superset/mcp_service/user/schemas.py:
##########
@@ -104,6 +107,31 @@ class UserInfo(BaseModel):
"access via get_user_info; not available in list_users because roles "
"is a relationship, not a selectable column)",
)
+
+ @field_validator("roles", mode="before")
+ @classmethod
+ def _extract_role_names(cls, v: Any) -> list[str] | None:
+ """Coerce Role ORM objects to their .name strings."""
+ if v is None:
+ return None
+ if isinstance(v, str):
+ # Preserve Pydantic's default rejection of bare strings for
list[str].
+ raise ValueError("roles must be a list, not a string")
+ result: list[str] = []
+ for item in v:
+ if isinstance(item, str):
+ result.append(item)
+ continue
+ try:
+ if hasattr(item, "name") and isinstance(item.name, str):
+ result.append(item.name)
Review Comment:
**Suggestion:** Role names are now emitted in MCP responses, but they are
passed through unchanged and are not sanitized/escaped for LLM context like
other user-facing text fields in this serializer. A crafted role name can
inject control delimiters or prompt text into downstream LLM context;
sanitize/escape each role name before adding it to the output. [security]
<details>
<summary><b>Severity Level:</b> Major ⚠️</summary>
```mdx
- ⚠️ get_user_info MCP tool exposes unsanitized role names.
- ⚠️ list_users MCP tool returns raw role names to LLM clients.
- ⚠️ Role metadata can carry unwrapped, instruction-like text.
```
</details>
<details>
<summary><b>Steps of Reproduction ✅ </b></summary>
```mdx
1. In `superset/mcp_service/user/schemas.py:94-109`, the `UserInfo` schema
defines `roles:
list[str] | None`, and the `_extract_role_names` validator at
`superset/mcp_service/user/schemas.py:111-133` converts ORM Role objects
into plain
strings but does not apply `escape_llm_context_delimiters` or
`sanitize_for_llm_context`
to those strings.
2. The serializer `serialize_user_object` in
`superset/mcp_service/user/schemas.py:32-68`
builds the UserInfo instance for MCP tools: it sanitizes `username` using
`escape_llm_context_delimiters` and wraps `first_name` and `last_name` with
`sanitize_for_llm_context` (lines 55-61), but it passes `roles` through
unchanged as the
list of `r.name` values collected at lines 45-52.
3. The MCP tool `get_user_info`
(`superset/mcp_service/user/tool/get_user_info.py:24-27,52-60`) calls this
serializer via
`_serializer` and returns the resulting `UserInfo` to the MCP client;
similarly,
`list_users`
(`superset/mcp_service/user/tool/list_users.py:12-19,21-33,35-65`) uses
`serialize_user_object` for each row and then emits the response via
`result.model_dump(mode="json", context=...)`, exposing `UserInfo.roles`
directly in tool
JSON consumed by LLM agents.
4. Given the sanitization utilities in
`superset/mcp_service/utils/sanitization.py:38-41,80-93,98-181` are
explicitly designed to
wrap and/or escape user-controlled strings before placing them in LLM
context, the fact
that role names bypass these utilities means that a role whose `name`
contains prompt-like
content or custom delimiters (e.g. long instructions or tokens intended to
look like
`<UNTRUSTED-CONTENT>` blocks) will be surfaced verbatim in MCP tool JSON and
into
downstream LLM prompts, unlike other user-facing fields on UserInfo that are
sanitized.
```
</details>
[Fix in
Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=f32c233ec9b24a84b2f65487bff36fae&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
| [Fix in VSCode
Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=f32c233ec9b24a84b2f65487bff36fae&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
*(Use Cmd/Ctrl + Click for best experience)*
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/mcp_service/user/schemas.py
**Line:** 122:127
**Comment:**
*Security: Role names are now emitted in MCP responses, but they are
passed through unchanged and are not sanitized/escaped for LLM context like
other user-facing text fields in this serializer. A crafted role name can
inject control delimiters or prompt text into downstream LLM context;
sanitize/escape each role name before adding it to the output.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask
user if the user wants to fix the rest of the comments as well. if said yes,
then fetch all the comments validate the correctness and implement a minimal fix
```
</details>
<a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40746&comment_hash=6b0d64fc75d5a1e7ceac1f14f066696978dae81c9ae5ceb42da69bb5b270f458&reaction=like'>👍</a>
| <a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40746&comment_hash=6b0d64fc75d5a1e7ceac1f14f066696978dae81c9ae5ceb42da69bb5b270f458&reaction=dislike'>👎</a>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]