codeant-ai-for-open-source[bot] commented on code in PR #40746:
URL: https://github.com/apache/superset/pull/40746#discussion_r3395911788


##########
superset/mcp_service/user/schemas.py:
##########
@@ -104,6 +107,31 @@ class UserInfo(BaseModel):
         "access via get_user_info; not available in list_users because roles "
         "is a relationship, not a selectable column)",
     )
+
+    @field_validator("roles", mode="before")
+    @classmethod
+    def _extract_role_names(cls, v: Any) -> list[str] | None:
+        """Coerce Role ORM objects to their .name strings."""
+        if v is None:
+            return None
+        if isinstance(v, str):
+            # Preserve Pydantic's default rejection of bare strings for 
list[str].
+            raise ValueError("roles must be a list, not a string")
+        result: list[str] = []
+        for item in v:
+            if isinstance(item, str):
+                result.append(item)
+                continue
+            try:
+                if hasattr(item, "name") and isinstance(item.name, str):
+                    result.append(item.name)

Review Comment:
   **Suggestion:** Role names are now emitted in MCP responses, but they are 
passed through unchanged and are not sanitized/escaped for LLM context like 
other user-facing text fields in this serializer. A crafted role name can 
inject control delimiters or prompt text into downstream LLM context; 
sanitize/escape each role name before adding it to the output. [security]
   
   <details>
   <summary><b>Severity Level:</b> Major ⚠️</summary>
   
   ```mdx
   - ⚠️ get_user_info MCP tool exposes unsanitized role names.
   - ⚠️ list_users MCP tool returns raw role names to LLM clients.
   - ⚠️ Role metadata can carry unwrapped, instruction-like text.
   ```
   </details>
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. In `superset/mcp_service/user/schemas.py:94-109`, the `UserInfo` schema 
defines `roles:
   list[str] | None`, and the `_extract_role_names` validator at
   `superset/mcp_service/user/schemas.py:111-133` converts ORM Role objects 
into plain
   strings but does not apply `escape_llm_context_delimiters` or 
`sanitize_for_llm_context`
   to those strings.
   
   2. The serializer `serialize_user_object` in 
`superset/mcp_service/user/schemas.py:32-68`
   builds the UserInfo instance for MCP tools: it sanitizes `username` using
   `escape_llm_context_delimiters` and wraps `first_name` and `last_name` with
   `sanitize_for_llm_context` (lines 55-61), but it passes `roles` through 
unchanged as the
   list of `r.name` values collected at lines 45-52.
   
   3. The MCP tool `get_user_info`
   (`superset/mcp_service/user/tool/get_user_info.py:24-27,52-60`) calls this 
serializer via
   `_serializer` and returns the resulting `UserInfo` to the MCP client; 
similarly,
   `list_users` 
(`superset/mcp_service/user/tool/list_users.py:12-19,21-33,35-65`) uses
   `serialize_user_object` for each row and then emits the response via
   `result.model_dump(mode="json", context=...)`, exposing `UserInfo.roles` 
directly in tool
   JSON consumed by LLM agents.
   
   4. Given the sanitization utilities in
   `superset/mcp_service/utils/sanitization.py:38-41,80-93,98-181` are 
explicitly designed to
   wrap and/or escape user-controlled strings before placing them in LLM 
context, the fact
   that role names bypass these utilities means that a role whose `name` 
contains prompt-like
   content or custom delimiters (e.g. long instructions or tokens intended to 
look like
   `<UNTRUSTED-CONTENT>` blocks) will be surfaced verbatim in MCP tool JSON and 
into
   downstream LLM prompts, unlike other user-facing fields on UserInfo that are 
sanitized.
   ```
   </details>
   
   [Fix in 
Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=f32c233ec9b24a84b2f65487bff36fae&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
 | [Fix in VSCode 
Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=f32c233ec9b24a84b2f65487bff36fae&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
   
   *(Use Cmd/Ctrl + Click for best experience)*
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/mcp_service/user/schemas.py
   **Line:** 122:127
   **Comment:**
        *Security: Role names are now emitted in MCP responses, but they are 
passed through unchanged and are not sanitized/escaped for LLM context like 
other user-facing text fields in this serializer. A crafted role name can 
inject control delimiters or prompt text into downstream LLM context; 
sanitize/escape each role name before adding it to the output.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   Once fix is implemented, also check other comments on the same PR, and ask 
user if the user wants to fix the rest of the comments as well. if said yes, 
then fetch all the comments validate the correctness and implement a minimal fix
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40746&comment_hash=6b0d64fc75d5a1e7ceac1f14f066696978dae81c9ae5ceb42da69bb5b270f458&reaction=like'>👍</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40746&comment_hash=6b0d64fc75d5a1e7ceac1f14f066696978dae81c9ae5ceb42da69bb5b270f458&reaction=dislike'>👎</a>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to