codeant-ai-for-open-source[bot] commented on code in PR #37200: URL: https://github.com/apache/superset/pull/37200#discussion_r2699817933
########## superset/mcp_service/utils/token_utils.py: ########## @@ -0,0 +1,402 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +""" +Token counting and response size utilities for MCP service. + +This module provides utilities to estimate token counts and generate smart +suggestions when responses exceed configured limits. This prevents large +responses from overwhelming LLM clients like Claude Desktop. +""" + +from __future__ import annotations + +import logging +from typing import Any, Dict, List + +logger = logging.getLogger(__name__) + +# Approximate characters per token for estimation +# Claude tokenizer averages ~4 chars per token for English text +# JSON tends to be more verbose, so we use a slightly lower ratio +CHARS_PER_TOKEN = 3.5 + + +def estimate_token_count(text: str | bytes) -> int: + """ + Estimate the token count for a given text. + + Uses a character-based heuristic since we don't have direct access to + the actual tokenizer. This is conservative to avoid underestimating. + + Args: + text: The text to estimate tokens for (string or bytes) + + Returns: + Estimated number of tokens + """ + if isinstance(text, bytes): + text = text.decode("utf-8", errors="replace") + + # Simple heuristic: ~3.5 characters per token for JSON/code + return int(len(text) / CHARS_PER_TOKEN) + + +def estimate_response_tokens(response: Any) -> int: + """ + Estimate token count for an MCP tool response. + + Handles various response types including Pydantic models, dicts, and strings. + + Args: + response: The response object to estimate + + Returns: + Estimated number of tokens + """ + try: + from superset.utils import json + + # Convert response to JSON string for accurate estimation + if hasattr(response, "model_dump"): + # Pydantic model + response_str = json.dumps(response.model_dump()) + elif isinstance(response, (dict, list)): + response_str = json.dumps(response) + elif isinstance(response, (str, bytes)): + response_str = response if isinstance(response, str) else response.decode() + else: + response_str = str(response) + + return estimate_token_count(response_str) + except (TypeError, ValueError, AttributeError) as e: + logger.warning("Failed to estimate response tokens: %s", e) + # Return a high estimate to be safe + return 100000 + + +def get_response_size_bytes(response: Any) -> int: + """ + Get the size of a response in bytes. + + Args: + response: The response object + + Returns: + Size in bytes + """ + try: + from superset.utils import json + + if hasattr(response, "model_dump"): + response_str = json.dumps(response.model_dump()) + elif isinstance(response, (dict, list)): + response_str = json.dumps(response) + elif isinstance(response, bytes): + return len(response) + elif isinstance(response, str): + return len(response.encode("utf-8")) + else: + response_str = str(response) + + return len(response_str.encode("utf-8")) + except (TypeError, ValueError, AttributeError) as e: + logger.warning("Failed to get response size: %s", e) + return 0 Review Comment: **Suggestion:** The except block returns 0 on failure which silently underestimates response size (allowing oversized responses to pass); also it doesn't catch ImportError/UnicodeDecodeError — fall back to computing the byte-length via str(response) and if that fails, return a conservative large value. [logic error] <details> <summary><b>Severity Level:</b> Critical 🚨</summary> ```mdx - ❌ Oversized responses may bypass size checks. - ⚠️ LLM clients (e.g., Claude Desktop) risk receiving huge payloads. - ⚠️ Affected component: MCP response-size determination. ``` </details> ```suggestion except (TypeError, ValueError, AttributeError, ImportError, UnicodeDecodeError) as e: logger.warning("Failed to get response size: %s; falling back to string length", e) try: return len(str(response).encode("utf-8")) except Exception: # Return a large conservative size to avoid silently underestimating return 100000 ``` <details> <summary><b>Steps of Reproduction ✅ </b></summary> ```mdx 1. Start the MCP service with the PR applied. 2. Call a tool that returns a response object that json.dumps cannot serialize (e.g., an object with non-serializable attributes). get_response_size_bytes in superset/mcp_service/utils/token_utils.py (function get_response_size_bytes, lines ~92-119) calls json.dumps(response) at lines ~105-108. 3. json.dumps raises TypeError; execution falls into the except block at lines 117-119 which currently logs and returns 0. 4. Returning 0 underestimates size; the ResponseSizeGuardMiddleware (which uses this utility to decide blocking) may allow an oversized response through because the size check sees 0 bytes. ``` </details> <details> <summary><b>Prompt for AI Agent 🤖 </b></summary> ```mdx This is a comment left during a code review. **Path:** superset/mcp_service/utils/token_utils.py **Line:** 117:119 **Comment:** *Logic Error: The except block returns 0 on failure which silently underestimates response size (allowing oversized responses to pass); also it doesn't catch ImportError/UnicodeDecodeError — fall back to computing the byte-length via str(response) and if that fails, return a conservative large value. Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise. ``` </details> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
