codeant-ai-for-open-source[bot] commented on code in PR #37389:
URL: https://github.com/apache/superset/pull/37389#discussion_r2726129012


##########
superset/mcp_service/system/resources/instance_metadata.py:
##########
@@ -88,12 +91,57 @@ def get_instance_metadata_resource() -> str:
             logger=logger,
         )
 
-        # Use the shared core's resource method
-        return instance_info_core.get_resource()
+        # Get base instance info
+        base_result = json.loads(instance_info_core.get_resource())
+
+        # Remove empty popular_content if it has no useful data
+        popular = base_result.get("popular_content", {})
+        if popular and not any(popular.get(k) for k in popular):
+            del base_result["popular_content"]
+
+        # Add available datasets (top 20 by most recent modification)
+        dataset_dao = instance_info_core.dao_classes["datasets"]
+        try:
+            datasets = dataset_dao.find_all()
+            # Convert to string to avoid TypeError when comparing datetime with None
+            sorted_datasets = sorted(
+                datasets,
+                key=lambda d: str(getattr(d, "changed_on", "") or ""),
+                reverse=True,
+            )[:20]

Review Comment:
   **Suggestion:** Resource exhaustion: calling `dataset_dao.find_all()` loads every dataset row into memory. Replace it with a paginated DAO call (`list`) that fetches only a limited number of rows (top ~20), so the entire table is not loaded into memory and then sorted in Python. [resource leak]
   
   <details>
   <summary><b>Severity Level:</b> Critical 🚨</summary>
   
   ```mdx
   - ❌ instance://metadata may OOM on large dataset tables.
   - ⚠️ Metadata calls become slow for many datasets.
   - ⚠️ Affects LLM clients fetching dataset IDs.
   ```
   </details>
   
   ```suggestion
               # Use paginated `list` to avoid loading every dataset into memory.
               datasets, _ = dataset_dao.list(
                   page_size=20,
                   columns=["id", "table_name", "schema", "database_id", "changed_on"],
               )
               # Keep the previous string-based fallback for changed_on to avoid datetime comparison errors.
               sorted_datasets = sorted(
                   datasets,
                   key=lambda d: str(getattr(d, "changed_on", "") or ""),
                   reverse=True,
               )
   ```
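   As an alternative to sorting in Python at all, the ordering and limiting could be pushed to the database. The sketch below is a hypothetical illustration, not the DAO's API: it assumes the MCP service runs inside a Superset app context where `db.session` is available, queries the `SqlaTable` dataset model directly, and the helper name is illustrative.

   ```python
   # Hypothetical sketch: let the database return only the 20 most recently
   # modified datasets instead of loading and sorting all rows in Python.
   from superset import db
   from superset.connectors.sqla.models import SqlaTable


   def top_recent_datasets(limit: int = 20) -> list[SqlaTable]:
       """Return the most recently modified datasets, ordered in SQL."""
       return (
           db.session.query(SqlaTable)
           # NULL handling for changed_on under DESC varies by backend; adjust if needed.
           .order_by(SqlaTable.changed_on.desc())
           .limit(limit)
           .all()
       )
   ```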
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Start the MCP service: `python -m superset.mcp_service`.

   2. Call the instance metadata resource ("instance://metadata"), which invokes
   get_instance_metadata_resource in
   superset/mcp_service/system/resources/instance_metadata.py (function defined near line 36).

   3. The code at lines 103-106 calls `dataset_dao.find_all()` (dataset DAO from
   InstanceInfoCore.dao_classes). BaseDAO.find_all (superset/daos/base.py:355-361) executes
   `query.all()`, loading all dataset rows into memory.

   4. On installations with many datasets (tens of thousands), this allocates a large amount
   of memory and can cause long pauses or OOM during the metadata request. Reproduce by
   creating many datasets, invoking the resource, and observing high memory usage at
   instance_metadata.py:103 (a hedged measurement sketch follows after this section).
   ```
   </details>
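   To make step 4 of the reproduction concrete, here is a hypothetical measurement aid (not part of the PR): it assumes a Superset app context (e.g. via `superset shell`) and uses `DatasetDAO.find_all()` together with the standard-library `tracemalloc` module to show the allocation of loading every dataset row.

   ```python
   # Hypothetical reproduction aid: measure peak memory while find_all()
   # pulls every dataset row into memory.
   import tracemalloc

   from superset.daos.dataset import DatasetDAO

   tracemalloc.start()
   datasets = DatasetDAO.find_all()  # loads the whole dataset table
   current, peak = tracemalloc.get_traced_memory()
   tracemalloc.stop()
   print(f"{len(datasets)} datasets loaded, peak allocation ~{peak / 1e6:.1f} MB")
   ```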
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.

   **Path:** superset/mcp_service/system/resources/instance_metadata.py
   **Line:** 105:111
   **Comment:**
        *Resource Leak:* Resource exhaustion: calling `dataset_dao.find_all()` loads all dataset rows into memory. Replace it with a paginated DAO call (`list`) that fetches only a limited number of rows (top ~20), to avoid loading the entire table into memory and then sorting in Python.

   Validate the correctness of the flagged issue. If it is correct, how can I resolve it? If you propose a fix, implement it and keep it concise.
   ```
   </details>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

