codeant-ai-for-open-source[bot] commented on code in PR #40935:
URL: https://github.com/apache/superset/pull/40935#discussion_r3389489799
##########
superset/mcp_service/dataset/schemas.py:
##########
@@ -444,6 +444,16 @@ class CreateVirtualDatasetRequest(BaseModel):
None,
description="Human-readable description of the dataset (optional).",
)
+ metrics: List[Dict[str, Any]] | None = Field(
+ None,
+ description="Optional list of saved metrics to create. Each dictionary
"
+ "must have 'metric_name' and 'expression'.",
+ )
+ calculated_columns: List[Dict[str, Any]] | None = Field(
+ None,
+ description="Optional list of calculated columns to create. Each
dictionary "
+ "must have 'column_name' and 'expression'.",
+ )
Review Comment:
**Suggestion:** `metrics` and `calculated_columns` accept arbitrary
dictionaries without required-key validation, but downstream update validation
assumes keys like `metric_name`/`column_name` exist and can raise uncaught
`KeyError`, causing a 500. Add explicit typed item schemas or validators that
enforce required fields before command execution. [api mismatch]
<details>
<summary><b>Severity Level:</b> Major ⚠️</summary>
```mdx
- ❌ MCP virtual dataset creation can 500 on bad metrics.
- ⚠️ LLM-generated payloads may intermittently hit uncaught KeyError.
- ⚠️ Clients receive no structured field-level validation feedback.
```
</details>
<details>
<summary><b>Steps of Reproduction ✅ </b></summary>
```mdx
1. Invoke the MCP tool `create_virtual_dataset` (implemented in
`superset/mcp_service/dataset/tool/create_virtual_dataset.py:4-82`) via the
MCP server
with a `CreateVirtualDatasetRequest` payload.
2. Construct the request so that `request.metrics` contains at least one
dict missing
`metric_name`, e.g. `metrics=[{"expression": "SUM(col1)"}]`, while other
required fields
(`database_id`, `sql`, `dataset_name`) are valid (schema at
`superset/mcp_service/dataset/schemas.py:14-57`).
3. Inside `create_virtual_dataset`, after
`CreateDatasetCommand(properties).run()`
succeeds (see `superset/commands/dataset/create.py:42-52`), the code at
`create_virtual_dataset.py:50-60` calls `UpdateDatasetCommand(dataset.id,
{"metrics":
request.metrics}).run()`.
4. `UpdateDatasetCommand.run()`
(`superset/commands/dataset/update.py:83-97`) calls
`self.validate()`, which in `_validate_semantics()` (`update.py:214-224`)
calls
`_validate_metrics(metrics, exceptions)`; `_validate_metrics` calls
`_get_duplicates(metrics, "metric_name")` (`update.py:26-30, 86-93`), which
evaluates
`[item[key] for item in data]` and raises a `KeyError` because `metric_name`
is missing.
This `KeyError` is not caught by the transaction decorator (which only
handles
`SQLAlchemyError`/`ValueError`) nor by the specific
`DatasetInvalidError`/`DatasetCreateFailedError`/`DatasetUpdateFailedError`
handlers in
`create_virtual_dataset.py:83-105`, so it falls through to the generic
`except Exception`
at `create_virtual_dataset.py:106-111` and is re-raised, causing the MCP
tool call to fail
with an unhandled server error instead of a structured validation error.
```
</details>
[Fix in
Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=38b06c53d245440998004a3247cf9cbd&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
| [Fix in VSCode
Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=38b06c53d245440998004a3247cf9cbd&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
*(Use Cmd/Ctrl + Click for best experience)*
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/mcp_service/dataset/schemas.py
**Line:** 447:456
**Comment:**
*Api Mismatch: `metrics` and `calculated_columns` accept arbitrary
dictionaries without required-key validation, but downstream update validation
assumes keys like `metric_name`/`column_name` exist and can raise uncaught
`KeyError`, causing a 500. Add explicit typed item schemas or validators that
enforce required fields before command execution.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask
user if the user wants to fix the rest of the comments as well. if said yes,
then fetch all the comments validate the correctness and implement a minimal fix
```
</details>
<a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40935&comment_hash=cfb28253c76c8832a5a3711ebc2cd4d1470302077ce2bdae78f5b4d3dcb3afb3&reaction=like'>👍</a>
| <a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40935&comment_hash=cfb28253c76c8832a5a3711ebc2cd4d1470302077ce2bdae78f5b4d3dcb3afb3&reaction=dislike'>👎</a>
##########
superset/mcp_service/dataset/tool/create_virtual_dataset.py:
##########
@@ -85,6 +86,18 @@ async def create_virtual_dataset(
dataset = CreateDatasetCommand(properties).run()
+ if request.metrics or request.calculated_columns:
+ from superset.commands.dataset.update import
UpdateDatasetCommand
+
+ update_props: dict[str, Any] = {}
+ if request.metrics:
+ update_props["metrics"] = request.metrics
Review Comment:
**Suggestion:** Passing only new `metrics` to `UpdateDatasetCommand` can
remove pre-existing dataset metrics, since update semantics replace the metric
set rather than append to it. Include existing metrics in the update payload
(plus new ones) or use an append-specific path. [logic error]
<details>
<summary><b>Severity Level:</b> Major ⚠️</summary>
```mdx
- ⚠️ Auto-generated default metrics silently removed from new datasets.
- ⚠️ Any pre-existing saved metrics are fully replaced.
- ⚠️ MCP clients cannot rely on default metric availability.
```
</details>
<details>
<summary><b>Steps of Reproduction ✅ </b></summary>
```mdx
1. Use the MCP `create_virtual_dataset` tool
(`superset/mcp_service/dataset/tool/create_virtual_dataset.py:4-82`) with a
`CreateVirtualDatasetRequest` that produces a valid dataset (so
`CreateDatasetCommand`
succeeds) and provide `metrics` containing only the new logical metrics you
want to add,
e.g. `[{"metric_name": "m1", "expression": "SUM(col1)"}]`, while the
database layer is
configured to auto-populate default metrics via `SqlaTable.fetch_metadata()`
(`superset/connectors/sqla/models.py:11-28, 31-43, 55-15`).
2. After `CreateDatasetCommand(properties).run()`
(`superset/commands/dataset/create.py:42-52`) and
`dataset.fetch_metadata()`, the dataset
has an initial set of metrics (derived from `database.get_metrics(...)` and
`add_missing_metrics`, `models.py:18-27, 31-15`), e.g. default `count`
metrics.
3. Because `request.metrics` is truthy, `create_virtual_dataset` executes
the block at
`create_virtual_dataset.py:50-60`, building `update_props["metrics"] =
request.metrics`
(containing only the new metric definitions) and calling
`UpdateDatasetCommand(dataset.id,
update_props).run()`.
4. Inside `UpdateDatasetCommand.run()`
(`superset/commands/dataset/update.py:83-97`),
`DatasetDAO.update(self._model, attributes=self._properties)` is invoked;
`DatasetDAO.update` (`superset/daos/dataset.py:20-32`) sees `"metrics"` and
calls
`update_metrics(model, attributes.pop("metrics"))`. `update_metrics`
(`dataset.py:114-158`) bulk-inserts metrics without `id` and then deletes
all existing
metrics whose ids are not present in `property_metrics_by_id` (empty for
new-only
metrics), thereby removing any pre-existing metrics (including
auto-populated ones), so
the final dataset retains only the newly submitted `metrics` list.
```
</details>
[Fix in
Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=05cf9400f6fe495db712083e9180f31e&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
| [Fix in VSCode
Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=05cf9400f6fe495db712083e9180f31e&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
*(Use Cmd/Ctrl + Click for best experience)*
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/mcp_service/dataset/tool/create_virtual_dataset.py
**Line:** 93:94
**Comment:**
*Logic Error: Passing only new `metrics` to `UpdateDatasetCommand` can
remove pre-existing dataset metrics, since update semantics replace the metric
set rather than append to it. Include existing metrics in the update payload
(plus new ones) or use an append-specific path.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask
user if the user wants to fix the rest of the comments as well. if said yes,
then fetch all the comments validate the correctness and implement a minimal fix
```
</details>
<a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40935&comment_hash=922b7dea5def2133e6a91942ece82cb9c8f2da726460665f9fce223edf6ce954&reaction=like'>👍</a>
| <a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40935&comment_hash=922b7dea5def2133e6a91942ece82cb9c8f2da726460665f9fce223edf6ce954&reaction=dislike'>👎</a>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]