codeant-ai-for-open-source[bot] commented on code in PR #40340: URL: https://github.com/apache/superset/pull/40340#discussion_r3306324690
########## superset/mcp_service/dataset/tool/create_dataset.py: ########## @@ -0,0 +1,142 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +""" +Create dataset FastMCP tool + +Registers a physical table as a Superset dataset against an existing +database connection — the programmatic equivalent of Data → Datasets → +Dataset. +Returns the same DatasetInfo shape as get_dataset_info so the caller can feed +the resulting dataset_id directly into generate_chart. +""" + +import logging +from datetime import datetime, timezone + +from fastmcp import Context + +from superset.mcp_service.app import mcp +from superset.mcp_service.auth import mcp_auth_hook +from superset.mcp_service.dataset.schemas import ( + CreateDatasetRequest, + DatasetError, + DatasetInfo, + serialize_dataset_object, +) + +logger = logging.getLogger(__name__) + + [email protected](tags=["mutate"]) +@mcp_auth_hook +def create_dataset( + request: CreateDatasetRequest, ctx: Context +) -> DatasetInfo | DatasetError: + """Register a physical table as a Superset dataset. + + Wraps POST /api/v1/dataset/ — the same endpoint the UI uses when you click + Data → Datasets → +Dataset. Returns full dataset metadata (same shape as + get_dataset_info) so you can pass the resulting dataset_id straight into + generate_chart. + + Required fields: + - database_id: ID of the existing database connection + - schema: Schema/namespace where the table lives (e.g. "public") + - table_name: Exact name of the physical table to register + + Optional fields: + - owners: List of user IDs to set as owners (defaults to calling user) + + Example: + ```json + { + "database_id": 1, + "schema": "public", + "table_name": "orders" + } + ``` + + Returns DatasetInfo on success or DatasetError on failure. + Use list_databases to find the correct database_id. + """ + try: + from superset.commands.dataset.create import CreateDatasetCommand + from superset.commands.dataset.exceptions import ( + DatasetCreateFailedError, + DatasetExistsValidationError, + DatasetInvalidError, + TableNotFoundValidationError, + ) + + dataset_properties = { + "database": request.database_id, + "schema": request.schema, + "table_name": request.table_name, + } Review Comment: **Suggestion:** `schema` is forwarded verbatim without trimming/normalizing whitespace-only values, so inputs like `" "` are treated as literal schema names and trigger false not-found/validation failures. Normalize blank schema input to `None` before building command properties. [incorrect condition logic] <details> <summary><b>Severity Level:</b> Major ⚠️</summary> ```mdx - ⚠️ Whitespace schema inputs cause spurious table-not-found errors. - ⚠️ LLM-generated requests more likely to fail unexpectedly. ``` </details> <details> <summary><b>Steps of Reproduction ✅ </b></summary> ```mdx 1. The `CreateDatasetRequest` model in `superset/mcp_service/dataset/schemas.py:28-52` defines `schema: str` without any `field_validator` to trim or normalize whitespace, unlike the explicit whitespace handling for `sql` and `dataset_name` in `CreateVirtualDatasetRequest` at lines 90-102. 2. A client (or LLM) calls the `create_dataset` MCP tool with a payload where `"schema"` is present but contains only spaces, e.g. `{"request": {"database_id": 1, "schema": " ", "table_name": "orders"}}`; Pydantic accepts `" "` as a valid `str` for `schema`. 3. In `create_dataset`, `dataset_properties` is constructed at `superset/mcp_service/dataset/tool/create_dataset.py:85-89` with `"schema": request.schema`, so the `"schema"` property passed to `CreateDatasetCommand` is the literal `" "` string. 4. `CreateDatasetCommand` uses this schema string to resolve the physical table; because `" "` does not match any real schema, the command raises a not-found or validation error for a table that actually exists in the default/unspecified schema, whereas normalizing blank/whitespace-only schema to `None` would allow correct resolution. ``` </details> [Fix in Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=3d1eaad1019a46dc82d237cf2cc7df7c&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset) | [Fix in VSCode Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=3d1eaad1019a46dc82d237cf2cc7df7c&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset) *(Use Cmd/Ctrl + Click for best experience)* <details> <summary><b>Prompt for AI Agent 🤖 </b></summary> ```mdx This is a comment left during a code review. **Path:** superset/mcp_service/dataset/tool/create_dataset.py **Line:** 85:89 **Comment:** *Incorrect Condition Logic: `schema` is forwarded verbatim without trimming/normalizing whitespace-only values, so inputs like `" "` are treated as literal schema names and trigger false not-found/validation failures. Normalize blank schema input to `None` before building command properties. Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise. Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix ``` </details> <a href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40340&comment_hash=43aa4d71e3b58c455091b3d1e8650ef381d49fca7248aae2c4aa623f243ea21c&reaction=like'>👍</a> | <a href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40340&comment_hash=43aa4d71e3b58c455091b3d1e8650ef381d49fca7248aae2c4aa623f243ea21c&reaction=dislike'>👎</a> ########## superset/mcp_service/dataset/tool/create_dataset.py: ########## @@ -0,0 +1,142 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +""" +Create dataset FastMCP tool + +Registers a physical table as a Superset dataset against an existing +database connection — the programmatic equivalent of Data → Datasets → +Dataset. +Returns the same DatasetInfo shape as get_dataset_info so the caller can feed +the resulting dataset_id directly into generate_chart. +""" + +import logging +from datetime import datetime, timezone + +from fastmcp import Context + +from superset.mcp_service.app import mcp +from superset.mcp_service.auth import mcp_auth_hook +from superset.mcp_service.dataset.schemas import ( + CreateDatasetRequest, + DatasetError, + DatasetInfo, + serialize_dataset_object, +) + +logger = logging.getLogger(__name__) + + [email protected](tags=["mutate"]) +@mcp_auth_hook +def create_dataset( + request: CreateDatasetRequest, ctx: Context +) -> DatasetInfo | DatasetError: Review Comment: **Suggestion:** This tool is registered with `@mcp.tool` instead of the shared `@tool` decorator that sets `class_permission_name`/`method_permission_name`. Because `mcp_auth_hook` allows tools without class-permission metadata, this write operation can be exposed without RBAC enforcement. Register it with the core MCP decorator and explicit Dataset write permissions. [security] <details> <summary><b>Severity Level:</b> Critical 🚨</summary> ```mdx - ❌ create_dataset tool not gated by Dataset write permission. - ⚠️ RBAC behavior inconsistent with create_virtual_dataset tool. ``` </details> <details> <summary><b>Steps of Reproduction ✅ </b></summary> ```mdx 1. The `create_dataset` tool is registered using `@mcp.tool(tags=["mutate"])` and `@mcp_auth_hook` in `superset/mcp_service/dataset/tool/create_dataset.py:44-48`, instead of the shared `@tool` decorator from `superset_core.mcp.decorators` used by `create_virtual_dataset` in `superset/mcp_service/dataset/tool/create_virtual_dataset.py:33-42`. 2. `mcp_auth_hook` in `superset/mcp_service/auth.py:34-47` documents that permission metadata `(class_permission_name, method_permission_name)` is stored on tool functions by the `@tool` decorator in `core_mcp_injection.py`, and uses `check_tool_permission(tool_func)` for RBAC enforcement when those attributes are present. 3. `check_tool_permission` tests in `tests/unit_tests/mcp_service/test_auth_rbac.py:101-109` assert that when a tool function has no `CLASS_PERMISSION_ATTR` (`_class_permission_name`), `check_tool_permission(func)` returns `True` for a logged-in user, i.e. tools without class-permission metadata are always allowed when RBAC is enabled. 4. Because `create_dataset` is decorated with raw `@mcp.tool` and never passes `class_permission_name`/`method_permission_name` into the core `@tool` decorator, it has no `_class_permission_name` attribute; therefore `mcp_auth_hook`'s call to `check_tool_permission` always allows this mutating dataset tool, unlike `create_virtual_dataset` which is gated by `"Dataset"`/`"write"` permissions, so dataset-creation via MCP is exposed without the intended per-tool RBAC gate. ``` </details> [Fix in Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=11eeb613817d4ff79e09fefb660d4add&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset) | [Fix in VSCode Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=11eeb613817d4ff79e09fefb660d4add&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset) *(Use Cmd/Ctrl + Click for best experience)* <details> <summary><b>Prompt for AI Agent 🤖 </b></summary> ```mdx This is a comment left during a code review. **Path:** superset/mcp_service/dataset/tool/create_dataset.py **Line:** 44:48 **Comment:** *Security: This tool is registered with `@mcp.tool` instead of the shared `@tool` decorator that sets `class_permission_name`/`method_permission_name`. Because `mcp_auth_hook` allows tools without class-permission metadata, this write operation can be exposed without RBAC enforcement. Register it with the core MCP decorator and explicit Dataset write permissions. Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise. Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix ``` </details> <a href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40340&comment_hash=ed2d18b2ba14ea4f3838664c19c954c6db3a14ab14b5f6b5946fa4100ee31882&reaction=like'>👍</a> | <a href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40340&comment_hash=ed2d18b2ba14ea4f3838664c19c954c6db3a14ab14b5f6b5946fa4100ee31882&reaction=dislike'>👎</a> ########## superset/mcp_service/dataset/schemas.py: ########## @@ -324,6 +324,33 @@ class GetDatasetInfoRequest(MetadataCacheControl): ] +class CreateDatasetRequest(BaseModel): + """Request schema for create_dataset to register a physical table as a dataset.""" + + database_id: Annotated[ + int, + Field( + description="ID of the database connection to register the table against" + ), + ] + schema: Annotated[ + str, + Field(description="Schema (namespace) where the table lives, e.g. 'public'"), + ] Review Comment: **Suggestion:** Making `schema` mandatory rejects valid physical datasets for engines/tables that live without a schema namespace, while the underlying dataset API accepts `schema=None`. This should be optional to avoid request-level validation failures for no-schema sources. [api mismatch] <details> <summary><b>Severity Level:</b> Major ⚠️</summary> ```mdx - ❌ create_dataset unusable for engines without schema namespaces. - ⚠️ LLM calls fail at request validation layer. ``` </details> <details> <summary><b>Steps of Reproduction ✅ </b></summary> ```mdx 1. The MCP tool `create_dataset` in `superset/mcp_service/dataset/tool/create_dataset.py:46-75` declares its first argument as `request: CreateDatasetRequest`, so FastMCP will instantiate `CreateDatasetRequest` from the incoming JSON payload. 2. `CreateDatasetRequest` is defined in `superset/mcp_service/dataset/schemas.py:327-352` with `schema: Annotated[str, Field(...)]` and no default value, making the `schema` field mandatory and non-nullable. 3. Call the MCP tool via FastMCP (analogous to tests in `tests/unit_tests/mcp_service/dataset/tool/test_dataset_tools.py:104-117`) with a payload omitting `schema`, e.g. `{"request": {"database_id": 1, "table_name": "my_table"}}` for a database/engine that legitimately uses no schema namespace. 4. Pydantic validation for `CreateDatasetRequest` rejects the request before `create_dataset()` runs, raising a validation error instead of allowing `schema=None` to be forwarded to `CreateDatasetCommand`, so valid no-schema physical tables cannot be registered through this tool. ``` </details> [Fix in Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=8488a5bb3772451ebb6a7b72b76fd303&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset) | [Fix in VSCode Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=8488a5bb3772451ebb6a7b72b76fd303&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset) *(Use Cmd/Ctrl + Click for best experience)* <details> <summary><b>Prompt for AI Agent 🤖 </b></summary> ```mdx This is a comment left during a code review. **Path:** superset/mcp_service/dataset/schemas.py **Line:** 336:339 **Comment:** *Api Mismatch: Making `schema` mandatory rejects valid physical datasets for engines/tables that live without a schema namespace, while the underlying dataset API accepts `schema=None`. This should be optional to avoid request-level validation failures for no-schema sources. Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise. Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix ``` </details> <a href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40340&comment_hash=0dfa3d995f14d04cfd261f026ab0e5c73ae119b008c132f468e5a3432aad57da&reaction=like'>👍</a> | <a href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40340&comment_hash=0dfa3d995f14d04cfd261f026ab0e5c73ae119b008c132f468e5a3432aad57da&reaction=dislike'>👎</a> ########## superset/mcp_service/dataset/tool/create_dataset.py: ########## @@ -0,0 +1,142 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +""" +Create dataset FastMCP tool + +Registers a physical table as a Superset dataset against an existing +database connection — the programmatic equivalent of Data → Datasets → +Dataset. +Returns the same DatasetInfo shape as get_dataset_info so the caller can feed +the resulting dataset_id directly into generate_chart. +""" + +import logging +from datetime import datetime, timezone + +from fastmcp import Context + +from superset.mcp_service.app import mcp +from superset.mcp_service.auth import mcp_auth_hook +from superset.mcp_service.dataset.schemas import ( + CreateDatasetRequest, + DatasetError, + DatasetInfo, + serialize_dataset_object, +) + +logger = logging.getLogger(__name__) + + [email protected](tags=["mutate"]) +@mcp_auth_hook +def create_dataset( + request: CreateDatasetRequest, ctx: Context +) -> DatasetInfo | DatasetError: + """Register a physical table as a Superset dataset. + + Wraps POST /api/v1/dataset/ — the same endpoint the UI uses when you click + Data → Datasets → +Dataset. Returns full dataset metadata (same shape as + get_dataset_info) so you can pass the resulting dataset_id straight into + generate_chart. + + Required fields: + - database_id: ID of the existing database connection + - schema: Schema/namespace where the table lives (e.g. "public") + - table_name: Exact name of the physical table to register + + Optional fields: + - owners: List of user IDs to set as owners (defaults to calling user) + + Example: + ```json + { + "database_id": 1, + "schema": "public", + "table_name": "orders" + } + ``` + + Returns DatasetInfo on success or DatasetError on failure. + Use list_databases to find the correct database_id. + """ + try: + from superset.commands.dataset.create import CreateDatasetCommand + from superset.commands.dataset.exceptions import ( + DatasetCreateFailedError, + DatasetExistsValidationError, + DatasetInvalidError, + TableNotFoundValidationError, + ) + + dataset_properties = { + "database": request.database_id, + "schema": request.schema, + "table_name": request.table_name, + } + if request.owners is not None: + dataset_properties["owners"] = request.owners + + command = CreateDatasetCommand(dataset_properties) + dataset = command.run() + + result = serialize_dataset_object(dataset) + if result is None: + return DatasetError( + error="Dataset was created but could not be serialized", + error_type="SerializationError", + timestamp=datetime.now(timezone.utc), + ) + + logger.info( + "Created dataset id=%s table=%s.%s", + dataset.id, + request.schema, + request.table_name, + ) + return result + + except DatasetExistsValidationError as e: + return DatasetError( + error=str(e), + error_type="DatasetExistsError", + timestamp=datetime.now(timezone.utc), + ) + except TableNotFoundValidationError as e: + return DatasetError( + error=str(e), + error_type="TableNotFoundError", + timestamp=datetime.now(timezone.utc), + ) + except DatasetInvalidError as e: + return DatasetError( + error=str(e), + error_type="ValidationError", + timestamp=datetime.now(timezone.utc), Review Comment: **Suggestion:** `CreateDatasetCommand` aggregates validation failures into `DatasetInvalidError`, so this handler collapses specific failure classes (database missing, duplicate dataset, table missing, etc.) into a generic type and breaks the advertised error contract. Classify `DatasetInvalidError` by inspecting its contained validation class names and map to specific `error_type` values. [api mismatch] <details> <summary><b>Severity Level:</b> Major ⚠️</summary> ```mdx - ⚠️ Callers cannot distinguish database-missing vs table-missing errors. - ⚠️ Error_type field loses promised semantic specificity. ``` </details> <details> <summary><b>Steps of Reproduction ✅ </b></summary> ```mdx 1. The `create_dataset` tool in `superset/mcp_service/dataset/tool/create_dataset.py:76-95` constructs `dataset_properties` and calls `command = CreateDatasetCommand(dataset_properties); dataset = command.run()`. 2. When dataset creation fails due to validation issues (e.g. database id not found, table not accessible, duplicate dataset), `CreateDatasetCommand.run()` raises `DatasetInvalidError` carrying specific validation error types, as exercised for the same command in `tests/unit_tests/mcp_service/dataset/tool/test_dataset_tools.py:128-154` and :220-251 for the virtual-dataset tool. 3. In `create_dataset`, any `DatasetInvalidError` is caught by the generic handler at `superset/mcp_service/dataset/tool/create_dataset.py:124-128`, which unconditionally returns `DatasetError(error=str(e), error_type="ValidationError", ...)` without inspecting the contained validation classes. 4. Callers of the MCP `create_dataset` tool therefore always see `error_type="ValidationError"` for all validation failures (database missing, dataset exists, access denied, etc.), instead of more specific categories (e.g. `DatabaseNotFoundError`, `DatasetExistsError`, `AccessDeniedError`) that can be derived from `DatasetInvalidError`'s internal error list, breaking any consumer logic that relies on error_type for branching. ``` </details> [Fix in Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=cef9b0dee5144077a42222ee775a710a&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset) | [Fix in VSCode Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=cef9b0dee5144077a42222ee775a710a&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset) *(Use Cmd/Ctrl + Click for best experience)* <details> <summary><b>Prompt for AI Agent 🤖 </b></summary> ```mdx This is a comment left during a code review. **Path:** superset/mcp_service/dataset/tool/create_dataset.py **Line:** 124:128 **Comment:** *Api Mismatch: `CreateDatasetCommand` aggregates validation failures into `DatasetInvalidError`, so this handler collapses specific failure classes (database missing, duplicate dataset, table missing, etc.) into a generic type and breaks the advertised error contract. Classify `DatasetInvalidError` by inspecting its contained validation class names and map to specific `error_type` values. Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise. Once fix is implemented, also check other comments on the same PR, and ask user if the user wants to fix the rest of the comments as well. if said yes, then fetch all the comments validate the correctness and implement a minimal fix ``` </details> <a href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40340&comment_hash=0eeedb90dab16b436a7924afcd3e4d645af05b827d0be4c9d7e36610e2d9e559&reaction=like'>👍</a> | <a href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40340&comment_hash=0eeedb90dab16b436a7924afcd3e4d645af05b827d0be4c9d7e36610e2d9e559&reaction=dislike'>👎</a> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
