Re: [PR] fix(dataset-api): disambiguate get_or_create by schema [superset]

via GitHub Thu, 28 May 2026 02:58:40 -0700


codeant-ai-for-open-source[bot] commented on code in PR #40494:
URL: https://github.com/apache/superset/pull/40494#discussion_r3316932493



##########
superset/daos/dataset.py:
##########
@@ -424,6 +424,18 @@ def get_table_by_name(database_id: int, table_name: str) 
-> SqlaTable | None:
             .one_or_none()
         )
 
+    @staticmethod
+    def get_table_by_schema_and_name(
+        database_id: int, schema: str | None, table_name: str
+    ) -> SqlaTable | None:
+        # Filter by schema as well so callers can disambiguate datasets that
+        # share a ``table_name`` across schemas (#30377).
+        return (
+            db.session.query(SqlaTable)
+            .filter_by(database_id=database_id, schema=schema, 
table_name=table_name)
+            .one_or_none()

Review Comment:
   **Suggestion:** The new lookup still ignores `catalog`, even though 
`GetOrCreateDatasetSchema` accepts it and dataset uniqueness includes 
`(database_id, catalog, schema, table_name)`. In multi-catalog databases this 
can still return the wrong dataset or raise `MultipleResultsFound` when two 
datasets share schema/table across catalogs. Include `catalog` in both method 
arguments and the filter so the existence check matches the API contract and 
uniqueness model. [api mismatch]
   
   <details>
   <summary><b>Severity Level:</b> Critical 🚨</summary>
   
   ```mdx
   - ❌ /dataset/get_or_create crashes with MultipleResultsFound.
   - ❌ Wrong dataset returned when catalogs differ but schema matches.
   - ⚠️ Multi-catalog environments misroute datasets across catalogs.
   - ⚠️ API contract misaligns with GetOrCreateDatasetSchema.catalog.
   ```
   </details>
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Observe the dataset uniqueness model: `SqlaTable` in
   `superset/connectors/sqla/models.py:1256-1258` defines a SQLAlchemy
   `UniqueConstraint("database_id", "catalog", "schema", "table_name")`, 
meaning datasets are
   distinguished by `(database_id, catalog, schema, table_name)`.
   
   2. Create two datasets for the same physical table across different catalogs 
but with the
   same schema and table name using the regular dataset creation API 
(`DatasetRestApi.post`
   at `superset/datasets/api.py:310-360`, which calls `CreateDatasetCommand` at
   `superset/commands/dataset/create.py:42-52` and 
`DatasetDAO.validate_uniqueness` at
   `superset/daos/dataset.py:20-40`) with `catalog="cat_a"` and 
`catalog="cat_b"`, identical
   `schema` and `table_name`, and the same `database`.
   
   3. Call `POST /api/v1/dataset/get_or_create/` which is implemented by
   `DatasetRestApi.get_or_create_dataset` at 
`superset/datasets/api.py:1045-1107` with a JSON
   body matching the existing datasets: `{"database_id": <db_id>, "table_name": 
"<name>",
   "schema": "<schema>", "catalog": "cat_a"}`; the request is validated by
   `GetOrCreateDatasetSchema` (`superset/datasets/schemas.py:360-380`) which 
accepts
   `catalog` but the handler ignores it.
   
   4. Inside `get_or_create_dataset`, the handler computes `schema = 
body.get("schema")` and
   calls `DatasetDAO.get_table_by_schema_and_name(database_id, schema, 
table_name)`
   (`superset/datasets/api.py:1093-1101`), which in turn queries `SqlaTable` in
   `DatasetDAO.get_table_by_schema_and_name` at 
`superset/daos/dataset.py:48-58` using only
   `database_id`, `schema` and `table_name` and omits `catalog`; if two rows 
exist across
   catalogs this query returns multiple rows and `Query.one_or_none()` raises
   `MultipleResultsFound` (HTTP 500), and if only one catalog row exists but 
the client
   intended another catalog, the method still returns that single row, causing a
   false-positive match from the wrong catalog instead of creating the new 
dataset.
   ```
   </details>
   
   [Fix in 
Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=6d880f17f4514d799b938e1f9790261e&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
 | [Fix in VSCode 
Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=6d880f17f4514d799b938e1f9790261e&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
   
   *(Use Cmd/Ctrl + Click for best experience)*
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/daos/dataset.py
   **Line:** 428:436
   **Comment:**
        *Api Mismatch: The new lookup still ignores `catalog`, even though 
`GetOrCreateDatasetSchema` accepts it and dataset uniqueness includes 
`(database_id, catalog, schema, table_name)`. In multi-catalog databases this 
can still return the wrong dataset or raise `MultipleResultsFound` when two 
datasets share schema/table across catalogs. Include `catalog` in both method 
arguments and the filter so the existence check matches the API contract and 
uniqueness model.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   Once fix is implemented, also check other comments on the same PR, and ask 
user if the user wants to fix the rest of the comments as well. if said yes, 
then fetch all the comments validate the correctness and implement a minimal fix
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40494&comment_hash=4d5429879b414ecb2be39f268da37ffe3aa9de0fb4cdbf39e8ec8b051aab51b0&reaction=like'>👍</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40494&comment_hash=4d5429879b414ecb2be39f268da37ffe3aa9de0fb4cdbf39e8ec8b051aab51b0&reaction=dislike'>👎</a>



##########
superset/datasets/api.py:
##########
@@ -1088,7 +1091,14 @@ def get_or_create_dataset(self) -> Response:
             return self.response(400, message=ex.messages)
         table_name = body["table_name"]
         database_id = body["database_id"]
-        if table := DatasetDAO.get_table_by_name(database_id, table_name):
+        # Honour the request's ``schema`` so we don't (a) raise 500 on
+        # ``MultipleResultsFound`` when datasets share a ``table_name`` across
+        # schemas, or (b) silently return an existing dataset from the wrong
+        # schema as a false positive (#30377).
+        schema = body.get("schema")
+        if table := DatasetDAO.get_table_by_schema_and_name(
+            database_id, schema, table_name
+        ):

Review Comment:
   **Suggestion:** `schema` is forwarded as-is from request body, but elsewhere 
Superset treats empty schema and null schema equivalently (`schema or None`). 
With the new strict equality lookup, requests with `schema=""` or missing 
`schema` can miss an existing dataset and incorrectly go down the create path. 
Normalize with `schema = body.get("schema") or None` before lookup so 
empty-string and null schema resolve consistently. [incorrect condition logic]
   
   <details>
   <summary><b>Severity Level:</b> Major ⚠️</summary>
   
   ```mdx
   - ❌ /dataset/get_or_create duplicates datasets for blank-schema tables.
   - ⚠️ Clients using schema="" bypass existing dataset reuse.
   - ⚠️ Inconsistent schema handling versus other codepaths (schema or None).
   ```
   </details>
   <details>
   <summary><b>Steps of Reproduction ✅ </b></summary>
   
   ```mdx
   1. Create a dataset via `POST /api/v1/dataset/` handled by 
`DatasetRestApi.post` at
   `superset/datasets/api.py:310-360`, sending a JSON body without a `schema` 
field (so
   `DatasetPostSchema` at `superset/datasets/schemas.py:159-173` deserializes 
`schema=None`);
   this flows through `CreateDatasetCommand` 
(`superset/commands/dataset/create.py:42-52`)
   and `DatasetDAO.validate_uniqueness` (`superset/daos/dataset.py:20-40`), 
resulting in a
   `SqlaTable` row with `schema=NULL` for that `(database_id, catalog, 
table_name)`.
   
   2. Later, call `POST /api/v1/dataset/get_or_create/`
   (`DatasetRestApi.get_or_create_dataset` at 
`superset/datasets/api.py:1045-1107`) with a
   JSON body that explicitly sets an empty schema string, e.g. `{"database_id": 
<db_id>,
   "table_name": "<name>", "schema": ""}`, which is allowed by 
`GetOrCreateDatasetSchema`
   (`superset/datasets/schemas.py:360-374`, `schema` has `Length(0, 250)` and
   `allow_none=True`).
   
   3. In `get_or_create_dataset`, `body = 
GetOrCreateDatasetSchema().load(request.json)`
   yields `body["schema"] == ""`, and the handler sets `schema = 
body.get("schema")` and
   calls `DatasetDAO.get_table_by_schema_and_name(database_id, schema, 
table_name)`
   (`superset/datasets/api.py:1088-1101`), passing the empty string directly to
   `DatasetDAO.get_table_by_schema_and_name`.
   
   4. `DatasetDAO.get_table_by_schema_and_name` at 
`superset/daos/dataset.py:48-58` executes
   `.filter_by(database_id=database_id, schema=schema, 
table_name=table_name).one_or_none()`,
   which with `schema=""` does not match the existing `SqlaTable` row where 
`schema` is
   `NULL`; `.one_or_none()` returns `None`, `get_or_create_dataset` incorrectly 
goes down the
   create path and calls `CreateDatasetCommand(body).run()` to create a second 
dataset for
   the same physical table, even though elsewhere Superset normalizes schema 
with `schema or
   None` (e.g. `Database.has_table` at `superset/models/core.py:27-33` and
   `SqlaTable.get_datasource_by_name` at 
`superset/connectors/sqla/models.py:1349-1368`), so
   requests differing only by `schema=""` versus missing `schema` are treated 
inconsistently.
   ```
   </details>
   
   [Fix in 
Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt_id=fbe615aef787401aabb3b32cfa70e26f&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
 | [Fix in VSCode 
Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt_id=fbe615aef787401aabb3b32cfa70e26f&service=github&base_url=https%3A%2F%2Fgithub.com&org=apache&repo=apache%2Fsuperset)
   
   *(Use Cmd/Ctrl + Click for best experience)*
   <details>
   <summary><b>Prompt for AI Agent 🤖 </b></summary>
   
   ```mdx
   This is a comment left during a code review.
   
   **Path:** superset/datasets/api.py
   **Line:** 1098:1101
   **Comment:**
        *Incorrect Condition Logic: `schema` is forwarded as-is from request 
body, but elsewhere Superset treats empty schema and null schema equivalently 
(`schema or None`). With the new strict equality lookup, requests with 
`schema=""` or missing `schema` can miss an existing dataset and incorrectly go 
down the create path. Normalize with `schema = body.get("schema") or None` 
before lookup so empty-string and null schema resolve consistently.
   
   Validate the correctness of the flagged issue. If correct, How can I resolve 
this? If you propose a fix, implement it and please make it concise.
   Once fix is implemented, also check other comments on the same PR, and ask 
user if the user wants to fix the rest of the comments as well. if said yes, 
then fetch all the comments validate the correctness and implement a minimal fix
   ```
   </details>
   <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40494&comment_hash=ae7221eac80b7d79e1d63d665bb634715ff9c2e3b4d2caf68fc3067237a1881e&reaction=like'>👍</a>
 | <a 
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40494&comment_hash=ae7221eac80b7d79e1d63d665bb634715ff9c2e3b4d2caf68fc3067237a1881e&reaction=dislike'>👎</a>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] fix(dataset-api): disambiguate get_or_create by schema [superset]

Reply via email to