codeant-ai-for-open-source[bot] commented on code in PR #37557:
URL: https://github.com/apache/superset/pull/37557#discussion_r2744246258
##########
superset/examples/generic_loader.py:
##########
@@ -34,6 +34,45 @@
logger = logging.getLogger(__name__)
+def _find_dataset(
+ table_name: str,
+ database_id: int,
+ uuid: Optional[str] = None,
+ schema: Optional[str] = None,
+) -> tuple[Optional[SqlaTable], bool]:
+ """Find a dataset by UUID first, then fall back to table_name +
database_id.
+
+ Includes schema in the fallback lookup to prevent cross-schema collisions.
+
+ This avoids unique constraint violations when a duplicate row exists.
+
+ Args:
+ table_name: The table name to look up
+ database_id: The database ID
+ uuid: Optional UUID to look up first
+ schema: Optional schema to include in fallback lookup
+
+ Returns:
+ A tuple of (dataset or None, found_by_uuid bool)
+ """
+ tbl = None
+ found_by_uuid = False
+
+ if uuid:
+ tbl = db.session.query(SqlaTable).filter_by(uuid=uuid).first()
+ if tbl:
+ found_by_uuid = True
+
+ if not tbl:
+ tbl = (
+ db.session.query(SqlaTable)
+ .filter_by(table_name=table_name, database_id=database_id,
schema=schema)
Review Comment:
**Suggestion:** The `_find_dataset` fallback query filters by
`schema=schema`, so when an existing dataset row has `schema=None` but the
caller passes a non-null schema (the common backfill case), the row is not
found; this prevents UUID/schema backfill from ever running and instead leads
to creating a new dataset row for the same physical table, causing duplicates
and leaving the original broken row in place. [logic error]
<details>
<summary><b>Severity Level:</b> Critical 🚨</summary>
```mdx
- ❌ Duplicate dataset rows created during examples import.
- ❌ Examples load creating new datasets instead of backfilling UUIDs.
- ⚠️ Backfill of legacy datasets (schema/UUID) fails silently.
- ⚠️ Example re-run produces dataset lookup collisions.
```
</details>
```suggestion
# First try to match on the exact schema for cross-schema safety
tbl = (
db.session.query(SqlaTable)
.filter_by(table_name=table_name, database_id=database_id,
schema=schema)
.first()
)
# If nothing matches the requested schema, fall back to rows with no
schema
# so we can backfill schema/UUID on legacy datasets that have
schema=None.
if not tbl and schema is not None:
tbl = (
db.session.query(SqlaTable)
.filter_by(table_name=table_name, database_id=database_id,
schema=None)
```
<details>
<summary><b>Steps of Reproduction ✅ </b></summary>
```mdx
1. Discover example loaders: discover_datasets() in
superset/examples/data_loading.py:143-150 calls create_generic_loader(...,
schema=config[\"schema\"], uuid=config.get(\"uuid\")) (file:
superset/examples/data_loading.py, lines 143-150). This yields a loader that
closes over a
non-null schema value when dataset.yaml specifies a schema.
2. Invoke the loader created by create_generic_loader: create_generic_loader
-> loader()
calls load_parquet_table(...) with the captured schema and uuid
(superset/examples/generic_loader.py:280-289). See create_generic_loader
definition and
loader invocation (generic_loader.py, lines 272-289).
3. load_parquet_table receives schema!=None and uuid and reaches the dataset
lookup: it
calls _find_dataset(table_name, database.id, uuid, schema)
(generic_loader.py, line 216).
_find_dataset first tries UUID then falls back to a single filter_by that
includes
schema=schema (generic_loader.py, lines 61-71 and 216-217).
4. Real-world problematic state: an existing SqlaTable row was created
earlier by the old
data-loading path without setting schema (row.schema is NULL) and uuid is
NULL. In that
case _find_dataset's fallback filter_by(..., schema=schema) will not match
the NULL-schema
row because schema!=NULL, so tbl remains None and found_by_uuid False. This
causes
load_parquet_table to treat the dataset as missing and create a new
SqlaTable row for the
same physical table, leaving the legacy NULL-schema row intact (duplicate
dataset row).
5. Concrete reproduction with current test scaffolding: run the sequence in
tests/unit_tests/examples/generic_loader_test.py using mocks that simulate
(a)
discover_datasets passing schema (data_loading.py:143-150), (b) mock DB
having an existing
row with schema=None and uuid=None, and (c) load_parquet_table being called
with
schema="public" and uuid set. With the existing code path, _find_dataset
will not return
the NULL-schema row (because schema="public" != None), and the loader will
create a new
dataset row instead of backfilling the legacy row. The mock scenarios in
tests demonstrate
related behaviors (see tests around lines 618-646 and 312-345 for backfill
expectations).
6. Why the improved code fixes it: after trying an exact schema match, the
suggested
change falls back to searching rows with schema=None when a schema was
requested (i.e.,
schema is not None). This allows the loader to find legacy rows with NULL
schema so the
code can backfill tbl.schema and/or tbl.uuid and avoid creating duplicate
dataset rows for
the same physical table.
7. Note on intentionality: current code intentionally includes schema in the
lookup to
avoid cross-schema collisions (tests cover schema-distinguishing behavior at
tests lines
471-506). The suggested change preserves exact-schema matching first
(maintaining
cross-schema safety) and only falls back to schema=None when no exact-schema
match exists,
which is consistent with the test expectations and addresses legacy backfill
scenarios.
```
</details>
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/examples/generic_loader.py
**Line:** 67:69
**Comment:**
*Logic Error: The `_find_dataset` fallback query filters by
`schema=schema`, so when an existing dataset row has `schema=None` but the
caller passes a non-null schema (the common backfill case), the row is not
found; this prevents UUID/schema backfill from ever running and instead leads
to creating a new dataset row for the same physical table, causing duplicates
and leaving the original broken row in place.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
##########
tests/unit_tests/examples/generic_loader_test.py:
##########
@@ -0,0 +1,731 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""Tests for the generic_loader module, specifically UUID handling."""
+
+from unittest.mock import MagicMock, patch
+
+import pandas as pd
+
+
+def _setup_database_mocks(
+ mock_get_db: MagicMock, mock_database: MagicMock, has_table: bool = False
+) -> MagicMock:
+ """Helper to set up common database mocks."""
+ mock_database.id = 1
+ mock_database.has_table.return_value = has_table
+ mock_get_db.return_value = mock_database
+
+ mock_engine = MagicMock()
+ mock_inspector = MagicMock()
+ mock_inspector.default_schema_name = "public"
+ mock_database.get_sqla_engine.return_value.__enter__ = MagicMock(
+ return_value=mock_engine
+ )
+ mock_database.get_sqla_engine.return_value.__exit__ =
MagicMock(return_value=False)
+
+ return mock_inspector
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+@patch("superset.examples.generic_loader.read_example_data")
+def test_load_parquet_table_sets_uuid_on_new_table(
+ mock_read_data: MagicMock,
+ mock_get_db: MagicMock,
+ mock_db: MagicMock,
+) -> None:
+ """Test that load_parquet_table sets UUID when creating a new SqlaTable."""
+ from superset.examples.generic_loader import load_parquet_table
+
+ mock_database = MagicMock()
+ mock_inspector = _setup_database_mocks(mock_get_db, mock_database,
has_table=False)
+
+ with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+ mock_inspect.return_value = mock_inspector
+
+ # No existing table by UUID or table_name
+
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+ None
+ )
+
+ mock_read_data.return_value = pd.DataFrame({"col1": [1, 2, 3]})
+
+ test_uuid = "14f48794-ebfa-4f60-a26a-582c49132f1b"
+
+ result = load_parquet_table(
+ parquet_file="test_data",
+ table_name="test_table",
+ database=mock_database,
+ only_metadata=True,
+ uuid=test_uuid,
+ )
+
+ assert result.uuid == test_uuid
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+def test_load_parquet_table_finds_existing_by_uuid_first(
+ mock_get_db: MagicMock,
+ mock_db: MagicMock,
+) -> None:
+ """Test that load_parquet_table looks up by UUID first when provided."""
+ from superset.examples.generic_loader import load_parquet_table
+
+ mock_database = MagicMock()
+ mock_inspector = _setup_database_mocks(mock_get_db, mock_database,
has_table=True)
+
+ with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+ mock_inspect.return_value = mock_inspector
+
+ # Existing table found by UUID
+ test_uuid = "existing-uuid-1234"
+ mock_existing_table = MagicMock()
+ mock_existing_table.uuid = test_uuid
+ mock_existing_table.table_name = "test_table"
+
+ # First call (by uuid) returns the table, second call (by table_name)
not needed
+
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+ mock_existing_table
+ )
+
+ result = load_parquet_table(
+ parquet_file="test_data",
+ table_name="test_table",
+ database=mock_database,
+ only_metadata=True,
+ uuid=test_uuid,
+ )
+
+ # Should return the existing table found by UUID
+ assert result.uuid == test_uuid
+ assert result is mock_existing_table
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+def test_load_parquet_table_backfills_uuid_on_existing_table(
+ mock_get_db: MagicMock,
+ mock_db: MagicMock,
+) -> None:
+ """Test that existing dataset with uuid=None gets UUID backfilled."""
+ from superset.examples.generic_loader import load_parquet_table
+
+ mock_database = MagicMock()
+ mock_inspector = _setup_database_mocks(mock_get_db, mock_database,
has_table=True)
+
+ with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+ mock_inspect.return_value = mock_inspector
+
+ # Existing table with NO UUID (needs backfill)
+ mock_existing_table = MagicMock()
+ mock_existing_table.uuid = None
+ mock_existing_table.table_name = "test_table"
+
+ # UUID lookup returns None, table_name lookup returns the table
+ def filter_by_side_effect(**kwargs):
+ mock_result = MagicMock()
+ if "uuid" in kwargs:
+ mock_result.first.return_value = None
+ else:
+ mock_result.first.return_value = mock_existing_table
+ return mock_result
+
+ mock_db.session.query.return_value.filter_by.side_effect =
filter_by_side_effect
+
+ new_uuid = "new-uuid-5678"
+
+ result = load_parquet_table(
+ parquet_file="test_data",
+ table_name="test_table",
+ database=mock_database,
+ only_metadata=True,
+ uuid=new_uuid,
+ )
+
+ # UUID should be backfilled
+ assert result.uuid == new_uuid
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+def test_load_parquet_table_avoids_uuid_collision(
+ mock_get_db: MagicMock,
+ mock_db: MagicMock,
+) -> None:
+ """Test that finding by UUID doesn't try to re-set UUID (avoids
collision)."""
+ from superset.examples.generic_loader import load_parquet_table
+
+ mock_database = MagicMock()
+ mock_inspector = _setup_database_mocks(mock_get_db, mock_database,
has_table=True)
+
+ with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+ mock_inspect.return_value = mock_inspector
+
+ # Table already has the UUID we're looking for
+ test_uuid = "existing-uuid-1234"
+ mock_existing_table = MagicMock()
+ mock_existing_table.uuid = test_uuid
+
+ # UUID lookup finds the table
+
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+ mock_existing_table
+ )
+
+ result = load_parquet_table(
+ parquet_file="test_data",
+ table_name="test_table",
+ database=mock_database,
+ only_metadata=True,
+ uuid=test_uuid,
+ )
+
+ # UUID should remain unchanged (not re-assigned)
+ assert result.uuid == test_uuid
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+def test_load_parquet_table_preserves_existing_different_uuid(
+ mock_get_db: MagicMock,
+ mock_db: MagicMock,
+) -> None:
+ """Test that if table has different UUID, we find it by UUID lookup
first."""
+ from superset.examples.generic_loader import load_parquet_table
+
+ mock_database = MagicMock()
+ mock_inspector = _setup_database_mocks(mock_get_db, mock_database,
has_table=True)
+
+ with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+ mock_inspect.return_value = mock_inspector
+
+ # A table exists with the target UUID
+ target_uuid = "target-uuid-1234"
+ mock_uuid_table = MagicMock()
+ mock_uuid_table.uuid = target_uuid
+
+ # UUID lookup finds the UUID-matching table
+
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+ mock_uuid_table
+ )
+
+ result = load_parquet_table(
+ parquet_file="test_data",
+ table_name="different_table_name",
+ database=mock_database,
+ only_metadata=True,
+ uuid=target_uuid,
+ )
+
+ # Should return the table found by UUID, not create new one
+ assert result is mock_uuid_table
+ assert result.uuid == target_uuid
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+@patch("superset.examples.generic_loader.read_example_data")
+def test_load_parquet_table_works_without_uuid(
+ mock_read_data: MagicMock,
+ mock_get_db: MagicMock,
+ mock_db: MagicMock,
+) -> None:
+ """Test that load_parquet_table still works when no UUID is provided."""
+ from superset.examples.generic_loader import load_parquet_table
+
+ mock_database = MagicMock()
+ mock_inspector = _setup_database_mocks(mock_get_db, mock_database,
has_table=False)
+
+ with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+ mock_inspect.return_value = mock_inspector
+
+ # No existing table
+
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+ None
+ )
+
+ mock_read_data.return_value = pd.DataFrame({"col1": [1, 2, 3]})
+
+ result = load_parquet_table(
+ parquet_file="test_data",
+ table_name="test_table",
+ database=mock_database,
+ only_metadata=True,
+ )
+
+ assert result is not None
+ assert result.table_name == "test_table"
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+@patch("superset.examples.generic_loader.read_example_data")
+def test_load_parquet_table_sets_schema_on_new_table(
+ mock_read_data: MagicMock,
+ mock_get_db: MagicMock,
+ mock_db: MagicMock,
+) -> None:
+ """Test that load_parquet_table sets schema when creating a new
SqlaTable."""
+ from superset.examples.generic_loader import load_parquet_table
+
+ mock_database = MagicMock()
+ mock_inspector = _setup_database_mocks(mock_get_db, mock_database,
has_table=False)
+
+ with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+ mock_inspect.return_value = mock_inspector
+
+ # No existing table
+
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+ None
+ )
+
+ mock_read_data.return_value = pd.DataFrame({"col1": [1, 2, 3]})
+
+ result = load_parquet_table(
+ parquet_file="test_data",
+ table_name="test_table",
+ database=mock_database,
+ only_metadata=True,
+ schema="custom_schema",
+ )
+
+ assert result is not None
+ assert result.schema == "custom_schema"
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+def test_load_parquet_table_backfills_schema_on_existing_table(
+ mock_get_db: MagicMock,
+ mock_db: MagicMock,
+) -> None:
+ """Test that existing dataset with schema=None gets schema backfilled."""
+ from superset.examples.generic_loader import load_parquet_table
+
+ mock_database = MagicMock()
+ mock_inspector = _setup_database_mocks(mock_get_db, mock_database,
has_table=True)
+
+ with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+ mock_inspect.return_value = mock_inspector
+
+ # Existing table with NO schema (needs backfill)
+ mock_existing_table = MagicMock()
+ mock_existing_table.uuid = "some-uuid"
+ mock_existing_table.schema = None
+ mock_existing_table.table_name = "test_table"
+
+
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+ mock_existing_table
+ )
+
+ result = load_parquet_table(
+ parquet_file="test_data",
+ table_name="test_table",
+ database=mock_database,
+ only_metadata=True,
+ schema="public",
+ )
+
+ # Schema should be backfilled
+ assert result.schema == "public"
+
+
+def test_create_generic_loader_passes_uuid() -> None:
+ """Test that create_generic_loader passes UUID to load_parquet_table."""
+ from superset.examples.generic_loader import create_generic_loader
+
+ test_uuid = "test-uuid-1234"
+ loader = create_generic_loader(
+ parquet_file="test_data",
+ table_name="test_table",
+ uuid=test_uuid,
+ )
+
+ assert loader is not None
+ assert callable(loader)
+ assert loader.__name__ == "load_test_data"
+
Review Comment:
**Suggestion:** The `test_create_generic_loader_passes_uuid` test never
calls the generated loader or inspects `load_parquet_table`'s arguments, so it
will still pass even if the UUID is not actually threaded through, failing to
detect regressions in the behavior it claims to validate. The fix is to patch
`load_parquet_table`, invoke the loader, and assert that `uuid` was passed in
the call. [logic error]
<details>
<summary><b>Severity Level:</b> Major ⚠️</summary>
```mdx
- ⚠️ Unit test suite fails to catch UUID propagation regressions.
- ⚠️ examples loading UUID bug can slip into releases.
- ⚠️ Affects tests under tests/unit_tests/examples.
```
</details>
```suggestion
@patch("superset.examples.generic_loader.load_parquet_table")
def test_create_generic_loader_passes_uuid(
mock_load_parquet: MagicMock,
) -> None:
"""Test that create_generic_loader passes UUID to load_parquet_table."""
from superset.examples.generic_loader import create_generic_loader
test_uuid = "test-uuid-1234"
loader = create_generic_loader(
parquet_file="test_data",
table_name="test_table",
uuid=test_uuid,
)
# Call the loader to trigger load_parquet_table
loader(True, False, False) # type: ignore[call-arg,arg-type]
# Verify UUID was passed through to load_parquet_table
mock_load_parquet.assert_called_once()
_, kwargs = mock_load_parquet.call_args
assert kwargs.get("uuid") == test_uuid
```
<details>
<summary><b>Steps of Reproduction ✅ </b></summary>
```mdx
1. Open tests file tests/unit_tests/examples/generic_loader_test.py and
locate the test
function `test_create_generic_loader_passes_uuid()` (file header shows this
at lines
347-361 in the PR). The current test only constructs the loader and asserts
its name
(lines 347-361).
2. Because the test never calls the returned loader, the internal call into
`load_parquet_table()` is never executed. This means regressions that drop
the uuid
argument from create_generic_loader -> loader -> load_parquet_table won't
fail the test.
3. Reproduce the regression locally by modifying create_generic_loader to
not forward uuid
(e.g., remove uuid kwarg when calling load_parquet_table) then run pytest
tests/unit_tests/examples/generic_loader_test.py::test_create_generic_loader_passes_uuid
-
the current test will still pass because it only asserts the loader's name
(no call).
4. Applying the improved test (patching load_parquet_table and calling
loader) makes the
regression fail: the patched mock will record call args and the assertion on
kwargs["uuid"] will detect missing propagation.
```
</details>
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** tests/unit_tests/examples/generic_loader_test.py
**Line:** 347:361
**Comment:**
*Logic Error: The `test_create_generic_loader_passes_uuid` test never
calls the generated loader or inspects `load_parquet_table`'s arguments, so it
will still pass even if the UUID is not actually threaded through, failing to
detect regressions in the behavior it claims to validate. The fix is to patch
`load_parquet_table`, invoke the loader, and assert that `uuid` was passed in
the call.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
##########
tests/unit_tests/examples/generic_loader_test.py:
##########
@@ -0,0 +1,731 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+"""Tests for the generic_loader module, specifically UUID handling."""
+
+from unittest.mock import MagicMock, patch
+
+import pandas as pd
+
+
+def _setup_database_mocks(
+ mock_get_db: MagicMock, mock_database: MagicMock, has_table: bool = False
+) -> MagicMock:
+ """Helper to set up common database mocks."""
+ mock_database.id = 1
+ mock_database.has_table.return_value = has_table
+ mock_get_db.return_value = mock_database
+
+ mock_engine = MagicMock()
+ mock_inspector = MagicMock()
+ mock_inspector.default_schema_name = "public"
+ mock_database.get_sqla_engine.return_value.__enter__ = MagicMock(
+ return_value=mock_engine
+ )
+ mock_database.get_sqla_engine.return_value.__exit__ =
MagicMock(return_value=False)
+
+ return mock_inspector
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+@patch("superset.examples.generic_loader.read_example_data")
+def test_load_parquet_table_sets_uuid_on_new_table(
+ mock_read_data: MagicMock,
+ mock_get_db: MagicMock,
+ mock_db: MagicMock,
+) -> None:
+ """Test that load_parquet_table sets UUID when creating a new SqlaTable."""
+ from superset.examples.generic_loader import load_parquet_table
+
+ mock_database = MagicMock()
+ mock_inspector = _setup_database_mocks(mock_get_db, mock_database,
has_table=False)
+
+ with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+ mock_inspect.return_value = mock_inspector
+
+ # No existing table by UUID or table_name
+
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+ None
+ )
+
+ mock_read_data.return_value = pd.DataFrame({"col1": [1, 2, 3]})
+
+ test_uuid = "14f48794-ebfa-4f60-a26a-582c49132f1b"
+
+ result = load_parquet_table(
+ parquet_file="test_data",
+ table_name="test_table",
+ database=mock_database,
+ only_metadata=True,
+ uuid=test_uuid,
+ )
+
+ assert result.uuid == test_uuid
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+def test_load_parquet_table_finds_existing_by_uuid_first(
+ mock_get_db: MagicMock,
+ mock_db: MagicMock,
+) -> None:
+ """Test that load_parquet_table looks up by UUID first when provided."""
+ from superset.examples.generic_loader import load_parquet_table
+
+ mock_database = MagicMock()
+ mock_inspector = _setup_database_mocks(mock_get_db, mock_database,
has_table=True)
+
+ with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+ mock_inspect.return_value = mock_inspector
+
+ # Existing table found by UUID
+ test_uuid = "existing-uuid-1234"
+ mock_existing_table = MagicMock()
+ mock_existing_table.uuid = test_uuid
+ mock_existing_table.table_name = "test_table"
+
+ # First call (by uuid) returns the table, second call (by table_name)
not needed
+
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+ mock_existing_table
+ )
+
+ result = load_parquet_table(
+ parquet_file="test_data",
+ table_name="test_table",
+ database=mock_database,
+ only_metadata=True,
+ uuid=test_uuid,
+ )
+
+ # Should return the existing table found by UUID
+ assert result.uuid == test_uuid
+ assert result is mock_existing_table
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+def test_load_parquet_table_backfills_uuid_on_existing_table(
+ mock_get_db: MagicMock,
+ mock_db: MagicMock,
+) -> None:
+ """Test that existing dataset with uuid=None gets UUID backfilled."""
+ from superset.examples.generic_loader import load_parquet_table
+
+ mock_database = MagicMock()
+ mock_inspector = _setup_database_mocks(mock_get_db, mock_database,
has_table=True)
+
+ with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+ mock_inspect.return_value = mock_inspector
+
+ # Existing table with NO UUID (needs backfill)
+ mock_existing_table = MagicMock()
+ mock_existing_table.uuid = None
+ mock_existing_table.table_name = "test_table"
+
+ # UUID lookup returns None, table_name lookup returns the table
+ def filter_by_side_effect(**kwargs):
+ mock_result = MagicMock()
+ if "uuid" in kwargs:
+ mock_result.first.return_value = None
+ else:
+ mock_result.first.return_value = mock_existing_table
+ return mock_result
+
+ mock_db.session.query.return_value.filter_by.side_effect =
filter_by_side_effect
+
+ new_uuid = "new-uuid-5678"
+
+ result = load_parquet_table(
+ parquet_file="test_data",
+ table_name="test_table",
+ database=mock_database,
+ only_metadata=True,
+ uuid=new_uuid,
+ )
+
+ # UUID should be backfilled
+ assert result.uuid == new_uuid
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+def test_load_parquet_table_avoids_uuid_collision(
+ mock_get_db: MagicMock,
+ mock_db: MagicMock,
+) -> None:
+ """Test that finding by UUID doesn't try to re-set UUID (avoids
collision)."""
+ from superset.examples.generic_loader import load_parquet_table
+
+ mock_database = MagicMock()
+ mock_inspector = _setup_database_mocks(mock_get_db, mock_database,
has_table=True)
+
+ with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+ mock_inspect.return_value = mock_inspector
+
+ # Table already has the UUID we're looking for
+ test_uuid = "existing-uuid-1234"
+ mock_existing_table = MagicMock()
+ mock_existing_table.uuid = test_uuid
+
+ # UUID lookup finds the table
+
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+ mock_existing_table
+ )
+
+ result = load_parquet_table(
+ parquet_file="test_data",
+ table_name="test_table",
+ database=mock_database,
+ only_metadata=True,
+ uuid=test_uuid,
+ )
+
+ # UUID should remain unchanged (not re-assigned)
+ assert result.uuid == test_uuid
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+def test_load_parquet_table_preserves_existing_different_uuid(
+ mock_get_db: MagicMock,
+ mock_db: MagicMock,
+) -> None:
+ """Test that if table has different UUID, we find it by UUID lookup
first."""
+ from superset.examples.generic_loader import load_parquet_table
+
+ mock_database = MagicMock()
+ mock_inspector = _setup_database_mocks(mock_get_db, mock_database,
has_table=True)
+
+ with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+ mock_inspect.return_value = mock_inspector
+
+ # A table exists with the target UUID
+ target_uuid = "target-uuid-1234"
+ mock_uuid_table = MagicMock()
+ mock_uuid_table.uuid = target_uuid
+
+ # UUID lookup finds the UUID-matching table
+
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+ mock_uuid_table
+ )
+
+ result = load_parquet_table(
+ parquet_file="test_data",
+ table_name="different_table_name",
+ database=mock_database,
+ only_metadata=True,
+ uuid=target_uuid,
+ )
+
+ # Should return the table found by UUID, not create new one
+ assert result is mock_uuid_table
+ assert result.uuid == target_uuid
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+@patch("superset.examples.generic_loader.read_example_data")
+def test_load_parquet_table_works_without_uuid(
+ mock_read_data: MagicMock,
+ mock_get_db: MagicMock,
+ mock_db: MagicMock,
+) -> None:
+ """Test that load_parquet_table still works when no UUID is provided."""
+ from superset.examples.generic_loader import load_parquet_table
+
+ mock_database = MagicMock()
+ mock_inspector = _setup_database_mocks(mock_get_db, mock_database,
has_table=False)
+
+ with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+ mock_inspect.return_value = mock_inspector
+
+ # No existing table
+
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+ None
+ )
+
+ mock_read_data.return_value = pd.DataFrame({"col1": [1, 2, 3]})
+
+ result = load_parquet_table(
+ parquet_file="test_data",
+ table_name="test_table",
+ database=mock_database,
+ only_metadata=True,
+ )
+
+ assert result is not None
+ assert result.table_name == "test_table"
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+@patch("superset.examples.generic_loader.read_example_data")
+def test_load_parquet_table_sets_schema_on_new_table(
+ mock_read_data: MagicMock,
+ mock_get_db: MagicMock,
+ mock_db: MagicMock,
+) -> None:
+ """Test that load_parquet_table sets schema when creating a new
SqlaTable."""
+ from superset.examples.generic_loader import load_parquet_table
+
+ mock_database = MagicMock()
+ mock_inspector = _setup_database_mocks(mock_get_db, mock_database,
has_table=False)
+
+ with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+ mock_inspect.return_value = mock_inspector
+
+ # No existing table
+
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+ None
+ )
+
+ mock_read_data.return_value = pd.DataFrame({"col1": [1, 2, 3]})
+
+ result = load_parquet_table(
+ parquet_file="test_data",
+ table_name="test_table",
+ database=mock_database,
+ only_metadata=True,
+ schema="custom_schema",
+ )
+
+ assert result is not None
+ assert result.schema == "custom_schema"
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+def test_load_parquet_table_backfills_schema_on_existing_table(
+ mock_get_db: MagicMock,
+ mock_db: MagicMock,
+) -> None:
+ """Test that existing dataset with schema=None gets schema backfilled."""
+ from superset.examples.generic_loader import load_parquet_table
+
+ mock_database = MagicMock()
+ mock_inspector = _setup_database_mocks(mock_get_db, mock_database,
has_table=True)
+
+ with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+ mock_inspect.return_value = mock_inspector
+
+ # Existing table with NO schema (needs backfill)
+ mock_existing_table = MagicMock()
+ mock_existing_table.uuid = "some-uuid"
+ mock_existing_table.schema = None
+ mock_existing_table.table_name = "test_table"
+
+
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+ mock_existing_table
+ )
+
+ result = load_parquet_table(
+ parquet_file="test_data",
+ table_name="test_table",
+ database=mock_database,
+ only_metadata=True,
+ schema="public",
+ )
+
+ # Schema should be backfilled
+ assert result.schema == "public"
+
+
+def test_create_generic_loader_passes_uuid() -> None:
+ """Test that create_generic_loader passes UUID to load_parquet_table."""
+ from superset.examples.generic_loader import create_generic_loader
+
+ test_uuid = "test-uuid-1234"
+ loader = create_generic_loader(
+ parquet_file="test_data",
+ table_name="test_table",
+ uuid=test_uuid,
+ )
+
+ assert loader is not None
+ assert callable(loader)
+ assert loader.__name__ == "load_test_data"
+
+
+@patch("superset.examples.generic_loader.db")
+def test_find_dataset_returns_uuid_match_first(mock_db: MagicMock) -> None:
+ """Test that _find_dataset returns UUID match over table_name match."""
+ from superset.examples.generic_loader import _find_dataset
+
+ # Row with UUID (should be found first)
+ uuid_row = MagicMock()
+ uuid_row.uuid = "target-uuid"
+ uuid_row.table_name = "table_a"
+
+ # Row without UUID (same table_name as query)
+ tablename_row = MagicMock()
+ tablename_row.uuid = None
+ tablename_row.table_name = "test_table"
+
+ # UUID lookup returns uuid_row
+
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+ uuid_row
+ )
+
+ result, found_by_uuid = _find_dataset("test_table", 1, "target-uuid",
"public")
+
+ assert result is uuid_row
+ assert found_by_uuid is True
+
+
+@patch("superset.examples.generic_loader.db")
+def test_find_dataset_falls_back_to_table_name(mock_db: MagicMock) -> None:
+ """Test that _find_dataset falls back to table_name when UUID not found."""
+ from superset.examples.generic_loader import _find_dataset
+
+ tablename_row = MagicMock()
+ tablename_row.uuid = None
+ tablename_row.table_name = "test_table"
+
+ # UUID lookup returns None, table_name lookup returns the row
+ def filter_by_side_effect(**kwargs):
+ mock_result = MagicMock()
+ if "uuid" in kwargs:
+ mock_result.first.return_value = None
+ else:
+ mock_result.first.return_value = tablename_row
+ return mock_result
+
+ mock_db.session.query.return_value.filter_by.side_effect =
filter_by_side_effect
+
+ result, found_by_uuid = _find_dataset("test_table", 1, "nonexistent-uuid",
"public")
+
+ assert result is tablename_row
+ assert found_by_uuid is False
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+@patch("superset.examples.generic_loader.read_example_data")
+def test_load_parquet_table_duplicate_rows_table_missing(
+ mock_read_data: MagicMock,
+ mock_get_db: MagicMock,
+ mock_db: MagicMock,
+) -> None:
+ """Test UUID-first lookup when duplicate rows exist and physical table is
missing.
+
+ Scenario:
+ - Row A: table_name="test_table", uuid="target-uuid" (correct row)
+ - Row B: table_name="test_table", uuid=None (duplicate from broken run)
+ - Physical table was dropped
+
+ The UUID-first lookup should find Row A and avoid collision.
+ """
+ from superset.examples.generic_loader import load_parquet_table
+
+ mock_database = MagicMock()
+ mock_inspector = _setup_database_mocks(
+ mock_get_db,
+ mock_database,
+ has_table=False, # Table was dropped
+ )
+
+ with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+ mock_inspect.return_value = mock_inspector
+
+ # Row with UUID (should be found by UUID lookup)
+ uuid_row = MagicMock()
+ uuid_row.uuid = "target-uuid"
+ uuid_row.table_name = "test_table"
+ uuid_row.database = mock_database
+
+ # UUID lookup finds the correct row
+
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+ uuid_row
+ )
+
+ mock_read_data.return_value = pd.DataFrame({"col1": [1, 2, 3]})
+
+ result = load_parquet_table(
+ parquet_file="test_data",
+ table_name="test_table",
+ database=mock_database,
+ only_metadata=True,
+ uuid="target-uuid",
+ )
+
+ # Should return the UUID row, not try to backfill (which would collide)
+ assert result is uuid_row
+ assert result.uuid == "target-uuid"
+
+
+@patch("superset.examples.generic_loader.db")
+def test_find_dataset_distinguishes_schemas(mock_db: MagicMock) -> None:
+ """Test that _find_dataset uses schema to distinguish same-name tables.
+
+ Scenario:
+ - Row A: table_name="users", schema="schema_a", uuid=None
+ - Row B: table_name="users", schema="schema_b", uuid=None
+
+ Looking up "users" in "schema_b" should find Row B, not Row A.
+ """
+ from superset.examples.generic_loader import _find_dataset
+
+ # Row in schema_b (should be found)
+ schema_b_row = MagicMock()
+ schema_b_row.uuid = None
+ schema_b_row.table_name = "users"
+ schema_b_row.schema = "schema_b"
+
+ # No UUID lookup (uuid not provided), table_name lookup returns schema_b
row
+ def filter_by_side_effect(**kwargs):
+ mock_result = MagicMock()
+ if "uuid" in kwargs:
+ mock_result.first.return_value = None
+ elif kwargs.get("schema") == "schema_b":
+ mock_result.first.return_value = schema_b_row
+ else:
+ mock_result.first.return_value = None # schema_a not requested
+ return mock_result
+
+ mock_db.session.query.return_value.filter_by.side_effect =
filter_by_side_effect
+
+ result, found_by_uuid = _find_dataset("users", 1, None, "schema_b")
+
+ assert result is schema_b_row
+ assert found_by_uuid is False
+ assert result.schema == "schema_b"
+
+
+@patch("superset.examples.generic_loader.db")
+def test_find_dataset_no_uuid_no_schema(mock_db: MagicMock) -> None:
+ """Test _find_dataset with no UUID and no schema (basic lookup)."""
+ from superset.examples.generic_loader import _find_dataset
+
+ basic_row = MagicMock()
+ basic_row.uuid = None
+ basic_row.table_name = "test_table"
+ basic_row.schema = None
+
+ # No UUID provided, so skip UUID lookup; table_name+database_id lookup
returns row
+
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+ basic_row
+ )
+
+ result, found_by_uuid = _find_dataset("test_table", 1, None, None)
+
+ assert result is basic_row
+ assert found_by_uuid is False
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+def test_load_parquet_table_no_backfill_when_uuid_already_set(
+ mock_get_db: MagicMock,
+ mock_db: MagicMock,
+) -> None:
+ """Test that existing UUID is preserved (not overwritten) during
backfill."""
+ from superset.examples.generic_loader import load_parquet_table
+
+ mock_database = MagicMock()
+ mock_inspector = _setup_database_mocks(mock_get_db, mock_database,
has_table=True)
+
+ with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+ mock_inspect.return_value = mock_inspector
+
+ # Existing table already has a UUID set
+ mock_existing_table = MagicMock()
+ mock_existing_table.uuid = "existing-uuid-1234"
+ mock_existing_table.schema = "public"
+ mock_existing_table.table_name = "test_table"
+
+
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+ mock_existing_table
+ )
+
+ result = load_parquet_table(
+ parquet_file="test_data",
+ table_name="test_table",
+ database=mock_database,
+ only_metadata=True,
+ uuid="new-uuid-5678", # Try to set a different UUID
+ )
+
+ # Existing UUID should be preserved, not overwritten
+ assert result.uuid == "existing-uuid-1234"
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+def test_load_parquet_table_no_backfill_when_schema_already_set(
+ mock_get_db: MagicMock,
+ mock_db: MagicMock,
+) -> None:
+ """Test that existing schema is preserved (not overwritten) during
backfill."""
+ from superset.examples.generic_loader import load_parquet_table
+
+ mock_database = MagicMock()
+ mock_inspector = _setup_database_mocks(mock_get_db, mock_database,
has_table=True)
+
+ with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+ mock_inspect.return_value = mock_inspector
+
+ # Existing table already has a schema set
+ mock_existing_table = MagicMock()
+ mock_existing_table.uuid = "some-uuid"
+ mock_existing_table.schema = "existing_schema"
+ mock_existing_table.table_name = "test_table"
+
+
mock_db.session.query.return_value.filter_by.return_value.first.return_value = (
+ mock_existing_table
+ )
+
+ result = load_parquet_table(
+ parquet_file="test_data",
+ table_name="test_table",
+ database=mock_database,
+ only_metadata=True,
+ schema="new_schema", # Try to set a different schema
+ )
+
+ # Existing schema should be preserved, not overwritten
+ assert result.schema == "existing_schema"
+
+
+@patch("superset.examples.generic_loader.db")
+@patch("superset.examples.generic_loader.get_example_database")
+def test_load_parquet_table_both_uuid_and_schema_backfill(
+ mock_get_db: MagicMock,
+ mock_db: MagicMock,
+) -> None:
+ """Test that both UUID and schema are backfilled in a single call."""
+ from superset.examples.generic_loader import load_parquet_table
+
+ mock_database = MagicMock()
+ mock_inspector = _setup_database_mocks(mock_get_db, mock_database,
has_table=True)
+
+ with patch("superset.examples.generic_loader.inspect") as mock_inspect:
+ mock_inspect.return_value = mock_inspector
+
+ # Existing table with neither UUID nor schema
+ mock_existing_table = MagicMock()
+ mock_existing_table.uuid = None
+ mock_existing_table.schema = None
+ mock_existing_table.table_name = "test_table"
+
+ # UUID lookup returns None, table_name lookup returns the table
+ def filter_by_side_effect(**kwargs):
+ mock_result = MagicMock()
+ if "uuid" in kwargs:
+ mock_result.first.return_value = None
+ else:
+ mock_result.first.return_value = mock_existing_table
+ return mock_result
+
+ mock_db.session.query.return_value.filter_by.side_effect =
filter_by_side_effect
+
+ result = load_parquet_table(
+ parquet_file="test_data",
+ table_name="test_table",
+ database=mock_database,
+ only_metadata=True,
+ uuid="new-uuid",
+ schema="new_schema",
+ )
+
+ # Both should be backfilled
+ assert result.uuid == "new-uuid"
+ assert result.schema == "new_schema"
+
+
+def test_create_generic_loader_passes_schema() -> None:
+ """Test that create_generic_loader passes schema to load_parquet_table."""
+ from superset.examples.generic_loader import create_generic_loader
+
+ test_schema = "custom_schema"
+ loader = create_generic_loader(
+ parquet_file="test_data",
+ table_name="test_table",
+ schema=test_schema,
+ )
+
+ assert loader is not None
+ assert callable(loader)
+ assert loader.__name__ == "load_test_data"
+
Review Comment:
**Suggestion:** The `test_create_generic_loader_passes_schema` test
similarly only inspects the returned loader object and never verifies that
`schema` is actually forwarded into `load_parquet_table`, so it will not fail
if schema propagation is broken; patching `load_parquet_table`, invoking the
loader, and asserting on the `schema` argument fixes this. [logic error]
<details>
<summary><b>Severity Level:</b> Major ⚠️</summary>
```mdx
- ⚠️ Unit test suite misses schema propagation regressions.
- ⚠️ Example dataset schema mismatches at load time.
- ⚠️ Affects tests under tests/unit_tests/examples.
```
</details>
```suggestion
@patch("superset.examples.generic_loader.load_parquet_table")
def test_create_generic_loader_passes_schema(
mock_load_parquet: MagicMock,
) -> None:
"""Test that create_generic_loader passes schema to
load_parquet_table."""
from superset.examples.generic_loader import create_generic_loader
test_schema = "custom_schema"
loader = create_generic_loader(
parquet_file="test_data",
table_name="test_table",
schema=test_schema,
)
# Call the loader to trigger load_parquet_table
loader(True, False, False) # type: ignore[call-arg,arg-type]
# Verify schema was passed through to load_parquet_table
mock_load_parquet.assert_called_once()
_, kwargs = mock_load_parquet.call_args
assert kwargs.get("schema") == test_schema
```
<details>
<summary><b>Steps of Reproduction ✅ </b></summary>
```mdx
1. Open tests/unit_tests/examples/generic_loader_test.py and find
`test_create_generic_loader_passes_schema()` (lines 649-663). The test only
creates the
loader and asserts its name, never exercising the loader call path.
2. Simulate a regression by changing create_generic_loader to drop the
schema kwarg when
constructing the inner loader function. Run pytest
tests/unit_tests/examples/generic_loader_test.py::test_create_generic_loader_passes_schema
— the current test will still pass because it doesn't call the loader.
3. With the improved test (patching load_parquet_table and invoking the
loader), running
the same pytest target will record the mocked call and assert that kwargs
contain
"schema", failing if schema was omitted.
4. This verifies the runtime propagation from create_generic_loader ->
returned loader ->
load_parquet_table, preventing silent regressions.
```
</details>
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** tests/unit_tests/examples/generic_loader_test.py
**Line:** 649:663
**Comment:**
*Logic Error: The `test_create_generic_loader_passes_schema` test
similarly only inspects the returned loader object and never verifies that
`schema` is actually forwarded into `load_parquet_table`, so it will not fail
if schema propagation is broken; patching `load_parquet_table`, invoking the
loader, and asserting on the `schema` argument fixes this.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
```
</details>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]