durgaprasadml commented on PR #40935:
URL: https://github.com/apache/superset/pull/40935#issuecomment-4756742511
> **Code Review — PR #40935** HEAD SHA:
`79fc88284963841f0122f23a1bc1d0a65616e622`
>
> ### Scan Coverage (changed Python files)
> Scan Result
> Inline imports 1 instance — new `UpdateDatasetCommand` import inside
nested `if` block (line 88), follows existing file convention
> `any` types 0 violations in new code
> SQL injection / expression concatenation 0 new paths; `expression`
fields go through `UpdateDatasetCommand.validate()` /
`validate_stored_expression`
> RBAC decorator `@tool(tags=["mutate"],
class_permission_name="Dataset", method_permission_name="write")` — correct
> Error handling completeness 2 orphan paths (see B1, H2 below)
> `f-string` / string formatting Clean conversion from `%`-formatting;
no regressions
> Pydantic model typing New models typed correctly; 1 style
inconsistency (see M3)
> Test behavioral coverage Missing failure-path and asymmetric tests (see
M2)
> Remaining 12 scans 0 violations
> ### BLOCKER
> **B1 — Orphan dataset on update failure (`create_virtual_dataset.py`)**
>
> The two-step create-then-update flow at lines 85–116 has no compensating
rollback. `CreateDatasetCommand.run()` runs inside its own `@transaction`
decorator and commits immediately. If `UpdateDatasetCommand` subsequently
fails, the caller receives `id=None` and an error message, but the dataset
already exists in the database.
>
> **Path A** — `except (DatasetCreateFailedError, DatasetUpdateFailedError)`
at line 151: When `DatasetUpdateFailedError` is raised, the response returns
`id=None` (line 153). The caller cannot distinguish "dataset was never created"
from "dataset was created but metric/column update failed." On retry,
`CreateDatasetCommand` raises `DatasetExistsValidationError` because the name
is taken, permanently blocking clean creation with that name.
>
> **Path B** — `except DatasetInvalidError` at line 139:
`UpdateDatasetCommand.validate()` can also raise `DatasetInvalidError` (e.g.,
invalid metric SQL expression that fails `validate_stored_expression`). This is
caught by the same handler that covers SQL parse errors from the create phase.
The response says "Virtual dataset validation failed" — indistinguishable from
a SQL error — but the dataset IS persisted. The caller has no ID to clean up
with.
>
> `CreateDatasetCommand` does not appear to accept `metrics`/`columns`
directly (the command only processes `database`, `table_name`, `sql`, `schema`,
`catalog`, `description`, `owners`), so a single-command path may not be
available. Suggested fixes:
>
> 1. Split the except block: catch `DatasetUpdateFailedError` separately and
return the created `id` alongside the error so the caller can clean up or
accept the partial result.
> 2. Wrap both commands in a manual `db.session` transaction scope and roll
back on update failure.
> 3. On `UpdateDatasetCommand` failure, call
`DeleteDatasetCommand(dataset.id).run()` before returning the error response.
>
> Corroborates codeant findings #4 (id 3398301994, line 163) and #5 (id
3398312067, line 118), both still open. Path B (via `DatasetInvalidError`) is
not covered by those comments.
>
> ### HIGH
> **H1 — Wrong error string for update failures
(`create_virtual_dataset.py:160`)**
>
> ```python
> error=f"Failed to create dataset: {exc}",
> ```
>
> This fires for both `DatasetCreateFailedError` and
`DatasetUpdateFailedError`. An LLM caller reading this message will incorrectly
conclude the dataset was never created and will retry — producing the
name-collision error from Path A above. At minimum, split the exception
handlers or embed the phase in the message:
>
> ```python
> error=f"Dataset was created (id={dataset.id}) but metric/column update
failed: {exc}",
> ```
>
> Corroborates codeant finding #8 (id 3433606079, line 153), still open.
>
> **H2 — `to_dict()` format compatibility with `UpdateDatasetCommand` input
(`create_virtual_dataset.py:94,104`)**
>
> ```python
> existing_metrics = [m.to_dict() for m in dataset.metrics] # line 94
> existing_cols = [c.to_dict() for c in dataset.columns] # line 104
> ```
>
> `SqlaTable.metrics` and `SqlaTable.columns` are SQLAlchemy ORM objects.
`to_dict()` on them returns dicts that include internal DB fields (`id`,
`created_on`, `changed_on`, relationship back-references, etc.).
`UpdateDatasetCommand.validate()` checks for duplicate metric names and column
names against the existing model. Passing back metrics or columns that include
their ORM `id` fields may cause the validation layer to treat them as
update-in-place (correct) or raise `DatasetColumnsExistsValidationError` /
`DatasetMetricsExistsValidationError` depending on how the validator resolves
identity.
>
> For a freshly created dataset with no user-defined metrics,
`dataset.metrics == []` and this is benign. But `dataset.columns` contains
SQL-inferred columns immediately after `CreateDatasetCommand.run()`, so the
`to_dict()` path IS exercised whenever `calculated_columns` is provided. Verify
that passing inferred columns with their `id` fields in the update payload does
not cause spurious duplicate-detection errors. If it does, filter out `id`
before building `existing_cols`.
>
> ### MEDIUM
> **M1 — Column relationship untouched when only `metrics` is updated
(`create_virtual_dataset.py:87–116`)**
>
> When `request.metrics` is set and `request.calculated_columns` is `None`,
`update_props` only contains the `"metrics"` key. `UpdateDatasetCommand` is
invoked without a `"columns"` key. Whether `DatasetDAO.update` with
`attributes={"metrics": [...]}` leaves the `columns` relationship intact
depends on the DAO implementation. If the DAO unconditionally processes all
known relationship keys (clearing absent ones), SQL-inferred columns are
silently deleted. Verify and add a test for the metrics-only path.
>
> **M2 — Missing test coverage for failure and asymmetric paths
(`test_dataset_tools.py`)**
>
> `test_create_virtual_dataset_with_metrics_and_columns` covers the success
case only, with both fields set. Missing tests:
>
> * `UpdateDatasetCommand` raises `DatasetUpdateFailedError` — documents the
orphan-ID behavior for B1/Path A
> * `UpdateDatasetCommand` raises `DatasetInvalidError` — documents the
wrong-handler path for B1/Path B
> * Only `metrics` provided, `calculated_columns` omitted — verifies columns
relationship is untouched (M1)
> * Only `calculated_columns` provided, `metrics` omitted — symmetric check
>
> **M3 — `List[...]` vs `list[...]` inconsistency (`schemas.py:537,542`)**
>
> ```python
> metrics: List[CreateDatasetMetric] | None = Field(...) # line 537
> calculated_columns: List[CreateDatasetCalculatedColumn] | None # line 542
> ```
>
> The file imports `List` from `typing` (line 25) and uses it here, while
the rest of the file uses the Python 3.9+ lowercase `list[...]` syntax (e.g.,
`list[str]`). Change to `list[CreateDatasetMetric] | None` and
`list[CreateDatasetCalculatedColumn] | None`. Remove `List` from the `from
typing import ...` line if it becomes unused after this change.
>
> **M4 — `security-events: write` permission added without explanation
(`.github/workflows/github-action-validator.yml:14`)**
>
> Adding `security-events: write` grants the workflow permission to upload
SARIF results to the repository Security tab. The PR description does not
mention this change or identify which step requires it. If this is for a newly
added CodeQL/Trivy scan, add an inline comment. If it is unrelated to this PR's
intent, move it to a dedicated CI PR so the security permission grant is
visible on its own.
>
> ### NIT
> **N1 — Redundant `getattr` defensive guards
(`create_virtual_dataset.py:93–96,103–106`)**
>
> `SqlaTable.metrics` and `SqlaTable.columns` are declared SQLAlchemy
relationships and always exist on the object. `getattr(dataset, "metrics",
None)` is equivalent to `dataset.metrics`. The truthiness check (`if
getattr(dataset, "metrics", None)`) also evaluates identically to `if
dataset.metrics`, which itself is unnecessary since `[m.to_dict() for m in []]`
already returns `[]`. Simplify to:
>
> ```python
> existing_metrics = [m.to_dict() for m in dataset.metrics]
> existing_cols = [c.to_dict() for c in dataset.columns]
> ```
>
> **N2 — `UpdateDatasetCommand` import placed inside nested `if` block
(`create_virtual_dataset.py:88`)**
>
> The other three imports in this function are at the first level inside
`try`. `UpdateDatasetCommand` is additionally nested inside `with
event_logger.log_context(...)` and `if request.metrics or
request.calculated_columns:`. Move it to the top of the `try` block alongside
the other lazy imports.
>
> **N3 — Missing class docstrings on new Pydantic models
(`schemas.py:413,423`)**
>
> `CreateDatasetMetric` (line 413) and `CreateDatasetCalculatedColumn` (line
423) have no class-level docstrings. Corroborates codeant findings #6 (id
3433547167) and #7 (id 3433547174), both open.
>
> ### Codeant finding status
> Comment ID Summary Status
> 3398301994 Orphan dataset on update failure **OPEN** — corroborated
by B1/Path A
> 3398312067 Orphan/retry collision on two-step **OPEN** — corroborated
by B1/Path A
> 3433606079 Wrong error string for update failures **OPEN** — corroborated
by H1
> 3433547167 Missing docstring on `CreateDatasetMetric` **OPEN** —
corroborated by N3
> 3433547174 Missing docstring on `CreateDatasetCalculatedColumn`
**OPEN** — corroborated by N3
> 3389475131 UpdateDatasetCommand replaces not appends **NOT VALID** —
code explicitly merges `dataset.metrics` with new metrics at lines 93–100
> 3389489799 `metrics`/`calculated_columns` accept arbitrary dicts without
validation **NOT VALID** — `CreateDatasetMetric` and
`CreateDatasetCalculatedColumn` are strongly-typed Pydantic models
> 3389489805 Passing only new metrics deletes existing ones **NOT VALID** —
code merges existing metrics before calling UpdateDatasetCommand
> ### Non-MCP changes
> * `create_chart_guided.py` / `get_chart_type_schema.py`: The
`sql_expression` + `label` guidance and example are correct and consistent with
the Explore UI adhoc metric structure. The `saved_metric: true` JSON uses
lowercase `true`. No issues.
> * `setup-docker/action.yml`: Adding `&& inputs.dockerhub-user != '' &&
inputs.dockerhub-token != ''` to the login condition is a defensive
improvement. No issues.
Thanks for the detailed review.
The orphaned dataset scenario on update failure is a good catch. My original
implementation assumed the create/update sequence would be treated as a single
logical operation from the caller perspective, but since CreateDatasetCommand
commits independently, a failed update can indeed leave behind a partially
configured dataset.
I’ll review the available options (cleanup on failure vs. surfacing the
created dataset id) and align the implementation with existing dataset command
patterns before making changes.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]