codeant-ai-for-open-source[bot] commented on code in PR #40004:
URL: https://github.com/apache/superset/pull/40004#discussion_r3215791235
##########
superset/utils/pandas_postprocessing/pivot.py:
##########
@@ -27,6 +28,35 @@
)
+def _restore_dropped_metric_columns(
+ df: DataFrame, expected_metrics: list[str]
+) -> DataFrame:
+ """Re-add metric columns that pivot_table dropped due to all-NaN values.
+
+ When drop_missing_columns=True, pandas pivot_table silently removes columns
+ whose entries are all NaN. This breaks downstream post-processing steps
+ (rename, rolling) that use validate_column_args to assert the columns
exist.
+ Restoring the columns as all-NaN preserves the expected schema.
+ """
+ if isinstance(df.columns, pd.MultiIndex):
+ existing_metrics = set(df.columns.get_level_values(0))
+ missing = [m for m in expected_metrics if m not in existing_metrics]
+ if missing:
+ category_values = (
+ df.columns.get_level_values(-1).unique()
+ if len(df.columns) > 0
+ else [None]
+ )
+ for metric in missing:
+ for cat in category_values:
+ df[(metric, cat)] = float("nan")
Review Comment:
**Suggestion:** The restoration logic for MultiIndex pivots only uses the
last column level (`get_level_values(-1)`) and then writes 2-tuples (`(metric,
cat)`), which is incorrect when the pivot has more than one `columns`
dimension. For pivots like `columns=["a","b"]`, this creates
malformed/incomplete column keys (or can raise when tuple depth does not
match), so all-NaN metrics are not restored correctly and downstream operations
can still fail. Build restored keys using the full non-metric tuple for every
existing column combination (all levels after metric), not just the last level.
[logic error]
<details>
<summary><b>Severity Level:</b> Major ⚠️</summary>
```mdx
- ❌ Multi-column pivot post-processing can crash during pivot().
- ❌ All-NaN metrics not restored for multi-level pivots.
- ⚠️ Mixed-timeseries charts using multi-column pivots may fail.
- ⚠️ Rolling/rename post-processing fails for affected metrics.
```
</details>
<details>
<summary><b>Steps of Reproduction ✅ </b></summary>
```mdx
1. Configure a query with post-processing in
`superset/common/query_object.py:500-534` so
that `exec_post_processing()` includes a pivot operation with `operation:
"pivot"` and
`options` containing `columns=["a", "b"]`, `drop_missing_columns=True`
(default), and
multiple metrics (e.g. `"metric": {"operator": "mean"}` and `"metric2":
{"operator":
"mean"}`).
2. Ensure the database result feeding this query contains multiple grouping
columns `a`
and `b` and that, for at least one metric (e.g. `"metric"`), all values are
NULL/NaN for
every row; this matches the all-NaN metric scenario already covered for
single-column
pivots in
`tests/unit_tests/pandas_postprocessing/test_pivot.py:test_pivot_preserves_all_nan_metric_flat`
(lines 70-92), but extended here to `columns=["a", "b"]` as in
`test_pivot_eliminate_cartesian_product_columns` (lines 16-66).
3. When `exec_post_processing()` calls `pandas_postprocessing.pivot()`
(implemented in
`superset/utils/pandas_postprocessing/pivot.py:60-142`) with `columns=["a",
"b"]`,
`df.pivot_table(...)` (lines 122-131) produces a MultiIndex on `df.columns`
where level 0
is the metric name and subsequent levels correspond to each `columns` entry
(`a`, then
`b`). Because one metric is entirely NaN and
`dropna=drop_missing_columns=True`, pandas
drops that metric's columns completely.
4. `_restore_dropped_metric_columns()` (lines 31-57 in `pivot.py`) then runs
for this
MultiIndex DataFrame. In the MultiIndex branch (lines 41-52), it collects
only the last
column level via `df.columns.get_level_values(-1).unique()` (line 45) and,
for each
missing metric, assigns new columns using 2‑tuples `df[(metric, cat)] =
float("nan")`
(line 52). For a MultiIndex with more than two levels (metric + `a` + `b`),
this key has
the wrong depth: on current pandas versions this either (a) raises a pandas
error when
attempting to insert a column with a tuple shorter than
`df.columns.nlevels`, or (b)
creates malformed/incomplete column labels that don't match the original
`(metric, a, b)`
shape. In either case, the all-NaN metric is not properly restored for every
`(a, b)`
combination, so subsequent post-processing such as `rename` and `rolling`
(both decorated
with `validate_column_args` in
`superset/utils/pandas_postprocessing/utils.py:114-131`)
can still fail with `InvalidPostProcessingError("Referenced columns not
available in
DataFrame.")` when they reference that metric.
```
</details>
[Fix in
Cursor](https://app.codeant.ai/fix-in-ide?tool=cursor&prompt=This%20is%20a%20comment%20left%20during%20a%20code%20review.%0A%0A%2A%2APath%3A%2A%2A%20superset%2Futils%2Fpandas_postprocessing%2Fpivot.py%0A%2A%2ALine%3A%2A%2A%2045%3A52%0A%2A%2AComment%3A%2A%2A%0A%09%2ALogic%20Error%3A%20The%20restoration%20logic%20for%20MultiIndex%20pivots%20only%20uses%20the%20last%20column%20level%20%28%60get_level_values%28-1%29%60%29%20and%20then%20writes%202-tuples%20%28%60%28metric%2C%20cat%29%60%29%2C%20which%20is%20incorrect%20when%20the%20pivot%20has%20more%20than%20one%20%60columns%60%20dimension.%20For%20pivots%20like%20%60columns%3D%5B%22a%22%2C%22b%22%5D%60%2C%20this%20creates%20malformed%2Fincomplete%20column%20keys%20%28or%20can%20raise%20when%20tuple%20depth%20does%20not%20match%29%2C%20so%20all-NaN%20metrics%20are%20not%20restored%20correctly%20and%20downstream%20operations%20can%20still%20fail.%20Build%20restored%20keys%20using%20the%20full%20non-metric%20tuple%20for%20every
%20existing%20column%20combination%20%28all%20levels%20after%20metric%29%2C%20not%20just%20the%20last%20level.%0A%0AValidate%20the%20correctness%20of%20the%20flagged%20issue.%20If%20correct%2C%20How%20can%20I%20resolve%20this%3F%20If%20you%20propose%20a%20fix%2C%20implement%20it%20and%20please%20make%20it%20concise.%0AOnce%20fix%20is%20implemented%2C%20also%20check%20other%20comments%20on%20the%20same%20PR%2C%20and%20ask%20user%20if%20the%20user%20wants%20to%20fix%20the%20rest%20of%20the%20comments%20as%20well.%20if%20said%20yes%2C%20then%20fetch%20all%20the%20comments%20validate%20the%20correctness%20and%20implement%20a%20minimal%20fix%0A)
| [Fix in VSCode
Claude](https://app.codeant.ai/fix-in-ide?tool=vscode-claude&prompt=This%20is%20a%20comment%20left%20during%20a%20code%20review.%0A%0A%2A%2APath%3A%2A%2A%20superset%2Futils%2Fpandas_postprocessing%2Fpivot.py%0A%2A%2ALine%3A%2A%2A%2045%3A52%0A%2A%2AComment%3A%2A%2A%0A%09%2ALogic%20Error%3A%20The%20restoration%20logic%20for%20Multi
Index%20pivots%20only%20uses%20the%20last%20column%20level%20%28%60get_level_values%28-1%29%60%29%20and%20then%20writes%202-tuples%20%28%60%28metric%2C%20cat%29%60%29%2C%20which%20is%20incorrect%20when%20the%20pivot%20has%20more%20than%20one%20%60columns%60%20dimension.%20For%20pivots%20like%20%60columns%3D%5B%22a%22%2C%22b%22%5D%60%2C%20this%20creates%20malformed%2Fincomplete%20column%20keys%20%28or%20can%20raise%20when%20tuple%20depth%20does%20not%20match%29%2C%20so%20all-NaN%20metrics%20are%20not%20restored%20correctly%20and%20downstream%20operations%20can%20still%20fail.%20Build%20restored%20keys%20using%20the%20full%20non-metric%20tuple%20for%20every%20existing%20column%20combination%20%28all%20levels%20after%20metric%29%2C%20not%20just%20the%20last%20level.%0A%0AValidate%20the%20correctness%20of%20the%20flagged%20issue.%20If%20correct%2C%20How%20can%20I%20resolve%20this%3F%20If%20you%20propose%20a%20fix%2C%20implement%20it%20and%20please%20make%20it%20concise.%0AOnce%20fix%20i
s%20implemented%2C%20also%20check%20other%20comments%20on%20the%20same%20PR%2C%20and%20ask%20user%20if%20the%20user%20wants%20to%20fix%20the%20rest%20of%20the%20comments%20as%20well.%20if%20said%20yes%2C%20then%20fetch%20all%20the%20comments%20validate%20the%20correctness%20and%20implement%20a%20minimal%20fix%0A)
*(Use Cmd/Ctrl + Click for best experience)*
<details>
<summary><b>Prompt for AI Agent 🤖 </b></summary>
```mdx
This is a comment left during a code review.
**Path:** superset/utils/pandas_postprocessing/pivot.py
**Line:** 45:52
**Comment:**
*Logic Error: The restoration logic for MultiIndex pivots only uses the
last column level (`get_level_values(-1)`) and then writes 2-tuples (`(metric,
cat)`), which is incorrect when the pivot has more than one `columns`
dimension. For pivots like `columns=["a","b"]`, this creates
malformed/incomplete column keys (or can raise when tuple depth does not
match), so all-NaN metrics are not restored correctly and downstream operations
can still fail. Build restored keys using the full non-metric tuple for every
existing column combination (all levels after metric), not just the last level.
Validate the correctness of the flagged issue. If correct, How can I resolve
this? If you propose a fix, implement it and please make it concise.
Once fix is implemented, also check other comments on the same PR, and ask
user if the user wants to fix the rest of the comments as well. if said yes,
then fetch all the comments validate the correctness and implement a minimal fix
```
</details>
<a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40004&comment_hash=36f12915660f0454af59c262ef1b8c1abf812e601c4933ace680a80bc546d426&reaction=like'>👍</a>
| <a
href='https://app.codeant.ai/feedback?pr_url=https%3A%2F%2Fgithub.com%2Fapache%2Fsuperset%2Fpull%2F40004&comment_hash=36f12915660f0454af59c262ef1b8c1abf812e601c4933ace680a80bc546d426&reaction=dislike'>👎</a>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]