villebro commented on a change in pull request #15975: URL: https://github.com/apache/superset/pull/15975#discussion_r679848692
##########
File path: superset/utils/pandas_postprocessing.py
##########
@@ -275,6 +285,12 @@ def pivot(  # pylint: disable=too-many-arguments
         margins_name=marginal_distribution_name,
     )
+    if not drop_missing_columns and len(series_set) > 0 and not df.empty:
+        for col in df.columns:
+            series = "".join([str(_) for _ in col])

Review comment:
   Since we're reusing this `"".join`, could we make a lambda for it to avoid having the same logic twice?

##########
File path: superset/utils/pandas_postprocessing.py
##########
@@ -264,6 +264,16 @@ def pivot(  # pylint: disable=too-many-arguments
     # Remove once/if support is added.
     aggfunc = {na.column: na.aggfunc for na in aggregate_funcs.values()}
+    # When dropna = False, the pivot_table function will calculate cartesian-product
+    # for MultiIndex.
+    # https://github.com/apache/superset/issues/15956
+    # https://github.com/pandas-dev/pandas/issues/18030
+    series_set = set()
+    if not drop_missing_columns and columns:
+        for row in df[columns].itertuples():
+            metrics_and_series = list(aggfunc.keys()) + list(row[1:])
+            series_set.add("".join([str(_) for _ in metrics_and_series]))

Review comment:
   Let's put a character in the join to remove any risk of collisions. Something like `"_".join([str(_) for _ in metrics_and_series])` should be enough.

##########
File path: tests/integration_tests/pandas_postprocessing_tests.py
##########
@@ -256,6 +257,29 @@ def test_pivot_exceptions(self):
             aggregates={"idx_nulls": {}},
         )
+    def test_pivot_eliminate_cartesian_product_columns(self):
+        mock_df = DataFrame(
+            {
+                "dttm": to_datetime(["2019-01-01", "2019-01-01"]),
+                "a": [0, 1],
+                "b": [0, 1],
+                "metric": [9, np.NAN],
+            }
+        )
+
+        df = proc.pivot(
+            df=mock_df,
+            index=["dttm"],
+            columns=["a", "b"],
+            aggregates={"metric": {"operator": "mean"}},
+            drop_missing_columns=False,
+        )
+        print(df)
+        self.assertEqual(df.columns[1], "0, 0")
+        self.assertEqual(df.columns[2], "1, 1")

Review comment:
   Could we assert the full `columns` property here? Something like
   ```python
   self.assertListEqual(df.columns.tolist(), ["__timestamp", "0, 0", "1, 1"])
   ```

##########
File path: tests/integration_tests/pandas_postprocessing_tests.py
##########
@@ -256,6 +257,29 @@ def test_pivot_exceptions(self):
             aggregates={"idx_nulls": {}},
         )
+    def test_pivot_eliminate_cartesian_product_columns(self):
+        mock_df = DataFrame(
+            {
+                "dttm": to_datetime(["2019-01-01", "2019-01-01"]),
+                "a": [0, 1],
+                "b": [0, 1],
+                "metric": [9, np.NAN],
+            }
+        )
+
+        df = proc.pivot(
+            df=mock_df,
+            index=["dttm"],
+            columns=["a", "b"],
+            aggregates={"metric": {"operator": "mean"}},
+            drop_missing_columns=False,
+        )
+        print(df)

Review comment:
   Whoops
   ```suggestion
   ```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscr...@superset.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
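
To illustrate the first two review comments together, here is a minimal sketch of what a shared join helper with a separator could look like. The name `series_key` and its `sep` parameter are hypothetical and not part of the PR; the sketch only assumes that the key parts are the metric names followed by the column values, as in the diff above.

```python
from typing import Any, Iterable


def series_key(parts: Iterable[Any], sep: str = "_") -> str:
    """Hypothetical helper: stringify each part and join with a separator."""
    return sep.join(str(part) for part in parts)


# Without a separator, two different series collapse to the same key.
no_sep_a = series_key(["metric", 1, 11], sep="")  # "metric111"
no_sep_b = series_key(["metric", 11, 1], sep="")  # "metric111"
assert no_sep_a == no_sep_b

# With a separator, the same two series stay distinct.
with_sep_a = series_key(["metric", 1, 11])  # "metric_1_11"
with_sep_b = series_key(["metric", 11, 1])  # "metric_11_1"
assert with_sep_a != with_sep_b
```

Using one helper in both places keeps the two key computations from drifting apart. A separator narrows the collision risk but does not remove it entirely when the values themselves contain the separator character.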