[GitHub] [superset] villebro commented on a change in pull request #15975: fix: eliminate cartesian product columns in pivot operator

GitBox Fri, 30 Jul 2021 04:23:36 -0700


villebro commented on a change in pull request #15975:
URL: https://github.com/apache/superset/pull/15975#discussion_r679848692




##########
File path: superset/utils/pandas_postprocessing.py
##########
@@ -275,6 +285,12 @@ def pivot(  # pylint: disable=too-many-arguments
         margins_name=marginal_distribution_name,
     )
 
+    if not drop_missing_columns and len(series_set) > 0 and not df.empty:
+        for col in df.columns:
+            series = "".join([str(_) for _ in col])

Review comment:
       Since this `"".join` logic is used in two places, could we extract it into a lambda so the same logic isn't written twice?
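
A minimal sketch of what that could look like, using a small named function rather than a lambda. The name `flatten_key` and the sample tuples are hypothetical and are not taken from the PR; the two loops only stand in for the two call sites in `pivot`.

```python
from typing import Any, Iterable


def flatten_key(parts: Iterable[Any]) -> str:
    """Concatenate the string form of each part, mirroring the PR's `"".join`."""
    return "".join(str(part) for part in parts)


# Hypothetical stand-in for the first call site: series actually present in
# the source rows (metric name followed by the values of the pivot columns).
observed_rows = [("metric", 0, 0), ("metric", 1, 1)]
series_set = {flatten_key(row) for row in observed_rows}

# Hypothetical stand-in for the second call site: column tuples of the
# pivoted frame, including the cartesian-product extras.
pivoted_columns = [
    ("metric", 0, 0),
    ("metric", 0, 1),
    ("metric", 1, 0),
    ("metric", 1, 1),
]
kept_columns = [col for col in pivoted_columns if flatten_key(col) in series_set]
```

With one helper, any later change to how keys are built (for example adding a separator) only has to be made in one place.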

##########
File path: superset/utils/pandas_postprocessing.py
##########
@@ -264,6 +264,16 @@ def pivot(  # pylint: disable=too-many-arguments
     #  Remove once/if support is added.
     aggfunc = {na.column: na.aggfunc for na in aggregate_funcs.values()}
 
+    # When dropna = False, the pivot_table function will calculate cartesian-product
+    # for MultiIndex.
+    # https://github.com/apache/superset/issues/15956
+    # https://github.com/pandas-dev/pandas/issues/18030
+    series_set = set()
+    if not drop_missing_columns and columns:
+        for row in df[columns].itertuples():
+            metrics_and_series = list(aggfunc.keys()) + list(row[1:])
+            series_set.add("".join([str(_) for _ in metrics_and_series]))

Review comment:
       Let's put a separator character in the join to reduce the risk of collisions. 
Something like `"_".join([str(_) for _ in metrics_and_series])` should be 
enough.
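
To illustrate the collision being described, here is a small sketch with made-up values (the helper `join_key` and all sample tuples are hypothetical). It also shows, as an aside, that a separator narrows the risk but cannot rule it out if a value may itself contain the separator; comparing tuples is one alternative that sidesteps the issue.

```python
def join_key(parts, sep):
    """Join the string form of each part with the given separator."""
    return sep.join(str(part) for part in parts)


first = ("a", "bc")
second = ("ab", "c")

# Without a separator, both tuples flatten to "abc".
collides_without_sep = join_key(first, "") == join_key(second, "")

# With "_" they become "a_bc" and "ab_c", which are distinct.
collides_with_sep = join_key(first, "_") == join_key(second, "_")

# Values that already contain the separator can still collide:
# both of these flatten to "a__b".
tricky_first = ("a_", "b")
tricky_second = ("a", "_b")
still_collides = join_key(tricky_first, "_") == join_key(tricky_second, "_")

# Keeping the parts as a tuple avoids flattening altogether.
tuple_keys_differ = tuple(map(str, tricky_first)) != tuple(map(str, tricky_second))
```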

##########
File path: tests/integration_tests/pandas_postprocessing_tests.py
##########
@@ -256,6 +257,29 @@ def test_pivot_exceptions(self):
             aggregates={"idx_nulls": {}},
         )
 
+    def test_pivot_eliminate_cartesian_product_columns(self):
+        mock_df = DataFrame(
+            {
+                "dttm": to_datetime(["2019-01-01", "2019-01-01"]),
+                "a": [0, 1],
+                "b": [0, 1],
+                "metric": [9, np.NAN],
+            }
+        )
+
+        df = proc.pivot(
+            df=mock_df,
+            index=["dttm"],
+            columns=["a", "b"],
+            aggregates={"metric": {"operator": "mean"}},
+            drop_missing_columns=False,
+        )
+        print(df)
+        self.assertEqual(df.columns[1], "0, 0")
+        self.assertEqual(df.columns[2], "1, 1")

Review comment:
       Could we assert the full `columns` property here? Something like
   ```python
   self.assertListEqual(df.columns.tolist(), ["__timestamp", "0, 0", "1, 1"])
   ```
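
A short sketch of why the full comparison is stricter, using a hypothetical column list rather than the real output of `proc.pivot`: index-based spot checks can pass even when an unexpected extra column is present.

```python
# Hypothetical result with one unexpected trailing column.
columns = ["__timestamp", "0, 0", "1, 1", "1, 0"]
expected = ["__timestamp", "0, 0", "1, 1"]

# Spot checks on individual positions, as in the test under review, still pass.
spot_checks_pass = columns[1] == "0, 0" and columns[2] == "1, 1"

# Comparing the whole list catches the extra column.
full_check_passes = columns == expected
```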

##########
File path: tests/integration_tests/pandas_postprocessing_tests.py
##########
@@ -256,6 +257,29 @@ def test_pivot_exceptions(self):
             aggregates={"idx_nulls": {}},
         )
 
+    def test_pivot_eliminate_cartesian_product_columns(self):
+        mock_df = DataFrame(
+            {
+                "dttm": to_datetime(["2019-01-01", "2019-01-01"]),
+                "a": [0, 1],
+                "b": [0, 1],
+                "metric": [9, np.NAN],
+            }
+        )
+
+        df = proc.pivot(
+            df=mock_df,
+            index=["dttm"],
+            columns=["a", "b"],
+            aggregates={"metric": {"operator": "mean"}},
+            drop_missing_columns=False,
+        )
+        print(df)

Review comment:
       Whoops, this leftover debug `print` should be removed:
   ```suggestion
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscr...@superset.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


