[GitHub] [beam] TheNeuralBit commented on a change in pull request #17026: [BEAM-13948] Add unstack(), a non-deferred column operation on categorical columns

GitBox Fri, 25 Mar 2022 13:18:24 -0700


TheNeuralBit commented on a change in pull request #17026:
URL: https://github.com/apache/beam/pull/17026#discussion_r835577432




##########
File path: sdks/python/apache_beam/dataframe/frames.py
##########
@@ -957,6 +957,69 @@ def truncate(df):
             requires_partition_by=partitionings.Arbitrary(),
             preserves_partition_by=partitionings.Arbitrary()))
 
+  @frame_base.with_docs_from(pd.DataFrame)
+  @frame_base.args_to_kwargs(pd.DataFrame)
+  @frame_base.populate_defaults(pd.DataFrame)
+  def unstack(self, **kwargs):
+    level = kwargs.get('level', -1)
+
+    if self._expr.proxy().index.nlevels == 1:
+      if PD_VERSION < (1, 2):
+        raise frame_base.WontImplementError(
+          "pandas==1.1.5 has an indexing error bug," \
+          " causing an error with unstack()")

Review comment:
       nit: I'm not sure it's helpful to reference the bug, could we change it 
to something like this?
   
   ```suggestion
             "unstack() is not supported when using pandas < 1.2.0" 
             "Please upgrade to pandas 1.2.0 or higher to use this operation.")
   ```

##########
File path: sdks/python/apache_beam/dataframe/frames_test.py
##########
@@ -1295,6 +1310,118 @@ def s_times_shuffled(times, s):
     self._run_test(lambda s: s.pipe(s_times, 2), s)
     self._run_test(lambda s: s.pipe((s_times_shuffled, 's'), 2), s)
 
+  def test_unstack_pandas_series_not_multiindex(self):
+    # Pandas should throw a ValueError if performing unstack
+    # on a Series without MultiIndex
+    s = pd.Series([1, 2, 3, 4], index=['one', 'two', 'three', 'four'])
+    with self.assertRaises((AttributeError, ValueError)):
+      self._run_test(lambda s: s.unstack(), s)
+
+  def test_unstack_non_categorical_index(self):
+    index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'), ('two', 
'a'),
+                                       ('two', 'b')])
+    index = index.set_levels(
+        index.levels[0].astype(pd.CategoricalDtype(['one', 'two'])), level=0)
+    s = pd.Series(np.arange(1.0, 5.0), index=index)
+    with self.assertRaisesRegex(
+        frame_base.WontImplementError,
+        r"unstack\(\) is only supported on DataFrames if"):
+      self._run_test(lambda s: s.unstack(level=-1), s)
+
+  def _unstack_get_categorical_index(self):
+    index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'), ('two', 
'a'),
+                                       ('two', 'b')])
+    index = index.set_levels(
+        index.levels[0].astype(pd.CategoricalDtype(['one', 'two'])), level=0)
+    index = index.set_levels(
+        index.levels[1].astype(pd.CategoricalDtype(['a', 'b'])), level=1)
+    return index
+
+  def test_unstack_pandas_example1(self):
+    index = self._unstack_get_categorical_index()
+    s = pd.Series(np.arange(1.0, 5.0), index=index)
+    self._run_test(lambda s: s.unstack(level=-1), s, check_dtypes=False)
+
+  def test_unstack_pandas_example2(self):
+    index = self._unstack_get_categorical_index()
+    s = pd.Series(np.arange(1.0, 5.0), index=index)
+    self._run_test(lambda s: s.unstack(level=0), s, check_dtypes=False)
+
+  def test_unstack_pandas_example3(self):
+    index = self._unstack_get_categorical_index()
+    s = pd.Series(np.arange(1.0, 5.0), index=index)
+    df = s.unstack(level=0)
+    if PD_VERSION < (1, 2):
+      with self.assertRaisesRegex(frame_base.WontImplementError,
+                                  r"pandas==1.1.5 has an indexing error bug"):
+        self._run_test(lambda df: df.unstack(), df, check_dtypes=False)
+    else:
+      self._run_test(lambda df: df.unstack(), df, check_dtypes=False)
+
+  @unittest.skipIf(
+      PD_VERSION < (1, 4),
+      "pandas=1.4 fixes error when concat() of boolean types results in 
object")

Review comment:
       How does this issue manifest for users with pandas < 1.4.0?

##########
File path: sdks/python/apache_beam/dataframe/frames_test.py
##########
@@ -131,7 +137,12 @@ def _run_test(
             This option should NOT be set to False in tests added for new
             operations if at all possible. Instead make sure the new operation
             produces the correct proxy. This flag only exists as an escape 
hatch
-            until existing failures can be addressed (BEAM-12379)."""
+            until existing failures can be addressed (BEAM-12379).
+        check_dtypes (bool): Whether or not to check the dtypes of the values.
+            The default is True. This option should be set to False for
+            non-deferred operations, where the dtype of the values in the proxy
+            are not known ahead of time.

Review comment:
       I think this is actually a little bit dangerous as documented. We don't 
want to support operations where the dtype of the values can't be known at 
construction time. It's important that we have an accurate proxy so we know 
what operations are valid on the output.
   
   That being said there are cases (like in this operations) where int64 
columns get coerced to float64 because of NaNs. The _best_ solution here  would 
be to have something like a "`lenient_dtype_check`" option, which just checks 
that numeric columns are still numeric between actual and proxy.
   
   I know that might be a bit more involved than what you want to get into for 
this PR though - if that's too much could you just update the docstring to say 
this is for operations with int64 to float64 coercion issues, and maybe a TODO 
to add `lenient_dtype_check`?

##########
File path: sdks/python/apache_beam/dataframe/frames_test.py
##########
@@ -1295,6 +1310,118 @@ def s_times_shuffled(times, s):
     self._run_test(lambda s: s.pipe(s_times, 2), s)
     self._run_test(lambda s: s.pipe((s_times_shuffled, 's'), 2), s)
 
+  def test_unstack_pandas_series_not_multiindex(self):
+    # Pandas should throw a ValueError if performing unstack
+    # on a Series without MultiIndex
+    s = pd.Series([1, 2, 3, 4], index=['one', 'two', 'three', 'four'])
+    with self.assertRaises((AttributeError, ValueError)):
+      self._run_test(lambda s: s.unstack(), s)
+
+  def test_unstack_non_categorical_index(self):
+    index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'), ('two', 
'a'),
+                                       ('two', 'b')])
+    index = index.set_levels(
+        index.levels[0].astype(pd.CategoricalDtype(['one', 'two'])), level=0)
+    s = pd.Series(np.arange(1.0, 5.0), index=index)
+    with self.assertRaisesRegex(
+        frame_base.WontImplementError,
+        r"unstack\(\) is only supported on DataFrames if"):
+      self._run_test(lambda s: s.unstack(level=-1), s)
+
+  def _unstack_get_categorical_index(self):
+    index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'), ('two', 
'a'),
+                                       ('two', 'b')])
+    index = index.set_levels(
+        index.levels[0].astype(pd.CategoricalDtype(['one', 'two'])), level=0)
+    index = index.set_levels(
+        index.levels[1].astype(pd.CategoricalDtype(['a', 'b'])), level=1)
+    return index
+
+  def test_unstack_pandas_example1(self):
+    index = self._unstack_get_categorical_index()
+    s = pd.Series(np.arange(1.0, 5.0), index=index)
+    self._run_test(lambda s: s.unstack(level=-1), s, check_dtypes=False)
+
+  def test_unstack_pandas_example2(self):
+    index = self._unstack_get_categorical_index()
+    s = pd.Series(np.arange(1.0, 5.0), index=index)
+    self._run_test(lambda s: s.unstack(level=0), s, check_dtypes=False)
+
+  def test_unstack_pandas_example3(self):
+    index = self._unstack_get_categorical_index()
+    s = pd.Series(np.arange(1.0, 5.0), index=index)
+    df = s.unstack(level=0)
+    if PD_VERSION < (1, 2):
+      with self.assertRaisesRegex(frame_base.WontImplementError,
+                                  r"pandas==1.1.5 has an indexing error bug"):
+        self._run_test(lambda df: df.unstack(), df, check_dtypes=False)

Review comment:
       nit: I would make this a separate test that's skipped unless `PD_VERSION 
< (1,2)`,  and make this test skipped unless `PD_VERSION >= (1,2)` 
   
   I don't feel that strongly about it if you'd rather leave it this way though.

##########
File path: sdks/python/apache_beam/dataframe/frames_test.py
##########
@@ -1295,6 +1310,118 @@ def s_times_shuffled(times, s):
     self._run_test(lambda s: s.pipe(s_times, 2), s)
     self._run_test(lambda s: s.pipe((s_times_shuffled, 's'), 2), s)
 
+  def test_unstack_pandas_series_not_multiindex(self):
+    # Pandas should throw a ValueError if performing unstack
+    # on a Series without MultiIndex
+    s = pd.Series([1, 2, 3, 4], index=['one', 'two', 'three', 'four'])
+    with self.assertRaises((AttributeError, ValueError)):
+      self._run_test(lambda s: s.unstack(), s)
+
+  def test_unstack_non_categorical_index(self):
+    index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'), ('two', 
'a'),
+                                       ('two', 'b')])
+    index = index.set_levels(
+        index.levels[0].astype(pd.CategoricalDtype(['one', 'two'])), level=0)
+    s = pd.Series(np.arange(1.0, 5.0), index=index)
+    with self.assertRaisesRegex(
+        frame_base.WontImplementError,
+        r"unstack\(\) is only supported on DataFrames if"):
+      self._run_test(lambda s: s.unstack(level=-1), s)
+
+  def _unstack_get_categorical_index(self):
+    index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'), ('two', 
'a'),
+                                       ('two', 'b')])
+    index = index.set_levels(
+        index.levels[0].astype(pd.CategoricalDtype(['one', 'two'])), level=0)
+    index = index.set_levels(
+        index.levels[1].astype(pd.CategoricalDtype(['a', 'b'])), level=1)
+    return index
+
+  def test_unstack_pandas_example1(self):
+    index = self._unstack_get_categorical_index()
+    s = pd.Series(np.arange(1.0, 5.0), index=index)
+    self._run_test(lambda s: s.unstack(level=-1), s, check_dtypes=False)
+
+  def test_unstack_pandas_example2(self):
+    index = self._unstack_get_categorical_index()
+    s = pd.Series(np.arange(1.0, 5.0), index=index)
+    self._run_test(lambda s: s.unstack(level=0), s, check_dtypes=False)
+
+  def test_unstack_pandas_example3(self):
+    index = self._unstack_get_categorical_index()
+    s = pd.Series(np.arange(1.0, 5.0), index=index)
+    df = s.unstack(level=0)
+    if PD_VERSION < (1, 2):
+      with self.assertRaisesRegex(frame_base.WontImplementError,
+                                  r"pandas==1.1.5 has an indexing error bug"):
+        self._run_test(lambda df: df.unstack(), df, check_dtypes=False)
+    else:
+      self._run_test(lambda df: df.unstack(), df, check_dtypes=False)
+
+  @unittest.skipIf(
+      PD_VERSION < (1, 4),
+      "pandas=1.4 fixes error when concat() of boolean types results in 
object")
+  def test_unstack_bool(self):
+    index = pd.MultiIndex.from_tuples([(True, 'a'), (True, 'b'), (False, 'a'),
+                                       (False, 'b')])
+    index = index.set_levels(index.levels[0].astype('boolean'), level=0)
+    index = index.set_levels(
+        index.levels[1].astype(pd.CategoricalDtype(['a', 'b'])), level=1)
+    s = pd.Series(np.arange(1.0, 5.0), index=index)
+    self._run_test(lambda s: s.unstack(level=0), s, check_dtypes=False)
+
+  def test_unstack_series_multiple_index_levels(self):
+    tuples = list(
+        zip(
+            *[
+                ["bar", "bar", "bar", "bar", "baz", "baz", "baz", "baz"],
+                ["one", "one", "two", "two", "one", "one", "two", "two"],
+                ["A", "B", "A", "B", "A", "B", "A", "B"],
+            ]))
+    index = pd.MultiIndex.from_tuples(
+        tuples, names=["first", "second", "third"])
+    index = index.set_levels(
+        index.levels[0].astype(pd.CategoricalDtype(['bar', 'baz'])), level=0)
+    index = index.set_levels(
+        index.levels[1].astype(pd.CategoricalDtype(['one', 'two'])), level=1)
+    index = index.set_levels(
+        index.levels[2].astype(pd.CategoricalDtype(['A', 'B'])), level=2)
+    df = pd.Series(np.random.randn(8), index=index)
+    self._run_test(
+        lambda df: df.unstack(level=['first', 'third']), df, 
check_dtypes=False)
+    # self.assert_frame_data_equivalent(
+    # result, df.unstack(level=['first', 'third']))

Review comment:
       ```suggestion
   ```

##########
File path: sdks/python/apache_beam/dataframe/frames_test.py
##########
@@ -1295,6 +1310,118 @@ def s_times_shuffled(times, s):
     self._run_test(lambda s: s.pipe(s_times, 2), s)
     self._run_test(lambda s: s.pipe((s_times_shuffled, 's'), 2), s)
 
+  def test_unstack_pandas_series_not_multiindex(self):
+    # Pandas should throw a ValueError if performing unstack
+    # on a Series without MultiIndex
+    s = pd.Series([1, 2, 3, 4], index=['one', 'two', 'three', 'four'])
+    with self.assertRaises((AttributeError, ValueError)):
+      self._run_test(lambda s: s.unstack(), s)
+
+  def test_unstack_non_categorical_index(self):
+    index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'), ('two', 
'a'),
+                                       ('two', 'b')])
+    index = index.set_levels(
+        index.levels[0].astype(pd.CategoricalDtype(['one', 'two'])), level=0)
+    s = pd.Series(np.arange(1.0, 5.0), index=index)
+    with self.assertRaisesRegex(
+        frame_base.WontImplementError,
+        r"unstack\(\) is only supported on DataFrames if"):
+      self._run_test(lambda s: s.unstack(level=-1), s)
+
+  def _unstack_get_categorical_index(self):
+    index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'), ('two', 
'a'),
+                                       ('two', 'b')])
+    index = index.set_levels(
+        index.levels[0].astype(pd.CategoricalDtype(['one', 'two'])), level=0)
+    index = index.set_levels(
+        index.levels[1].astype(pd.CategoricalDtype(['a', 'b'])), level=1)
+    return index
+
+  def test_unstack_pandas_example1(self):
+    index = self._unstack_get_categorical_index()
+    s = pd.Series(np.arange(1.0, 5.0), index=index)
+    self._run_test(lambda s: s.unstack(level=-1), s, check_dtypes=False)
+
+  def test_unstack_pandas_example2(self):
+    index = self._unstack_get_categorical_index()
+    s = pd.Series(np.arange(1.0, 5.0), index=index)
+    self._run_test(lambda s: s.unstack(level=0), s, check_dtypes=False)
+
+  def test_unstack_pandas_example3(self):
+    index = self._unstack_get_categorical_index()
+    s = pd.Series(np.arange(1.0, 5.0), index=index)
+    df = s.unstack(level=0)
+    if PD_VERSION < (1, 2):
+      with self.assertRaisesRegex(frame_base.WontImplementError,
+                                  r"pandas==1.1.5 has an indexing error bug"):
+        self._run_test(lambda df: df.unstack(), df, check_dtypes=False)
+    else:
+      self._run_test(lambda df: df.unstack(), df, check_dtypes=False)
+
+  @unittest.skipIf(
+      PD_VERSION < (1, 4),
+      "pandas=1.4 fixes error when concat() of boolean types results in 
object")
+  def test_unstack_bool(self):
+    index = pd.MultiIndex.from_tuples([(True, 'a'), (True, 'b'), (False, 'a'),
+                                       (False, 'b')])
+    index = index.set_levels(index.levels[0].astype('boolean'), level=0)
+    index = index.set_levels(
+        index.levels[1].astype(pd.CategoricalDtype(['a', 'b'])), level=1)
+    s = pd.Series(np.arange(1.0, 5.0), index=index)
+    self._run_test(lambda s: s.unstack(level=0), s, check_dtypes=False)
+
+  def test_unstack_series_multiple_index_levels(self):
+    tuples = list(
+        zip(
+            *[
+                ["bar", "bar", "bar", "bar", "baz", "baz", "baz", "baz"],
+                ["one", "one", "two", "two", "one", "one", "two", "two"],
+                ["A", "B", "A", "B", "A", "B", "A", "B"],
+            ]))
+    index = pd.MultiIndex.from_tuples(
+        tuples, names=["first", "second", "third"])
+    index = index.set_levels(
+        index.levels[0].astype(pd.CategoricalDtype(['bar', 'baz'])), level=0)
+    index = index.set_levels(
+        index.levels[1].astype(pd.CategoricalDtype(['one', 'two'])), level=1)
+    index = index.set_levels(
+        index.levels[2].astype(pd.CategoricalDtype(['A', 'B'])), level=2)
+    df = pd.Series(np.random.randn(8), index=index)
+    self._run_test(
+        lambda df: df.unstack(level=['first', 'third']), df, 
check_dtypes=False)
+    # self.assert_frame_data_equivalent(
+    # result, df.unstack(level=['first', 'third']))
+
+  def test_unstack_series_multiple_index_and_column_levels(self):
+    columns = pd.MultiIndex.from_tuples(
+        [
+            ("A", "cat", "long"),
+            ("B", "cat", "long"),
+            ("A", "dog", "short"),
+            ("B", "dog", "short"),
+        ],
+        names=["exp", "animal", "hair_length"],
+    )
+    index = pd.MultiIndex.from_product(
+        [['one', 'two'], ['a', 'b'], ['bar', 'baz']],
+        names=["first", "second", "third"])
+    index = index.set_levels(
+        index.levels[0].astype(pd.CategoricalDtype(['one', 'two'])), level=0)
+    index = index.set_levels(
+        index.levels[1].astype(pd.CategoricalDtype(['a', 'b'])), level=1)
+    index = index.set_levels(
+        index.levels[2].astype(pd.CategoricalDtype(['bar', 'baz'])), level=2)
+    df = pd.DataFrame(np.random.randn(8, 4), index=index, columns=columns)
+    df = df.stack(level=["animal", "hair_length"])
+    self._run_test(
+        lambda df: df.unstack(level=['second', 'third']),
+        df,
+        check_dtypes=False)
+    self._run_test(
+        lambda df: df.unstack(level=['second']), df, check_dtypes=False)
+    # self.assert_frame_data_equivalent(
+    # result, df.unstack(level=['second', 'third']))

Review comment:
       ```suggestion
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [beam] TheNeuralBit commented on a change in pull request #17026: [BEAM-13948] Add unstack(), a non-deferred column operation on categorical columns

Reply via email to