[GitHub] [beam] TheNeuralBit commented on a change in pull request #17026: [BEAM-13948] Add unstack(), a non-deferred column operation on categorical columns

GitBox Mon, 14 Mar 2022 15:29:14 -0700


TheNeuralBit commented on a change in pull request #17026:
URL: https://github.com/apache/beam/pull/17026#discussion_r826428967




##########
File path: sdks/python/apache_beam/dataframe/frames_test.py
##########
@@ -2468,6 +2468,113 @@ def test_split_pat_is_regex(self):
     self.assert_frame_data_equivalent(
         result, s.str.split(r"\.jpg", regex=True, expand=False))
 
+  def test_unstack_pandas_series_not_multiindex(self):
+    # Pandas should throw a ValueError if performing unstack
+    # on a Series without MultiIndex
+    s = pd.Series([1, 2, 3, 4], index=['one', 'two', 'three', 'four'])
+    with self.assertRaises((AttributeError, ValueError)):
+      self._evaluate(lambda s: s.unstack(), s)
+
+  def test_unstack_non_categorical_index(self):
+    index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'), ('two', 'a'),
+                                       ('two', 'b')])
+    index = index.set_levels(
+        index.levels[0].astype(pd.CategoricalDtype(['one', 'two'])), level=0)
+    s = pd.Series(np.arange(1.0, 5.0), index=index)
+    with self.assertRaisesRegex(
+        frame_base.WontImplementError,
+        r"unstack\(\) is only supported on DataFrames if"):
+      self._evaluate(lambda s: s.unstack(level=-1), s)
+
+  def _unstack_get_categorical_index(self):
+    index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'), ('two', 'a'),
+                                       ('two', 'b')])
+    index = index.set_levels(
+        index.levels[0].astype(pd.CategoricalDtype(['one', 'two'])), level=0)
+    index = index.set_levels(
+        index.levels[1].astype(pd.CategoricalDtype(['a', 'b'])), level=1)
+    return index
+
+  def test_unstack_pandas_example1(self):
+    index = self._unstack_get_categorical_index()
+    s = pd.Series(np.arange(1.0, 5.0), index=index)
+    result = self._evaluate(lambda s: s.unstack(level=-1), s)
+    self.assert_frame_data_equivalent(result, s.unstack(level=-1))
+
+  def test_unstack_pandas_example2(self):
+    index = self._unstack_get_categorical_index()
+    s = pd.Series(np.arange(1.0, 5.0), index=index)
+    result = self._evaluate(lambda s: s.unstack(level=0), s)
+    self.assert_frame_data_equivalent(result, s.unstack(level=0))
+
+  @unittest.skipIf(

Review comment:
   How does it fail in Python 3.6: is it an error at pipeline construction time, or at execution time?
   
   If it's an execution-time error, it would be nice to fail faster by catching this case at construction time and raising an error there.
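
   To illustrate the fail-fast idea, here is a minimal sketch. It assumes the check can be made from schema information alone, represented here by an empty pandas Series that carries only the index dtypes. `check_unstack_supported` and `ConstructionTimeError` are hypothetical names used for illustration, not Beam APIs, and the categorical check is only a stand-in for whatever condition actually fails on Python 3.6.

```python
import pandas as pd


class ConstructionTimeError(NotImplementedError):
    """Hypothetical error type; stands in for an error raised at construction."""


def check_unstack_supported(proxy, level=-1):
    """Validate an unstack() call using only the proxy's index metadata.

    `proxy` is assumed to be an empty Series with the same index dtypes
    as the real data, so nothing here touches actual elements.
    """
    index = proxy.index
    if not isinstance(index, pd.MultiIndex):
        raise ValueError("index must be a MultiIndex to unstack")
    level_dtype = index.get_level_values(level).dtype
    if not isinstance(level_dtype, pd.CategoricalDtype):
        raise ConstructionTimeError(
            "cannot unstack level %r: dtype %s is not categorical" %
            (level, level_dtype))


categorical_index = pd.MultiIndex.from_arrays([
    pd.Categorical(['one', 'one', 'two', 'two']),
    pd.Categorical(['a', 'b', 'a', 'b']),
])
plain_index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'),
                                         ('two', 'a'), ('two', 'b')])

# Empty slices keep the index structure and dtypes but hold no rows.
categorical_proxy = pd.Series(
    [1.0, 2.0, 3.0, 4.0], index=categorical_index).iloc[:0]
plain_proxy = pd.Series([1.0, 2.0, 3.0, 4.0], index=plain_index).iloc[:0]

check_unstack_supported(categorical_proxy)  # passes silently
```

   Calling `check_unstack_supported(plain_proxy)` raises before any data is processed, which is the behavior asked about above.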

##########
File path: sdks/python/apache_beam/dataframe/frames_test.py
##########
@@ -2468,6 +2468,112 @@ def test_split_pat_is_regex(self):
     self.assert_frame_data_equivalent(
         result, s.str.split(r"\.jpg", regex=True, expand=False))
 
+  def test_unstack_pandas_series_not_multiindex(self):
+    # Pandas should throw a ValueError if performing unstack
+    # on a Series without MultiIndex
+    s = pd.Series([1, 2, 3, 4], index=['one', 'two', 'three', 'four'])
+    with self.assertRaisesRegex(ValueError,
+                                r"index must be a MultiIndex to unstack"):
+      self._evaluate(lambda s: s.unstack(), s)
+
+  def test_unstack_non_categorical_index(self):
+    index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'), ('two', 'a'),
+                                       ('two', 'b')])
+    index = index.set_levels(
+        index.levels[0].astype(pd.CategoricalDtype(['one', 'two'])), level=0)
+    s = pd.Series(np.arange(1.0, 5.0), index=index)
+    with self.assertRaisesRegex(
+        frame_base.WontImplementError,
+        r"unstack\(\) is only supported on DataFrames if"):
+      self._evaluate(lambda s: s.unstack(level=-1), s)
+
+  def _unstack_get_categorical_index(self):
+    index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'), ('two', 'a'),
+                                       ('two', 'b')])
+    index = index.set_levels(
+        index.levels[0].astype(pd.CategoricalDtype(['one', 'two'])), level=0)
+    index = index.set_levels(
+        index.levels[1].astype(pd.CategoricalDtype(['a', 'b'])), level=1)
+    return index
+
+  def test_unstack_pandas_example1(self):
+    index = self._unstack_get_categorical_index()
+    s = pd.Series(np.arange(1.0, 5.0), index=index)
+    result = self._evaluate(lambda s: s.unstack(level=-1), s)
+    self.assert_frame_data_equivalent(result, s.unstack(level=-1))
+
+  def test_unstack_pandas_example2(self):
+    index = self._unstack_get_categorical_index()
+    s = pd.Series(np.arange(1.0, 5.0), index=index)
+    result = self._evaluate(lambda s: s.unstack(level=0), s)
+    self.assert_frame_data_equivalent(result, s.unstack(level=0))
+
+  def test_unstack_pandas_example3(self):
+    index = self._unstack_get_categorical_index()
+    s = pd.Series(np.arange(1.0, 5.0), index=index)
+    result = self._evaluate(lambda s: s.unstack(level=0).unstack(), s)
+    self.assert_frame_data_equivalent(result, s.unstack(level=0).unstack())
+
+  @unittest.skipIf(

Review comment:
   Same question here.
   
   Also, if this is just an issue with concatenating boolean-typed indexes, it seems this would be a problem any time such an index is used in the DataFrame API (not just for unstack), since we rely heavily on concatenating DataFrames.
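
   One way to probe whether the problem is independent of unstack is to concatenate partitions of a boolean-indexed frame directly in pandas. This is only a sketch: `concat_roundtrip_ok` is a hypothetical helper, not part of Beam's test harness, and the result may differ across Python and pandas versions, which is the point of running it.

```python
import pandas as pd


def concat_roundtrip_ok(df, split_at):
    """Split df into two row partitions, concat them, and compare indexes."""
    parts = [df.iloc[:split_at], df.iloc[split_at:]]
    rebuilt = pd.concat(parts)
    same_values = bool(rebuilt.index.equals(df.index))
    same_dtype = rebuilt.index.dtype == df.index.dtype
    return same_values and same_dtype


bool_indexed = pd.DataFrame({'x': [1, 2, 3, 4]},
                            index=pd.Index([True, False, True, False]))
int_indexed = pd.DataFrame({'x': [1, 2, 3, 4]})

print('boolean index survives concat:', concat_roundtrip_ok(bool_indexed, 2))
print('integer index survives concat:', concat_roundtrip_ok(int_indexed, 2))
```

   If the boolean case reports a mismatch on the affected Python version while the integer case does not, the issue lies in concatenation rather than in unstack itself.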




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
