TheNeuralBit commented on a change in pull request #13853:
URL: https://github.com/apache/beam/pull/13853#discussion_r572202678
##########
File path: sdks/python/apache_beam/dataframe/frames_test.py
##########
@@ -182,14 +194,53 @@ def test_groupby_project(self):
     self._run_test(lambda df: df.groupby('group')['foo'].median(), df)
     self._run_test(lambda df: df.groupby('group')['baz'].median(), df)
     self._run_test(lambda df: df.groupby('group')[['bar', 'baz']].median(), df)
+
+  def test_groupby_errors(self):
+    df = pd.DataFrame({
+        'group': ['a' if i % 5 == 0 or i % 3 == 0 else 'b' for i in range(100)],
+        'foo': [None if i % 11 == 0 else i for i in range(100)],
+        'bar': [None if i % 7 == 0 else 99 - i for i in range(100)],
+        'baz': [None if i % 13 == 0 else i * 2 for i in range(100)],
+    })
+
+    # non-existent projection column
     self._run_test(
         lambda df: df.groupby('group')[['bar', 'baz']].bar.median(),
         df,
         expect_error=True)
     self._run_test(
-        lambda df: df.groupby('group')[['bat']].median(), df,
-        expect_error=True)
+        lambda df: df.groupby('group')[['bad']].median(), df,
+        expect_error=True)
+
+    self._run_test(
+        lambda df: df.groupby('group').bad.median(), df, expect_error=True)
+
+    # non-existent grouping label
+    self._run_test(
+        lambda df: df.groupby(['really_bad', 'foo', 'bad']).foo.sum(),
+        df,
+        expect_error=True)
+    self._run_test(
+        lambda df: df.groupby('bad').foo.sum(), df, expect_error=True)
+
+  def test_set_index(self):
+    df = pd.DataFrame({
+        # Generate some unique columns to use for indexes
Review comment:
Thanks, these are good questions :)
Partitioning is done by hashing the index:
https://github.com/apache/beam/blob/befcc3d780d561e81f23512742862a65c0ae3b69/sdks/python/apache_beam/dataframe/partitionings.py#L107-L116
So depending on how the
[hashing](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.util.hash_array.html)
happens to distribute the index values, we could get lucky or unlucky with the
partitioning. A larger input doesn't eliminate the risk, it only reduces it.
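As a rough illustration (plain pandas, not the actual Beam implementation; the
`partition_by_index` helper and the partition count are made up here),
index-hash partitioning looks something like:

```python
import pandas as pd

def partition_by_index(df, num_partitions, hash_key='0123456789123456'):
  # Hash each index value (the default 16-byte pandas hash key is shown
  # explicitly) and bucket each row by its hash modulo the partition count.
  hashes = pd.util.hash_array(df.index.to_numpy(), hash_key=hash_key)
  return {i: df[hashes % num_partitions == i] for i in range(num_partitions)}
```

Rows with equal index values always land in the same partition, which is what
per-key operations like joins rely on.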
Another (better?) guard might be to run the tests multiple times with
different hash keys (or a single randomized hash key, but that would introduce
flakiness). We do need to be careful that the hash key stays consistent within
an execution, though, since we sometimes rely on the partitioning for joins.
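A hedged sketch of that multi-key guard (the `HASH_KEYS` list and helper below
are hypothetical, not part of this PR):

```python
import pandas as pd

# Hypothetical fixed 16-byte hash keys: re-running under several keys makes
# it unlikely that every one of them partitions the test data "luckily".
HASH_KEYS = ['0123456789123456', 'abcdefghijklmnop']

def run_with_each_hash_key(test_fn, df, num_partitions=3):
  for hash_key in HASH_KEYS:
    # The key stays fixed for the whole simulated execution, mirroring the
    # consistency requirement for joins noted above.
    hashes = pd.util.hash_array(df.index.to_numpy(), hash_key=hash_key)
    test_fn([df[hashes % num_partitions == i] for i in range(num_partitions)])
```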
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]