TheNeuralBit commented on a change in pull request #13853:
URL: https://github.com/apache/beam/pull/13853#discussion_r572202678



##########
File path: sdks/python/apache_beam/dataframe/frames_test.py
##########
@@ -182,14 +194,53 @@ def test_groupby_project(self):
     self._run_test(lambda df: df.groupby('group')['foo'].median(), df)
     self._run_test(lambda df: df.groupby('group')['baz'].median(), df)
     self._run_test(lambda df: df.groupby('group')[['bar', 'baz']].median(), df)
+
+  def test_groupby_errors(self):
+    df = pd.DataFrame({
+        'group': ['a' if i % 5 == 0 or i % 3 == 0 else 'b' for i in range(100)],
+        'foo': [None if i % 11 == 0 else i for i in range(100)],
+        'bar': [None if i % 7 == 0 else 99 - i for i in range(100)],
+        'baz': [None if i % 13 == 0 else i * 2 for i in range(100)],
+    })
+
+    # non-existent projection column
     self._run_test(
         lambda df: df.groupby('group')[['bar', 'baz']].bar.median(),
         df,
         expect_error=True)
     self._run_test(
-        lambda df: df.groupby('group')[['bat']].median(), df, expect_error=True)
+        lambda df: df.groupby('group')[['bad']].median(), df, expect_error=True)
+
+    self._run_test(
+        lambda df: df.groupby('group').bad.median(), df, expect_error=True)
+
+    # non-existent grouping label
+    self._run_test(
+        lambda df: df.groupby(['really_bad', 'foo', 'bad']).foo.sum(),
+        df,
+        expect_error=True)
+    self._run_test(
+        lambda df: df.groupby('bad').foo.sum(), df, expect_error=True)
+
+  def test_set_index(self):
+    df = pd.DataFrame({
+        # Generate some unique columns to use for indexes

Review comment:
       Thanks, these are good questions :)
   
   Partitioning is done by hashing the index:
   
https://github.com/apache/beam/blob/befcc3d780d561e81f23512742862a65c0ae3b69/sdks/python/apache_beam/dataframe/partitionings.py#L107-L116
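   
   For illustration, here's a minimal sketch of that idea (the helper name is hypothetical, not Beam's actual API; the real logic lives in `partitionings.py` linked above):
   
```python
import pandas as pd

# Illustrative sketch of hash-based index partitioning; a hypothetical
# helper loosely modeled on the partitionings.py logic linked above.
def partition_by_index(df, num_partitions):
  # Hash each index value to a uint64, then bucket by modulo.
  hashes = pd.util.hash_array(df.index.to_numpy())
  for key in range(num_partitions):
    yield key, df[hashes % num_partitions == key]
```
   
   With only a few distinct index values, nothing prevents all of them from hashing into the same bucket, which is the lucky/unlucky behavior described below.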
   
   So depending on what happens with the [hashing](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.util.hash_array.html) we could get lucky or unlucky with the partitioning. A larger input doesn't even remove the risk, just reduces it.
   
   Another (better?) guard might be to run tests multiple times with different hash keys (or a single randomized hash key, but that would introduce flakiness).
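   
   As a sketch of what varying the key could look like (this uses plain `pandas.util.hash_array`, not a Beam API; note `hash_key` must be a 16-character string and only affects object/string arrays):
   
```python
import numpy as np
import pandas as pd

# Same values, different 16-character hash keys: the bucket assignment
# (hash % num_partitions) can change from one key to the next.
values = np.array(['a', 'b', 'c', 'd'], dtype=object)
for key in ('0123456789123456', 'abcdefghijklmnop'):
  buckets = pd.util.hash_array(values, hash_key=key) % 2
  print(key, buckets)
```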



