TheNeuralBit commented on a change in pull request #13853:
URL: https://github.com/apache/beam/pull/13853#discussion_r572202678
##########
File path: sdks/python/apache_beam/dataframe/frames_test.py
##########
@@ -182,14 +194,53 @@ def test_groupby_project(self):
     self._run_test(lambda df: df.groupby('group')['foo'].median(), df)
     self._run_test(lambda df: df.groupby('group')['baz'].median(), df)
     self._run_test(lambda df: df.groupby('group')[['bar', 'baz']].median(), df)
+
+  def test_groupby_errors(self):
+    df = pd.DataFrame({
+        'group': ['a' if i % 5 == 0 or i % 3 == 0 else 'b' for i in range(100)],
+        'foo': [None if i % 11 == 0 else i for i in range(100)],
+        'bar': [None if i % 7 == 0 else 99 - i for i in range(100)],
+        'baz': [None if i % 13 == 0 else i * 2 for i in range(100)],
+    })
+
+    # non-existent projection column
     self._run_test(
         lambda df: df.groupby('group')[['bar', 'baz']].bar.median(),
         df,
         expect_error=True)
     self._run_test(
-        lambda df: df.groupby('group')[['bat']].median(), df,
-        expect_error=True)
+        lambda df: df.groupby('group')[['bad']].median(), df,
+        expect_error=True)
+
+    self._run_test(
+        lambda df: df.groupby('group').bad.median(), df, expect_error=True)
+
+    # non-existent grouping label
+    self._run_test(
+        lambda df: df.groupby(['really_bad', 'foo', 'bad']).foo.sum(),
+        df,
+        expect_error=True)
+    self._run_test(
+        lambda df: df.groupby('bad').foo.sum(), df, expect_error=True)
+
+  def test_set_index(self):
+    df = pd.DataFrame({
+        # Generate some unique columns to use for indexes
Review comment:
Thanks, these are good questions :)
Partitioning is done by hashing the index:
https://github.com/apache/beam/blob/befcc3d780d561e81f23512742862a65c0ae3b69/sdks/python/apache_beam/dataframe/partitionings.py#L107-L116
So depending on how the
[hashing](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.util.hash_array.html)
happens to distribute the index values, we could get lucky or unlucky with the
partitioning. A larger input doesn't eliminate the risk, it only reduces it.
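As a rough illustration (plain pandas, not the actual Beam implementation; the
`partition_by_index` helper and the partition count are made up here),
index-hash partitioning looks something like:

```python
import pandas as pd

def partition_by_index(df, num_partitions, hash_key='0123456789123456'):
  # Hash each index value (the default 16-byte pandas hash key is shown
  # explicitly) and bucket each row by its hash modulo the partition count.
  hashes = pd.util.hash_array(df.index.to_numpy(), hash_key=hash_key)
  return {i: df[hashes % num_partitions == i] for i in range(num_partitions)}
```

Rows with equal index values always land in the same partition, which is what
per-key operations like joins rely on.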
Another (better?) guard might be to run the tests multiple times with
different hash keys (or a single randomized hash key, but that would introduce
flakiness). We do need to be careful that the hash key stays consistent within
an execution, though, since we sometimes rely on the partitioning for joins.
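A hedged sketch of that multi-key guard (the `HASH_KEYS` list and helper below
are hypothetical, not part of this PR):

```python
import pandas as pd

# Hypothetical fixed 16-byte hash keys: re-running under several keys makes
# it unlikely that every one of them partitions the test data "luckily".
HASH_KEYS = ['0123456789123456', 'abcdefghijklmnop']

def run_with_each_hash_key(test_fn, df, num_partitions=3):
  for hash_key in HASH_KEYS:
    # The key stays fixed for the whole simulated execution, mirroring the
    # consistency requirement for joins noted above.
    hashes = pd.util.hash_array(df.index.to_numpy(), hash_key=hash_key)
    test_fn([df[hashes % num_partitions == i] for i in range(num_partitions)])
```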
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]