gaogaotiantian commented on code in PR #54162:
URL: https://github.com/apache/spark/pull/54162#discussion_r2771409324
##########
python/pyspark/pandas/tests/groupby/test_stat.py:
##########
@@ -14,18 +14,22 @@
# See the License for the specific language governing permissions and
# limitations under the License.
#
+from typing import Optional, Tuple, Type
import numpy as np
import pandas as pd
from pyspark import pandas as ps
+from pyspark.loose_version import LooseVersion
from pyspark.testing.pandasutils import PandasOnSparkTestCase
from pyspark.testing.sqlutils import SQLTestUtils
class GroupbyStatTestingFuncMixin:
# TODO: All statistical functions should leverage this utility
-    def _test_stat_func(self, func, check_exact=True):
+    def _test_stat_func(
+        self, func, check_exact=True, expected_error: Optional[Tuple[Type[Exception], str]] = None
Review Comment:
Let's imagine how the code will look after we drop support for pandas 2.x,
and how easy it will be to remove the compatibility code.
I think checking the expected error is fine, but the message is probably a
bit of overkill. For a case like an invalid argument, we can just use
`assertRaises` - it should be simple to figure out what happened when it
changes again. That will make our code look cleaner.
We do have a lot of `assertRaises` in our test suite. I think
`assertRaisesRegex` is helpful to pick up a very specific exception in a
relatively complicated process.
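As a rough side-by-side of the two assertion styles (a standalone sketch, not the actual PR test code; the exception and message here are made up):

```python
import unittest

class AssertRaisesDemo(unittest.TestCase):
    def test_type_only(self):
        # Plain assertRaises: only the exception type matters, so the test
        # stays stable when the message wording changes between versions.
        with self.assertRaises(TypeError):
            raise TypeError("numeric_only must be boolean")

    def test_type_and_message(self):
        # assertRaisesRegex also pins the message, which helps pick out one
        # specific failure in a relatively complicated process.
        with self.assertRaisesRegex(TypeError, "numeric_only"):
            raise TypeError("numeric_only must be boolean")

result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(AssertRaisesDemo)
)
```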
##########
python/pyspark/pandas/tests/groupby/test_stat.py:
##########
@@ -35,11 +39,23 @@ def _test_stat_func(self, func, check_exact=True):
            # Against SeriesGroupBy
            (pdf.groupby("A")["B"], psdf.groupby("A")["B"]),
        ]:
-            self.assert_eq(
-                func(p_groupby_obj).sort_index(),
-                func(ps_groupby_obj).sort_index(),
-                check_exact=check_exact,
-            )
+            if expected_error is None:
+                self.assert_eq(
+                    func(p_groupby_obj).sort_index(),
+                    func(ps_groupby_obj).sort_index(),
+                    check_exact=check_exact,
+                )
+            else:
+                error_type, error_message = expected_error
+                self.assertRaisesRegex(error_type, error_message, lambda: func(p_groupby_obj))
+                self.assertRaisesRegex(error_type, error_message, lambda: func(ps_groupby_obj))
+
+    @property
+    def expected_error_numeric_only(self) -> Optional[Tuple[Type[Exception], str]]:
Review Comment:
I don't think this is necessary. Let's imagine the code base after we drop
support for pandas 2; this would be
```python
def expected_error_numeric_only(self):
return (ValueError, "numeric_only accepts only Boolean values")
```
If we drop the message, it would be
```python
def expected_error_numeric_only(self):
return ValueError
```
Then we would probably just need to remove the method again and update all
the places that call it.
I think for intermediate code (where we try to support both), it would be
better to have the code for different versions listed together - then when
we need to drop one, it's trivial: just delete the if/else branch.
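In other words, something shaped roughly like this (names are illustrative, with a toy version parser standing in for `LooseVersion`):

```python
def _version_tuple(v):
    # Toy stand-in for LooseVersion, for illustration only.
    return tuple(int(p) for p in v.split(".") if p.isdigit())

def expected_error_numeric_only(pandas_version):
    # Both version branches live right next to each other, so dropping
    # pandas 2.x support later is just deleting the else branch.
    if _version_tuple(pandas_version) >= (3, 0):
        return TypeError  # pandas 3.x rejects a non-boolean numeric_only
    else:
        return None  # no error expected on pandas 2.x
```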
##########
python/pyspark/pandas/tests/groupby/test_stat.py:
##########
@@ -35,11 +39,23 @@ def _test_stat_func(self, func, check_exact=True):
            # Against SeriesGroupBy
            (pdf.groupby("A")["B"], psdf.groupby("A")["B"]),
        ]:
-            self.assert_eq(
-                func(p_groupby_obj).sort_index(),
-                func(ps_groupby_obj).sort_index(),
-                check_exact=check_exact,
-            )
+            if expected_error is None:
+                self.assert_eq(
+                    func(p_groupby_obj).sort_index(),
+                    func(ps_groupby_obj).sort_index(),
+                    check_exact=check_exact,
+                )
+            else:
+                error_type, error_message = expected_error
+                self.assertRaisesRegex(error_type, error_message, lambda: func(p_groupby_obj))
+                self.assertRaisesRegex(error_type, error_message, lambda: func(ps_groupby_obj))
+
+    @property
+    def expected_error_numeric_only(self) -> Optional[Tuple[Type[Exception], str]]:
+        if LooseVersion(pd.__version__) >= LooseVersion("3.0"):
Review Comment:
I don't think there's a right or wrong here, but I've been using `if
LooseVersion(pd.__version__) < "3.0.0"`. I think `<` and `>=` both work. I
prefer the raw string just because it's shorter, and "3.0.0" is the actual
version of pandas.
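The raw-string form works because the comparison coerces the right-hand side; a toy illustration of the mechanism (not the real pyspark class):

```python
class ToyVersion:
    # Minimal illustration of why `LooseVersion(pd.__version__) < "3.0.0"`
    # works: the comparison coerces a raw string operand on the fly.
    def __init__(self, vstring):
        self.parts = tuple(int(p) for p in vstring.split(".") if p.isdigit())

    def __lt__(self, other):
        if isinstance(other, str):
            other = ToyVersion(other)
        return self.parts < other.parts

print(ToyVersion("2.2.3") < "3.0.0")  # True
print(ToyVersion("3.0.0") < "3.0.0")  # False
```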
##########
python/pyspark/pandas/tests/groupby/test_stat.py:
##########
@@ -60,21 +76,33 @@ def psdf(self):
    def test_mean(self):
        self._test_stat_func(lambda groupby_obj: groupby_obj.mean(numeric_only=True))
+        if LooseVersion(pd.__version__) >= LooseVersion("3.0"):
Review Comment:
I think it's debatable whether we need to start testing new features in
pandas 3 at this point. Maybe? It's not part of the work to "fix the current
test".
Actually, if we throw away `self.expected_error_numeric_only`, you can run
this test in both 2 and 3, right? They both raise an error.
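If the message check is dropped, the version-agnostic shape could look something like this (the groupby call is replaced by a hypothetical stand-in here; the tuple form covers the case where the exception class itself differs between versions):

```python
import unittest

class NumericOnlyDemo(unittest.TestCase):
    def _mean(self, numeric_only):
        # Hypothetical stand-in for the real groupby mean(), assumed here
        # to reject non-boolean numeric_only values.
        if not isinstance(numeric_only, bool):
            raise TypeError("numeric_only must be boolean")
        return 0.0

    def test_invalid_numeric_only(self):
        # assertRaises accepts a tuple of exception types, so one test can
        # pass even if the raised class differs across pandas versions.
        with self.assertRaises((TypeError, ValueError)):
            self._mean(numeric_only="yes")

result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(NumericOnlyDemo)
)
```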
##########
python/pyspark/pandas/groupby.py:
##########
@@ -492,6 +493,7 @@ def first(self, numeric_only: Optional[bool] = False, min_count: int = -1) -> Fr
        a 1.0 True 3.0
        b NaN None NaN
        """
+        validate_numeric_only(numeric_only)
Review Comment:
Same here - I don't think we need a util function. Imagine after we drop
2.x: this should be just an `isinstance` check like the one below, which is
very easy to read. I think it should just be
```python
if LooseVersion(pd.__version__) >= "3.0.0":
    if not isinstance(numeric_only, bool):
        raise TypeError("numeric_only must be boolean")
```
Then when we need to drop 2.x, we just search for all the
`LooseVersion(pd.__version__)` checks and fix them accordingly - hopefully
all in the same file.
BTW, this should be a `TypeError`.
##########
python/pyspark/pandas/tests/groupby/test_stat.py:
##########
@@ -35,11 +39,23 @@ def _test_stat_func(self, func, check_exact=True):
            # Against SeriesGroupBy
            (pdf.groupby("A")["B"], psdf.groupby("A")["B"]),
        ]:
-            self.assert_eq(
-                func(p_groupby_obj).sort_index(),
-                func(ps_groupby_obj).sort_index(),
-                check_exact=check_exact,
-            )
+            if expected_error is None:
+                self.assert_eq(
+                    func(p_groupby_obj).sort_index(),
+                    func(ps_groupby_obj).sort_index(),
+                    check_exact=check_exact,
+                )
+            else:
+                error_type, error_message = expected_error
+                self.assertRaisesRegex(error_type, error_message, lambda: func(p_groupby_obj))
Review Comment:
We have both patterns so I guess it's not a huge deal, but personally I like
```python
with self.assertRaisesRegex(...):
func()
```
better. I don't really like random lambda functions :)
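For comparison, both forms side by side (a standalone sketch; the exception and message are made up):

```python
import unittest

class StyleDemo(unittest.TestCase):
    def _raise(self):
        raise ValueError("numeric_only accepts only Boolean values")

    def test_callable_style(self):
        # Callable form: the statement under test is tucked into a lambda.
        self.assertRaisesRegex(ValueError, "Boolean", lambda: self._raise())

    def test_context_manager_style(self):
        # Context-manager form: the statement under test reads naturally.
        with self.assertRaisesRegex(ValueError, "Boolean"):
            self._raise()

result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(StyleDemo)
)
```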
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
For additional commands, e-mail: [email protected]