gaogaotiantian commented on code in PR #54162:
URL: https://github.com/apache/spark/pull/54162#discussion_r2771409324
##########
python/pyspark/pandas/tests/groupby/test_stat.py:
##########
@@ -14,18 +14,22 @@
# See the License for the specific language governing permissions and
# limitations under the License.
#
+from typing import Optional, Tuple, Type
import numpy as np
import pandas as pd
from pyspark import pandas as ps
+from pyspark.loose_version import LooseVersion
from pyspark.testing.pandasutils import PandasOnSparkTestCase
from pyspark.testing.sqlutils import SQLTestUtils
class GroupbyStatTestingFuncMixin:
# TODO: All statistical functions should leverage this utility
-    def _test_stat_func(self, func, check_exact=True):
+    def _test_stat_func(
+        self, func, check_exact=True, expected_error: Optional[Tuple[Type[Exception], str]] = None
Review Comment:
Let's imagine how the code will look after we drop support for pandas 2.x,
and how easy it will be to remove the compatibility code.
I think checking the expected error is fine, but the message is probably a
bit of overkill. For a case like an invalid argument, we can just use
`assertRaises` - it should be simple to figure out what happened when it
changes again. That will make our code look cleaner.
We do have a lot of `assertRaises` in our test suite. I think
`assertRaisesRegex` is helpful to pick up a very specific exception in a
relatively complicated process.
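As a rough side-by-side of the two assertion styles (a standalone sketch, not the actual PR test code; the exception and message here are made up):

```python
import unittest

class AssertRaisesDemo(unittest.TestCase):
    def test_type_only(self):
        # Plain assertRaises: only the exception type matters, so the test
        # stays stable when the message wording changes between versions.
        with self.assertRaises(TypeError):
            raise TypeError("numeric_only must be boolean")

    def test_type_and_message(self):
        # assertRaisesRegex also pins the message, which helps pick out one
        # specific failure in a relatively complicated process.
        with self.assertRaisesRegex(TypeError, "numeric_only"):
            raise TypeError("numeric_only must be boolean")

result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(AssertRaisesDemo)
)
```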
##########
python/pyspark/pandas/tests/groupby/test_stat.py:
##########
@@ -35,11 +39,23 @@ def _test_stat_func(self, func, check_exact=True):
            # Against SeriesGroupBy
            (pdf.groupby("A")["B"], psdf.groupby("A")["B"]),
        ]:
-            self.assert_eq(
-                func(p_groupby_obj).sort_index(),
-                func(ps_groupby_obj).sort_index(),
-                check_exact=check_exact,
-            )
+            if expected_error is None:
+                self.assert_eq(
+                    func(p_groupby_obj).sort_index(),
+                    func(ps_groupby_obj).sort_index(),
+                    check_exact=check_exact,
+                )
+            else:
+                error_type, error_message = expected_error
+                self.assertRaisesRegex(error_type, error_message, lambda: func(p_groupby_obj))
+                self.assertRaisesRegex(error_type, error_message, lambda: func(ps_groupby_obj))
+
+    @property
+    def expected_error_numeric_only(self) -> Optional[Tuple[Type[Exception], str]]:
Review Comment:
I don't think this is necessary. Let's imagine the code base after we drop
support for pandas 2; this would be
```python
def expected_error_numeric_only(self):
return (ValueError, "numeric_only accepts only Boolean values")
```
If we drop the message, it would be
```python
def expected_error_numeric_only(self):
return ValueError
```
Then we would probably just need to remove the method again and update all
the places that call it.
I think for intermediate code (where we try to support both), it would be
better to have the code for different versions listed together - then when
we need to drop one, it's trivial: just delete the if/else branch.
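In other words, something shaped roughly like this (names are illustrative, with a toy version parser standing in for `LooseVersion`):

```python
def _version_tuple(v):
    # Toy stand-in for LooseVersion, for illustration only.
    return tuple(int(p) for p in v.split(".") if p.isdigit())

def expected_error_numeric_only(pandas_version):
    # Both version branches live right next to each other, so dropping
    # pandas 2.x support later is just deleting the else branch.
    if _version_tuple(pandas_version) >= (3, 0):
        return TypeError  # pandas 3.x rejects a non-boolean numeric_only
    else:
        return None  # no error expected on pandas 2.x
```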
##########
python/pyspark/pandas/tests/groupby/test_stat.py:
##########
@@ -35,11 +39,23 @@ def _test_stat_func(self, func, check_exact=True):
            # Against SeriesGroupBy
            (pdf.groupby("A")["B"], psdf.groupby("A")["B"]),
        ]:
-            self.assert_eq(
-                func(p_groupby_obj).sort_index(),
-                func(ps_groupby_obj).sort_index(),
-                check_exact=check_exact,
-            )
+            if expected_error is None:
+                self.assert_eq(
+                    func(p_groupby_obj).sort_index(),
+                    func(ps_groupby_obj).sort_index(),
+                    check_exact=check_exact,
+                )
+            else:
+                error_type, error_message = expected_error
+                self.assertRaisesRegex(error_type, error_message, lambda: func(p_groupby_obj))
+                self.assertRaisesRegex(error_type, error_message, lambda: func(ps_groupby_obj))
+
+    @property
+    def expected_error_numeric_only(self) -> Optional[Tuple[Type[Exception], str]]:
+        if LooseVersion(pd.__version__) >= LooseVersion("3.0"):
Review Comment:
I don't think there's a right or wrong here, but I've been using `if
LooseVersion(pd.__version__) < "3.0.0"`. I think `<` and `>=` both work. I
prefer the raw string just because it's shorter, and "3.0.0" is the actual
version of pandas.
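The raw-string form works because the comparison coerces the right-hand side; a toy illustration of the mechanism (not the real pyspark class):

```python
class ToyVersion:
    # Minimal illustration of why `LooseVersion(pd.__version__) < "3.0.0"`
    # works: the comparison coerces a raw string operand on the fly.
    def __init__(self, vstring):
        self.parts = tuple(int(p) for p in vstring.split(".") if p.isdigit())

    def __lt__(self, other):
        if isinstance(other, str):
            other = ToyVersion(other)
        return self.parts < other.parts

print(ToyVersion("2.2.3") < "3.0.0")  # True
print(ToyVersion("3.0.0") < "3.0.0")  # False
```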
##########
python/pyspark/pandas/tests/groupby/test_stat.py:
##########
@@ -60,21 +76,33 @@ def psdf(self):
    def test_mean(self):
        self._test_stat_func(lambda groupby_obj: groupby_obj.mean(numeric_only=True))
+        if LooseVersion(pd.__version__) >= LooseVersion("3.0"):
Review Comment:
I think it's debatable whether we need to start testing new features in
pandas 3 at this point. Maybe? It's not part of the work to "fix the current
test".
Actually, if we throw away `self.expected_error_numeric_only`, you can run
this test in both 2 and 3, right? They both raise an error.
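If the message check is dropped, the version-agnostic shape could look something like this (the groupby call is replaced by a hypothetical stand-in here; the tuple form covers the case where the exception class itself differs between versions):

```python
import unittest

class NumericOnlyDemo(unittest.TestCase):
    def _mean(self, numeric_only):
        # Hypothetical stand-in for the real groupby mean(), assumed here
        # to reject non-boolean numeric_only values.
        if not isinstance(numeric_only, bool):
            raise TypeError("numeric_only must be boolean")
        return 0.0

    def test_invalid_numeric_only(self):
        # assertRaises accepts a tuple of exception types, so one test can
        # pass even if the raised class differs across pandas versions.
        with self.assertRaises((TypeError, ValueError)):
            self._mean(numeric_only="yes")

result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(NumericOnlyDemo)
)
```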
##########
python/pyspark/pandas/groupby.py:
##########
@@ -492,6 +493,7 @@ def first(self, numeric_only: Optional[bool] = False, min_count: int = -1) -> Fr
        a 1.0 True 3.0
        b NaN None NaN
        """
+        validate_numeric_only(numeric_only)
Review Comment:
Same here - I don't think we need a util function. Imagine after we drop
2.x: this should be just an `isinstance` check like the one below, which is
very easy to read. I think it should just be
```python
if LooseVersion(pd.__version__) >= "3.0.0":
    if not isinstance(numeric_only, bool):
        raise TypeError("numeric_only must be boolean")
```
Then when we need to drop 2.x, we just search for all the
`LooseVersion(pd.__version__)` checks and fix them accordingly - hopefully
all in the same file.
BTW, this should be a `TypeError`.
##########
python/pyspark/pandas/tests/groupby/test_stat.py:
##########
@@ -35,11 +39,23 @@ def _test_stat_func(self, func, check_exact=True):
            # Against SeriesGroupBy
            (pdf.groupby("A")["B"], psdf.groupby("A")["B"]),
        ]:
-            self.assert_eq(
-                func(p_groupby_obj).sort_index(),
-                func(ps_groupby_obj).sort_index(),
-                check_exact=check_exact,
-            )
+            if expected_error is None:
+                self.assert_eq(
+                    func(p_groupby_obj).sort_index(),
+                    func(ps_groupby_obj).sort_index(),
+                    check_exact=check_exact,
+                )
+            else:
+                error_type, error_message = expected_error
+                self.assertRaisesRegex(error_type, error_message, lambda: func(p_groupby_obj))
Review Comment:
We have both patterns so I guess it's not a huge deal, but personally I like
```python
with self.assertRaisesRegex(...):
func()
```
better. I don't really like random lambda functions :)
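For comparison, both forms side by side (a standalone sketch; the exception and message are made up):

```python
import unittest

class StyleDemo(unittest.TestCase):
    def _raise(self):
        raise ValueError("numeric_only accepts only Boolean values")

    def test_callable_style(self):
        # Callable form: the statement under test is tucked into a lambda.
        self.assertRaisesRegex(ValueError, "Boolean", lambda: self._raise())

    def test_context_manager_style(self):
        # Context-manager form: the statement under test reads naturally.
        with self.assertRaisesRegex(ValueError, "Boolean"):
            self._raise()

result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(StyleDemo)
)
```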
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
For additional commands, e-mail: [email protected]