[GitHub] [spark] Yikun commented on a diff in pull request #36414: [SPARK-39077][PYTHON] Implement `skipna` of common statistical functions of DataFrame and Series

GitBox Sun, 08 May 2022 20:43:14 -0700


Yikun commented on code in PR #36414:
URL: https://github.com/apache/spark/pull/36414#discussion_r867604372



##########
python/pyspark/pandas/frame.py:
##########
@@ -583,6 +583,7 @@ def _reduce_for_stat_function(
         name: str,
         axis: Optional[Axis] = None,
         numeric_only: bool = True,
+        skipna: bool = True,

Review Comment:
   add parameter doc for `skipna`



##########
python/pyspark/pandas/generic.py:
##########
@@ -1312,11 +1326,20 @@ def sum(psser: "Series") -> Column:
             return F.coalesce(F.sum(spark_column), SF.lit(0))
 
         return self._reduce_for_stat_function(
-            sum, name="sum", axis=axis, numeric_only=numeric_only, 
min_count=min_count
+            sum,
+            name="sum",
+            axis=axis,
+            numeric_only=numeric_only,
+            min_count=min_count,
+            skipna=skipna,
         )
 
     def product(
-        self, axis: Optional[Axis] = None, numeric_only: bool = None, 
min_count: int = 0
+        self,
+        axis: Optional[Axis] = None,
+        skipna: bool = True,

Review Comment:
   +1 to keep pandas behavior, we might want to mention this in migration guide?



##########
python/pyspark/pandas/series.py:
##########
@@ -6859,13 +6860,16 @@ def _reduce_for_stat_function(
         sfun : the stats function to be used for aggregation
         name : original pandas API name.
         axis : used only for sanity check because series only support index 
axis.
-        numeric_only : not used by this implementation, but passed down by 
stats functions
+        numeric_only : not used by this implementation, but passed down by 
stats functions.
         """
         axis = validate_axis(axis)
         if axis == 1:
             raise NotImplementedError("Series does not support columns axis.")
 
-        scol = sfun(self)
+        if not skipna and self.hasnans:

Review Comment:
   +1, see 
https://github.com/apache/spark/commit/e031d00b52e1973ea62f0a80da948f169c73dad9 
as ref



##########
python/pyspark/pandas/series.py:
##########
@@ -6859,13 +6860,16 @@ def _reduce_for_stat_function(
         sfun : the stats function to be used for aggregation
         name : original pandas API name.
         axis : used only for sanity check because series only support index 
axis.
-        numeric_only : not used by this implementation, but passed down by 
stats functions
+        numeric_only : not used by this implementation, but passed down by 
stats functions.

Review Comment:
   doc for `skipna`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] Yikun commented on a diff in pull request #36414: [SPARK-39077][PYTHON] Implement `skipna` of common statistical functions of DataFrame and Series

Reply via email to