[GitHub] [spark] ueshin commented on a change in pull request #33625: [SPARK-36397][PYTHON] Implement DataFrame.mode

GitBox Fri, 06 Aug 2021 13:25:04 -0700


ueshin commented on a change in pull request #33625:
URL: https://github.com/apache/spark/pull/33625#discussion_r684421935




##########
File path: python/pyspark/pandas/frame.py
##########
@@ -3459,6 +3458,111 @@ def mask(
         cond_inversed = cond._apply_series_op(lambda psser: ~psser)
         return self.where(cond_inversed, other)
 
+    # TODO: Support axis as 1 or 'columns'
+    def mode(
+        self, axis: Union[int, str] = 0, numeric_only: bool = False, dropna: bool = True

Review comment:
       nit: `Axis` instead of `Union[int, str]`?
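
For illustration, a minimal sketch of what the alias buys in a signature. `Axis` is redefined locally so the snippet runs on its own, on the assumption that the project's alias is equivalent to `Union[int, str]`; `normalize_axis` is a hypothetical helper and not part of the PR.

```python
from typing import Union

# Assumption: local stand-in for the project's `Axis` alias,
# taken here to be equivalent to Union[int, str].
Axis = Union[int, str]


def normalize_axis(axis: Axis = 0) -> int:
    """Hypothetical helper: map an axis name or number to 0 or 1."""
    mapping = {0: 0, "index": 0, 1: 1, "columns": 1}
    if axis not in mapping:
        raise ValueError("No axis named {0}".format(axis))
    return mapping[axis]
```

With the alias, the signature would read `axis: Axis = 0` and stay consistent with other methods that accept an axis argument.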

##########
File path: python/pyspark/pandas/frame.py
##########
@@ -3459,6 +3458,111 @@ def mask(
         cond_inversed = cond._apply_series_op(lambda psser: ~psser)
         return self.where(cond_inversed, other)
 
+    # TODO: Support axis as 1 or 'columns'
+    def mode(
+        self, axis: Union[int, str] = 0, numeric_only: bool = False, dropna: bool = True
+    ) -> "DataFrame":
+        """
+        Get the mode(s) of each element along the selected axis.
+
+        The mode of a set of values is the value that appears most often.
+        It can be multiple values.
+
+        Parameters
+        ----------
+        axis : {0 or 'index', 1 or 'columns'}, default 0
+            The axis to iterate over while searching for the mode:
+            * 0 or 'index' : get mode of each column
+            * 1 or 'columns' : get mode of each row.
+
+        numeric_only : bool, default False
+            If True, only apply to numeric columns.
+
+        dropna : bool, default True
+            Don't consider counts of NaN/NaT.
+
+        Returns
+        -------
+        DataFrame
+            The modes of each column or row.
+
+        See Also
+        --------
+        Series.mode : Return the highest frequency value in a Series.
+        Series.value_counts : Return the counts of values in a Series.
+
+        Examples
+        --------
+        >>> psdf = ps.DataFrame(
+        ...     [("bird", 2, 2), ("mammal", 4, np.nan), ("arthropod", 8, 0), ("bird", 2, np.nan)],
+        ...     index=("falcon", "horse", "spider", "ostrich"),
+        ...     columns=("species", "legs", "wings"),
+        ... )
+        >>> psdf
+                   species  legs  wings
+        falcon        bird     2    2.0
+        horse       mammal     4    NaN
+        spider   arthropod     8    0.0
+        ostrich       bird     2    NaN
+
+        >>> psdf.mode()  # doctest: +SKIP
+          species  legs  wings
+        0    bird   2.0    0.0
+        1    None   NaN    2.0
+
+        >>> psdf.mode(dropna=False)
+          species  legs  wings
+        0    bird     2    NaN
+
+        >>> psdf.mode(numeric_only=True)  # doctest: +SKIP
+           legs  wings
+        0   2.0    0.0
+        1   NaN    2.0
+
+        Notes
+        -----
+        The current implementation of mode requires joins multiple times
+        (columns count - 1 times when axis is 0 or 'index'), which is potentially expensive.
+
+        The order of multiple modes (within each column when axis is 0 or 'index') is undetermined.

Review comment:
       I guess the `Notes` section should be above the `Examples` section. (Or should it be above the `Parameters` section?)
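
As a rough illustration of the ordering: numpydoc's convention places `Notes` after `See Also` and before `Examples`. The sketch below is a standalone toy function (`column_mode` is hypothetical and unrelated to the PR's implementation) whose docstring follows that order.

```python
from collections import Counter
from typing import Hashable, Iterable, List


def column_mode(values: Iterable[Hashable]) -> List[Hashable]:
    """
    Return the most frequent value(s) of a single column.

    Parameters
    ----------
    values : iterable of hashable
        The column's values.

    Returns
    -------
    list
        Every value tied for the highest count.

    See Also
    --------
    collections.Counter : Used here for counting.

    Notes
    -----
    Ties are returned in first-seen order; this helper makes no
    attempt to mirror the ordering of pandas.

    Examples
    --------
    >>> column_mode([2, 4, 8, 2])
    [2]
    """
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    # Counter keeps insertion order, so ties come out in first-seen order.
    return [value for value, count in counts.items() if count == top]
```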

##########
File path: python/pyspark/pandas/tests/test_dataframe.py
##########
@@ -1900,6 +1900,29 @@ def test_isin(self):
         with self.assertRaisesRegex(TypeError, msg):
             psdf.isin(1)
 
+    def test_mode(self):
+        pdf = pd.DataFrame(
+            [("bird", 2, 2), ("mammal", 4, 0), ("arthropod", 8, 0), ("bird", 2, np.nan)],
+            index=("falcon", "horse", "spider", "ostrich"),
+            columns=("species", "legs", "wings"),
+        )
+        psdf = ps.from_pandas(pdf)
+        self.assert_eq(
+            psdf.mode(),
+            pdf.mode(),
+        )
+        self.assert_eq(
+            psdf.mode(numeric_only=True),
+            pdf.mode(numeric_only=True),
+        )

Review comment:
       Could you add tests with `pdf[[]].mode()`?
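
A possible starting point for such a test, sketched with plain pandas only. The assumption is that selecting with an empty list yields a frame without columns and that `mode()` on it returns another frame without columns; the resulting index is deliberately not checked, since it may differ between pandas versions.

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame(
    [("bird", 2, 2), ("mammal", 4, 0), ("arthropod", 8, 0), ("bird", 2, np.nan)],
    index=("falcon", "horse", "spider", "ostrich"),
    columns=("species", "legs", "wings"),
)

# Selecting with an empty list keeps the index but drops every column.
empty_pdf = pdf[[]]

# Mode of a frame without columns; only the (empty) column set is inspected.
empty_mode = empty_pdf.mode()
print(empty_mode.shape)
```

In the test suite, the pandas-on-Spark result would presumably be compared in the same style as the existing assertions, for example `self.assert_eq(psdf[[]].mode(), pdf[[]].mode())`.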




