allisonwang-db commented on code in PR #42596:
URL: https://github.com/apache/spark/pull/42596#discussion_r1301927648
##########
python/pyspark/sql/functions.py:
##########
@@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Col
 @try_remote_functions
 def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Column:
-    """Aggregate function: returns a new :class:`~pyspark.sql.Column` for approximate distinct count
-    of column `col`.
+    """
+    Applies an aggregate function to return an approximate distinct count of the specified column.

-    .. versionadded:: 2.1.0
+    This function returns a new :class:`~pyspark.sql.Column` that estimates the number of distinct
+    elements in a column or a group of columns.

-    .. versionchanged:: 3.4.0
-        Supports Spark Connect.
+    .. versionadded:: 2.1.0

     .. versionchanged:: 3.4.0
         Supports Spark Connect.

     Parameters
     ----------
     col : :class:`~pyspark.sql.Column` or str
+        The label of the column to count distinct values in.
     rsd : float, optional
-        maximum relative standard deviation allowed (default = 0.05).
-        For rsd < 0.01, it is more efficient to use :func:`count_distinct`
+        The maximum allowed relative standard deviation (default = 0.05).
+        If rsd < 0.01, it would be more efficient to use :func:`count_distinct`.

     Returns
     -------
     :class:`~pyspark.sql.Column`
-        the column of computed results.
+        A new Column object representing the approximate unique count.
+
+    See Also
+    ----------
+    :meth:`pyspark.sql.functions.count_distinct`

     Examples
     --------
+    Example 1: Counting distinct values in a single column DataFrame representing integers
+
+    >>> from pyspark.sql.functions import approx_count_distinct
     >>> df = spark.createDataFrame([1,2,2,3], "INT")
     >>> df.agg(approx_count_distinct("value").alias('distinct_values')).show()
     +---------------+
     |distinct_values|
     +---------------+
     |              3|
     +---------------+
+
+    Example 2: Counting distinct values in a single column DataFrame representing strings
+
+    >>> from pyspark.sql.functions import approx_count_distinct
+    >>> df = spark.createDataFrame(["apple", "orange", "apple", "banana"], "string").toDF("fruit")
Review Comment:
Nit: instead of using `toDF`, can we specify the schema of the dataframe
when creating it?
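For instance, something like the following (a hypothetical rewrite that names the column in the schema string instead):

>>> df = spark.createDataFrame([("apple",), ("orange",), ("apple",), ("banana",)], "fruit string")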
##########
python/pyspark/sql/functions.py:
##########
@@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Col
 @try_remote_functions
 def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Column:
-    """Aggregate function: returns a new :class:`~pyspark.sql.Column` for approximate distinct count
-    of column `col`.
+    """
+    Applies an aggregate function to return an approximate distinct count of the specified column.

-    .. versionadded:: 2.1.0
+    This function returns a new :class:`~pyspark.sql.Column` that estimates the number of distinct
+    elements in a column or a group of columns.

-    .. versionchanged:: 3.4.0
-        Supports Spark Connect.
+    .. versionadded:: 2.1.0

     .. versionchanged:: 3.4.0
         Supports Spark Connect.

     Parameters
     ----------
     col : :class:`~pyspark.sql.Column` or str
+        The label of the column to count distinct values in.
     rsd : float, optional
-        maximum relative standard deviation allowed (default = 0.05).
-        For rsd < 0.01, it is more efficient to use :func:`count_distinct`
+        The maximum allowed relative standard deviation (default = 0.05).
+        If rsd < 0.01, it would be more efficient to use :func:`count_distinct`.

     Returns
     -------
     :class:`~pyspark.sql.Column`
-        the column of computed results.
+        A new Column object representing the approximate unique count.
+
+    See Also
+    ----------
+    :meth:`pyspark.sql.functions.count_distinct`

     Examples
     --------
+    Example 1: Counting distinct values in a single column DataFrame representing integers
+
+    >>> from pyspark.sql.functions import approx_count_distinct
     >>> df = spark.createDataFrame([1,2,2,3], "INT")
Review Comment:
Can we use lower case for `INT` here?
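For example:

>>> df = spark.createDataFrame([1,2,2,3], "int")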
##########
python/pyspark/sql/functions.py:
##########
@@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Col
 @try_remote_functions
 def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Column:
-    """Aggregate function: returns a new :class:`~pyspark.sql.Column` for approximate distinct count
-    of column `col`.
+    """
+    Applies an aggregate function to return an approximate distinct count of the specified column.

-    .. versionadded:: 2.1.0
+    This function returns a new :class:`~pyspark.sql.Column` that estimates the number of distinct
+    elements in a column or a group of columns.
Review Comment:
I think this is a bit redundant. Can we combine it with the previous
sentence?
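For example, the two could be merged into something like: "Aggregate function: returns a new :class:`~pyspark.sql.Column` that estimates the number of distinct elements of `col`."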
##########
python/pyspark/sql/functions.py:
##########
@@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Col
 @try_remote_functions
 def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Column:
-    """Aggregate function: returns a new :class:`~pyspark.sql.Column` for approximate distinct count
-    of column `col`.
+    """
+    Applies an aggregate function to return an approximate distinct count of the specified column.

-    .. versionadded:: 2.1.0
+    This function returns a new :class:`~pyspark.sql.Column` that estimates the number of distinct
+    elements in a column or a group of columns.

-    .. versionchanged:: 3.4.0
-        Supports Spark Connect.
+    .. versionadded:: 2.1.0

     .. versionchanged:: 3.4.0
         Supports Spark Connect.

     Parameters
     ----------
     col : :class:`~pyspark.sql.Column` or str
+        The label of the column to count distinct values in.
     rsd : float, optional
-        maximum relative standard deviation allowed (default = 0.05).
-        For rsd < 0.01, it is more efficient to use :func:`count_distinct`
+        The maximum allowed relative standard deviation (default = 0.05).
+        If rsd < 0.01, it would be more efficient to use :func:`count_distinct`.

     Returns
     -------
     :class:`~pyspark.sql.Column`
-        the column of computed results.
+        A new Column object representing the approximate unique count.
+
+    See Also
+    ----------
+    :meth:`pyspark.sql.functions.count_distinct`

     Examples
     --------
+    Example 1: Counting distinct values in a single column DataFrame representing integers
+
+    >>> from pyspark.sql.functions import approx_count_distinct
     >>> df = spark.createDataFrame([1,2,2,3], "INT")
     >>> df.agg(approx_count_distinct("value").alias('distinct_values')).show()
     +---------------+
     |distinct_values|
     +---------------+
     |              3|
     +---------------+
+
+    Example 2: Counting distinct values in a single column DataFrame representing strings
+
+    >>> from pyspark.sql.functions import approx_count_distinct
+    >>> df = spark.createDataFrame(["apple", "orange", "apple", "banana"], "string").toDF("fruit")
+    >>> df.agg(approx_count_distinct("fruit").alias('distinct_fruits')).show()
+    +---------------+
+    |distinct_fruits|
+    +---------------+
+    |              3|
+    +---------------+
+
+    Example 3: Counting distinct values in a DataFrame with multiple columns
+
+    >>> from pyspark.sql.functions import approx_count_distinct, struct
+    >>> df = spark.createDataFrame([("Alice", 1),
+    ...                             ("Alice", 2),
+    ...                             ("Bob", 3),
+    ...                             ("Bob", 3)], ["name", "value"])
+    >>> df = df.withColumn("combined", struct("name", "value"))
+    >>> df.agg(approx_count_distinct(df["combined"]).alias('distinct_pairs')).show()
Review Comment:
can we make this example consistent with the others by using
`approx_count_distinct("combined")`?
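i.e. something like:

>>> df.agg(approx_count_distinct("combined").alias('distinct_pairs')).show()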
##########
python/pyspark/sql/functions.py:
##########
@@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Col
 @try_remote_functions
 def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Column:
-    """Aggregate function: returns a new :class:`~pyspark.sql.Column` for approximate distinct count
-    of column `col`.
+    """
+    Applies an aggregate function to return an approximate distinct count of the specified column.

-    .. versionadded:: 2.1.0
+    This function returns a new :class:`~pyspark.sql.Column` that estimates the number of distinct
+    elements in a column or a group of columns.

-    .. versionchanged:: 3.4.0
-        Supports Spark Connect.
+    .. versionadded:: 2.1.0

     .. versionchanged:: 3.4.0
         Supports Spark Connect.

     Parameters
     ----------
     col : :class:`~pyspark.sql.Column` or str
+        The label of the column to count distinct values in.
     rsd : float, optional
-        maximum relative standard deviation allowed (default = 0.05).
-        For rsd < 0.01, it is more efficient to use :func:`count_distinct`
+        The maximum allowed relative standard deviation (default = 0.05).
+        If rsd < 0.01, it would be more efficient to use :func:`count_distinct`.

     Returns
     -------
     :class:`~pyspark.sql.Column`
-        the column of computed results.
+        A new Column object representing the approximate unique count.
+
+    See Also
+    ----------
+    :meth:`pyspark.sql.functions.count_distinct`

     Examples
     --------
+    Example 1: Counting distinct values in a single column DataFrame representing integers
+
+    >>> from pyspark.sql.functions import approx_count_distinct
     >>> df = spark.createDataFrame([1,2,2,3], "INT")
     >>> df.agg(approx_count_distinct("value").alias('distinct_values')).show()
     +---------------+
     |distinct_values|
     +---------------+
     |              3|
     +---------------+
+
+    Example 2: Counting distinct values in a single column DataFrame representing strings
+
+    >>> from pyspark.sql.functions import approx_count_distinct
+    >>> df = spark.createDataFrame(["apple", "orange", "apple", "banana"], "string").toDF("fruit")
+    >>> df.agg(approx_count_distinct("fruit").alias('distinct_fruits')).show()
+    +---------------+
+    |distinct_fruits|
+    +---------------+
+    |              3|
+    +---------------+
+
+    Example 3: Counting distinct values in a DataFrame with multiple columns
+
+    >>> from pyspark.sql.functions import approx_count_distinct, struct
+    >>> df = spark.createDataFrame([("Alice", 1),
+    ...                             ("Alice", 2),
+    ...                             ("Bob", 3),
+    ...                             ("Bob", 3)], ["name", "value"])
+    >>> df = df.withColumn("combined", struct("name", "value"))
+    >>> df.agg(approx_count_distinct(df["combined"]).alias('distinct_pairs')).show()
+    +--------------+
+    |distinct_pairs|
+    +--------------+
+    |             3|
+    +--------------+
+
+    Example 4: Counting distinct values with a specified relative standard deviation
+
+    >>> from pyspark.sql.functions import approx_count_distinct
+    >>> df = spark.range(100000)
+    >>> df.agg(approx_count_distinct("id", 0.1).alias('distinct_values')).show()
Review Comment:
can we compare the results of approx_count_distinct when 1) using the
default rsd and 2) using 0.1 in this example?
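A sketch of what that could look like (the counts are estimates, so the output is omitted here):

>>> df = spark.range(100000)
>>> df.agg(
...     approx_count_distinct("id").alias('with_default_rsd'),
...     approx_count_distinct("id", 0.1).alias('with_rsd_0_1')
... ).show()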
##########
python/pyspark/sql/functions.py:
##########
@@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Col
 @try_remote_functions
 def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Column:
-    """Aggregate function: returns a new :class:`~pyspark.sql.Column` for approximate distinct count
-    of column `col`.
+    """
+    Applies an aggregate function to return an approximate distinct count of the specified column.

-    .. versionadded:: 2.1.0
+    This function returns a new :class:`~pyspark.sql.Column` that estimates the number of distinct
+    elements in a column or a group of columns.

-    .. versionchanged:: 3.4.0
-        Supports Spark Connect.
+    .. versionadded:: 2.1.0

     .. versionchanged:: 3.4.0
         Supports Spark Connect.

     Parameters
     ----------
     col : :class:`~pyspark.sql.Column` or str
+        The label of the column to count distinct values in.
     rsd : float, optional
-        maximum relative standard deviation allowed (default = 0.05).
-        For rsd < 0.01, it is more efficient to use :func:`count_distinct`
+        The maximum allowed relative standard deviation (default = 0.05).
+        If rsd < 0.01, it would be more efficient to use :func:`count_distinct`.

     Returns
     -------
     :class:`~pyspark.sql.Column`
-        the column of computed results.
+        A new Column object representing the approximate unique count.
+
+    See Also
+    ----------
+    :meth:`pyspark.sql.functions.count_distinct`

     Examples
     --------
+    Example 1: Counting distinct values in a single column DataFrame representing integers
+
+    >>> from pyspark.sql.functions import approx_count_distinct
Review Comment:
yea I would also prefer `from pyspark.sql.functions import
approx_count_distinct`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]