allisonwang-db commented on code in PR #42596:
URL: https://github.com/apache/spark/pull/42596#discussion_r1301927648
##########
python/pyspark/sql/functions.py:
##########
@@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Col
 @try_remote_functions
 def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Column:
-    """Aggregate function: returns a new :class:`~pyspark.sql.Column` for approximate distinct count
-    of column `col`.
+    """
+    Applies an aggregate function to return an approximate distinct count of the specified column.

-    .. versionadded:: 2.1.0
+    This function returns a new :class:`~pyspark.sql.Column` that estimates the number of distinct
+    elements in a column or a group of columns.

-    .. versionchanged:: 3.4.0
-        Supports Spark Connect.
+    .. versionadded:: 2.1.0

     .. versionchanged:: 3.4.0
         Supports Spark Connect.

     Parameters
     ----------
     col : :class:`~pyspark.sql.Column` or str
+        The label of the column to count distinct values in.
     rsd : float, optional
-        maximum relative standard deviation allowed (default = 0.05).
-        For rsd < 0.01, it is more efficient to use :func:`count_distinct`
+        The maximum allowed relative standard deviation (default = 0.05).
+        If rsd < 0.01, it would be more efficient to use :func:`count_distinct`.

     Returns
     -------
     :class:`~pyspark.sql.Column`
-        the column of computed results.
+        A new Column object representing the approximate unique count.
+
+    See Also
+    ----------
+    :meth:`pyspark.sql.functions.count_distinct`

     Examples
     --------
+    Example 1: Counting distinct values in a single column DataFrame representing integers
+
+    >>> from pyspark.sql.functions import approx_count_distinct
     >>> df = spark.createDataFrame([1,2,2,3], "INT")
     >>> df.agg(approx_count_distinct("value").alias('distinct_values')).show()
     +---------------+
     |distinct_values|
     +---------------+
     |              3|
     +---------------+
+
+    Example 2: Counting distinct values in a single column DataFrame representing strings
+
+    >>> from pyspark.sql.functions import approx_count_distinct
+    >>> df = spark.createDataFrame(["apple", "orange", "apple", "banana"], "string").toDF("fruit")
Review Comment:
Nit: instead of using `toDF`, can we specify the schema of the dataframe
when creating it?
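For instance, something like the following (a hypothetical rewrite that names the column in the schema string instead):

>>> df = spark.createDataFrame([("apple",), ("orange",), ("apple",), ("banana",)], "fruit string")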
##########
python/pyspark/sql/functions.py:
##########
@@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Col
 @try_remote_functions
 def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Column:
-    """Aggregate function: returns a new :class:`~pyspark.sql.Column` for approximate distinct count
-    of column `col`.
+    """
+    Applies an aggregate function to return an approximate distinct count of the specified column.

-    .. versionadded:: 2.1.0
+    This function returns a new :class:`~pyspark.sql.Column` that estimates the number of distinct
+    elements in a column or a group of columns.

-    .. versionchanged:: 3.4.0
-        Supports Spark Connect.
+    .. versionadded:: 2.1.0

     .. versionchanged:: 3.4.0
         Supports Spark Connect.

     Parameters
     ----------
     col : :class:`~pyspark.sql.Column` or str
+        The label of the column to count distinct values in.
     rsd : float, optional
-        maximum relative standard deviation allowed (default = 0.05).
-        For rsd < 0.01, it is more efficient to use :func:`count_distinct`
+        The maximum allowed relative standard deviation (default = 0.05).
+        If rsd < 0.01, it would be more efficient to use :func:`count_distinct`.

     Returns
     -------
     :class:`~pyspark.sql.Column`
-        the column of computed results.
+        A new Column object representing the approximate unique count.
+
+    See Also
+    ----------
+    :meth:`pyspark.sql.functions.count_distinct`

     Examples
     --------
+    Example 1: Counting distinct values in a single column DataFrame representing integers
+
+    >>> from pyspark.sql.functions import approx_count_distinct
     >>> df = spark.createDataFrame([1,2,2,3], "INT")
Review Comment:
Can we use lower case for `INT` here?
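For example:

>>> df = spark.createDataFrame([1,2,2,3], "int")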
##########
python/pyspark/sql/functions.py:
##########
@@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Col
 @try_remote_functions
 def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Column:
-    """Aggregate function: returns a new :class:`~pyspark.sql.Column` for approximate distinct count
-    of column `col`.
+    """
+    Applies an aggregate function to return an approximate distinct count of the specified column.

-    .. versionadded:: 2.1.0
+    This function returns a new :class:`~pyspark.sql.Column` that estimates the number of distinct
+    elements in a column or a group of columns.
Review Comment:
I think this is a bit redundant. Can we combine it with the previous
sentence?
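For example, the two could be merged into something like: "Aggregate function: returns a new :class:`~pyspark.sql.Column` that estimates the number of distinct elements of `col`."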
##########
python/pyspark/sql/functions.py:
##########
@@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Col
 @try_remote_functions
 def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Column:
-    """Aggregate function: returns a new :class:`~pyspark.sql.Column` for approximate distinct count
-    of column `col`.
+    """
+    Applies an aggregate function to return an approximate distinct count of the specified column.

-    .. versionadded:: 2.1.0
+    This function returns a new :class:`~pyspark.sql.Column` that estimates the number of distinct
+    elements in a column or a group of columns.

-    .. versionchanged:: 3.4.0
-        Supports Spark Connect.
+    .. versionadded:: 2.1.0

     .. versionchanged:: 3.4.0
         Supports Spark Connect.

     Parameters
     ----------
     col : :class:`~pyspark.sql.Column` or str
+        The label of the column to count distinct values in.
     rsd : float, optional
-        maximum relative standard deviation allowed (default = 0.05).
-        For rsd < 0.01, it is more efficient to use :func:`count_distinct`
+        The maximum allowed relative standard deviation (default = 0.05).
+        If rsd < 0.01, it would be more efficient to use :func:`count_distinct`.

     Returns
     -------
     :class:`~pyspark.sql.Column`
-        the column of computed results.
+        A new Column object representing the approximate unique count.
+
+    See Also
+    ----------
+    :meth:`pyspark.sql.functions.count_distinct`

     Examples
     --------
+    Example 1: Counting distinct values in a single column DataFrame representing integers
+
+    >>> from pyspark.sql.functions import approx_count_distinct
     >>> df = spark.createDataFrame([1,2,2,3], "INT")
     >>> df.agg(approx_count_distinct("value").alias('distinct_values')).show()
     +---------------+
     |distinct_values|
     +---------------+
     |              3|
     +---------------+
+
+    Example 2: Counting distinct values in a single column DataFrame representing strings
+
+    >>> from pyspark.sql.functions import approx_count_distinct
+    >>> df = spark.createDataFrame(["apple", "orange", "apple", "banana"], "string").toDF("fruit")
+    >>> df.agg(approx_count_distinct("fruit").alias('distinct_fruits')).show()
+    +---------------+
+    |distinct_fruits|
+    +---------------+
+    |              3|
+    +---------------+
+
+    Example 3: Counting distinct values in a DataFrame with multiple columns
+
+    >>> from pyspark.sql.functions import approx_count_distinct, struct
+    >>> df = spark.createDataFrame([("Alice", 1),
+    ...                             ("Alice", 2),
+    ...                             ("Bob", 3),
+    ...                             ("Bob", 3)], ["name", "value"])
+    >>> df = df.withColumn("combined", struct("name", "value"))
+    >>> df.agg(approx_count_distinct(df["combined"]).alias('distinct_pairs')).show()
Review Comment:
can we make this example consistent with the others by using
`approx_count_distinct("combined")`?
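i.e. something like:

>>> df.agg(approx_count_distinct("combined").alias('distinct_pairs')).show()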
##########
python/pyspark/sql/functions.py:
##########
@@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Col
 @try_remote_functions
 def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Column:
-    """Aggregate function: returns a new :class:`~pyspark.sql.Column` for approximate distinct count
-    of column `col`.
+    """
+    Applies an aggregate function to return an approximate distinct count of the specified column.

-    .. versionadded:: 2.1.0
+    This function returns a new :class:`~pyspark.sql.Column` that estimates the number of distinct
+    elements in a column or a group of columns.

-    .. versionchanged:: 3.4.0
-        Supports Spark Connect.
+    .. versionadded:: 2.1.0

     .. versionchanged:: 3.4.0
         Supports Spark Connect.

     Parameters
     ----------
     col : :class:`~pyspark.sql.Column` or str
+        The label of the column to count distinct values in.
     rsd : float, optional
-        maximum relative standard deviation allowed (default = 0.05).
-        For rsd < 0.01, it is more efficient to use :func:`count_distinct`
+        The maximum allowed relative standard deviation (default = 0.05).
+        If rsd < 0.01, it would be more efficient to use :func:`count_distinct`.

     Returns
     -------
     :class:`~pyspark.sql.Column`
-        the column of computed results.
+        A new Column object representing the approximate unique count.
+
+    See Also
+    ----------
+    :meth:`pyspark.sql.functions.count_distinct`

     Examples
     --------
+    Example 1: Counting distinct values in a single column DataFrame representing integers
+
+    >>> from pyspark.sql.functions import approx_count_distinct
     >>> df = spark.createDataFrame([1,2,2,3], "INT")
     >>> df.agg(approx_count_distinct("value").alias('distinct_values')).show()
     +---------------+
     |distinct_values|
     +---------------+
     |              3|
     +---------------+
+
+    Example 2: Counting distinct values in a single column DataFrame representing strings
+
+    >>> from pyspark.sql.functions import approx_count_distinct
+    >>> df = spark.createDataFrame(["apple", "orange", "apple", "banana"], "string").toDF("fruit")
+    >>> df.agg(approx_count_distinct("fruit").alias('distinct_fruits')).show()
+    +---------------+
+    |distinct_fruits|
+    +---------------+
+    |              3|
+    +---------------+
+
+    Example 3: Counting distinct values in a DataFrame with multiple columns
+
+    >>> from pyspark.sql.functions import approx_count_distinct, struct
+    >>> df = spark.createDataFrame([("Alice", 1),
+    ...                             ("Alice", 2),
+    ...                             ("Bob", 3),
+    ...                             ("Bob", 3)], ["name", "value"])
+    >>> df = df.withColumn("combined", struct("name", "value"))
+    >>> df.agg(approx_count_distinct(df["combined"]).alias('distinct_pairs')).show()
+    +--------------+
+    |distinct_pairs|
+    +--------------+
+    |             3|
+    +--------------+
+
+    Example 4: Counting distinct values with a specified relative standard deviation
+
+    >>> from pyspark.sql.functions import approx_count_distinct
+    >>> df = spark.range(100000)
+    >>> df.agg(approx_count_distinct("id", 0.1).alias('distinct_values')).show()
Review Comment:
can we compare the results of approx_count_distinct when 1) using the
default rsd and 2) using 0.1 in this example?
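A sketch of what that could look like (the counts are estimates, so the output is omitted here):

>>> df = spark.range(100000)
>>> df.agg(
...     approx_count_distinct("id").alias('with_default_rsd'),
...     approx_count_distinct("id", 0.1).alias('with_rsd_0_1')
... ).show()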
##########
python/pyspark/sql/functions.py:
##########
@@ -3669,38 +3669,83 @@ def approxCountDistinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Col
 @try_remote_functions
 def approx_count_distinct(col: "ColumnOrName", rsd: Optional[float] = None) -> Column:
-    """Aggregate function: returns a new :class:`~pyspark.sql.Column` for approximate distinct count
-    of column `col`.
+    """
+    Applies an aggregate function to return an approximate distinct count of the specified column.

-    .. versionadded:: 2.1.0
+    This function returns a new :class:`~pyspark.sql.Column` that estimates the number of distinct
+    elements in a column or a group of columns.

-    .. versionchanged:: 3.4.0
-        Supports Spark Connect.
+    .. versionadded:: 2.1.0

     .. versionchanged:: 3.4.0
         Supports Spark Connect.

     Parameters
     ----------
     col : :class:`~pyspark.sql.Column` or str
+        The label of the column to count distinct values in.
     rsd : float, optional
-        maximum relative standard deviation allowed (default = 0.05).
-        For rsd < 0.01, it is more efficient to use :func:`count_distinct`
+        The maximum allowed relative standard deviation (default = 0.05).
+        If rsd < 0.01, it would be more efficient to use :func:`count_distinct`.

     Returns
     -------
     :class:`~pyspark.sql.Column`
-        the column of computed results.
+        A new Column object representing the approximate unique count.
+
+    See Also
+    ----------
+    :meth:`pyspark.sql.functions.count_distinct`

     Examples
     --------
+    Example 1: Counting distinct values in a single column DataFrame representing integers
+
+    >>> from pyspark.sql.functions import approx_count_distinct
Review Comment:
yea I would also prefer `from pyspark.sql.functions import
approx_count_distinct`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]