itholic commented on code in PR #37662:
URL: https://github.com/apache/spark/pull/37662#discussion_r955805923


##########
python/pyspark/sql/functions.py:
##########
@@ -2189,6 +2355,25 @@ def broadcast(df: DataFrame) -> DataFrame:
     Marks a DataFrame as small enough for use in broadcast joins.
 
     .. versionadded:: 1.6.0
+
+    Returns
+    -------
+    :class:`~pyspark.sql.DataFrame`
+        DataFrame marked as ready for broadcast join.
+
+    Examples
+    --------
+    >>> from pyspark.sql import types
+    >>> df = spark.createDataFrame([1, 2, 3, 3, 4], types.IntegerType())
+    >>> df_small = spark.range(3)
+    >>> df_b = broadcast(df_small)
+    >>> df.join(df_b, df.value == df_small.id).show()

Review Comment:
   What about using `explain(True)` to explicitly show that `broadcast` is used as the strategy for `ResolvedHint`??
   
   ```python
   >>> df.join(df_b, df.value == df_small.id).explain(True)
   == Parsed Logical Plan ==
   Join Inner, (cast(value#267 as bigint) = id#269L)
   :- LogicalRDD [value#267], false
   +- ResolvedHint (strategy=broadcast)
      +- Range (0, 3, step=1, splits=Some(16))
   
   == Analyzed Logical Plan ==
   value: int, id: bigint
   Join Inner, (cast(value#267 as bigint) = id#269L)
   :- LogicalRDD [value#267], false
   +- ResolvedHint (strategy=broadcast)
      +- Range (0, 3, step=1, splits=Some(16))
   
   == Optimized Logical Plan ==
   Join Inner, (cast(value#267 as bigint) = id#269L), rightHint=(strategy=broadcast)
   :- Filter isnotnull(value#267)
   :  +- LogicalRDD [value#267], false
   +- Range (0, 3, step=1, splits=Some(16))
   
   == Physical Plan ==
   AdaptiveSparkPlan isFinalPlan=false
   +- BroadcastHashJoin [cast(value#267 as bigint)], [id#269L], Inner, BuildRight, false
      :- Filter isnotnull(value#267)
      :  +- Scan ExistingRDD[value#267]
      +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]),false), [plan_id=164]
         +- Range (0, 3, step=1, splits=16)
   ```
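For readers less familiar with the physical plan above: the `BroadcastHashJoin` node boils down to a build-and-probe scheme, where the small (broadcast) side is materialized into a hash table and the large side streams through it. A rough plain-Python sketch of that idea (illustrative only, not Spark internals; the helper name `broadcast_hash_join` is made up):

```python
def broadcast_hash_join(large, small, large_key, small_key):
    # Build phase: hash the (broadcast) small side by its join key.
    built = {}
    for row in small:
        built.setdefault(row[small_key], []).append(row)
    # Probe phase: stream the large side, emitting matched pairs.
    out = []
    for row in large:
        for match in built.get(row[large_key], []):
            out.append({**row, **match})
    return out

# Same data as the docstring example: values 1..4 joined against range(3).
large = [{"value": 1}, {"value": 2}, {"value": 3}, {"value": 3}, {"value": 4}]
small = [{"id": 0}, {"id": 1}, {"id": 2}]
joined = broadcast_hash_join(large, small, "value", "id")
```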



##########
python/pyspark/sql/functions.py:
##########
@@ -2440,6 +2804,38 @@ def last(col: "ColumnOrName", ignorenulls: bool = False) -> Column:
     -----
     The function is non-deterministic because its results depend on the order of the
     rows, which may be non-deterministic after a shuffle.
+
+    Parameters
+    ----------
+    col : :class:`~pyspark.sql.Column` or str
+        column to fetch last value for.
+    ignorenulls : bool
+        if last value is null then look for non-null value.
+
+    Returns
+    -------
+    :class:`~pyspark.sql.Column`
+        last value of the group.
+
+    Examples
+    --------
+    >>> df = spark.createDataFrame([("Alice", 2), ("Bob", 5), ("Alice", None)], ("name", "age"))
+    >>> df = df.orderBy(df.age.desc())
+    >>> df.groupby("name").agg(last("age")).orderBy("name").show()
+    +-----+---------+
+    | name|last(age)|
+    +-----+---------+
+    |Alice|     null|
+    |  Bob|        5|
+    +-----+---------+
+
+    >>> df.groupby("name").agg(last("age", True)).orderBy("name").show()

Review Comment:
   Here too, can we add a brief description of why we set `ignorenulls` to `True`?
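To make that description concrete, the two behaviours being documented could be sketched in plain Python (illustrative only, not Spark code; `last_value` is a made-up helper): without `ignorenulls` the null that sorts last wins, while `ignorenulls=True` returns the last non-null value.

```python
def last_value(values, ignorenulls=False):
    # Mimic last(): take the last element in row order; with ignorenulls=True,
    # skip nulls and take the last non-null element instead.
    candidates = [v for v in values if v is not None] if ignorenulls else list(values)
    return candidates[-1] if candidates else None

# Alice's ages as ordered in the example above (2 first, then the null).
alice_ages = [2, None]
print(last_value(alice_ages))        # the null comes last, so last() is null
print(last_value(alice_ages, True))  # 2: the last non-null value
```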



##########
python/pyspark/sql/functions.py:
##########
@@ -2377,6 +2672,16 @@ def grouping_id(*cols: "ColumnOrName") -> Column:
     The list of columns should match with grouping columns exactly, or empty (means all
     the grouping columns).
 
+    Parameters
+    ----------
+    cols : :class:`~pyspark.sql.Column` or str
+        columns to check for.
+
+    Returns
+    -------
+    :class:`~pyspark.sql.Column`
+        the level of grouping it relates to.
+
     Examples
     --------
     >>> df.cube("name").agg(grouping_id(), sum("age")).orderBy("name").show()

Review Comment:
   Here too, can we just remove the existing example since we now have an improved one?
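Whichever example stays, a one-line intuition might help readers: `grouping_id` packs one `grouping()` bit per column into an integer, with the leftmost grouping column in the most significant bit. A rough plain-Python sketch of that bit layout (illustrative only; `grouping_id_bits` is a made-up name):

```python
def grouping_id_bits(grouping_cols, current_grouping_set):
    # grouping(c) is 1 when column c is aggregated away (i.e. not part of the
    # current grouping set); bits are packed left to right, MSB first.
    gid = 0
    for col in grouping_cols:
        gid = (gid << 1) | (0 if col in current_grouping_set else 1)
    return gid

# cube("name", "age") produces four grouping sets:
print(grouping_id_bits(["name", "age"], {"name", "age"}))  # 0: group by both
print(grouping_id_bits(["name", "age"], {"name"}))         # 1: age rolled up
print(grouping_id_bits(["name", "age"], {"age"}))          # 2: name rolled up
print(grouping_id_bits(["name", "age"], set()))            # 3: grand total
```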



##########
python/pyspark/sql/functions.py:
##########
@@ -2301,13 +2532,46 @@ def count_distinct(col: "ColumnOrName", *cols: "ColumnOrName") -> Column:
 
     .. versionadded:: 3.2.0
 
+    Parameters
+    ----------
+    col : :class:`~pyspark.sql.Column` or str
+        first column to compute on.
+    cols : :class:`~pyspark.sql.Column` or str
+        other columns to compute on.
+
+    Returns
+    -------
+    :class:`~pyspark.sql.Column`
+        the number of distinct values in the given columns.
+
     Examples
     --------
     >>> df.agg(count_distinct(df.age, df.name).alias('c')).collect()
     [Row(c=2)]
 
     >>> df.agg(count_distinct("age", "name").alias('c')).collect()
     [Row(c=2)]

Review Comment:
   Maybe we can just remove the existing example, since we now have a better one?
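Either way, the semantics might deserve a note: assuming Spark follows SQL-style `COUNT(DISTINCT col1, col2, ...)` here, tuples containing a null are skipped before counting. A plain-Python sketch of that assumed behaviour (illustrative only; `count_distinct_rows` is a made-up helper):

```python
def count_distinct_rows(rows):
    # Count distinct tuples, skipping any tuple that contains a null,
    # mirroring SQL COUNT(DISTINCT ...) semantics (assumption, see above).
    return len({row for row in rows if None not in row})

# Two distinct non-null (name, age) pairs; the null row is not counted.
people = [("Alice", 2), ("Bob", 5), ("Alice", 2), ("Bob", None)]
print(count_distinct_rows(people))
```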



##########
python/pyspark/sql/functions.py:
##########
@@ -2329,13 +2593,34 @@ def first(col: "ColumnOrName", ignorenulls: bool = False) -> Column:
     The function is non-deterministic because its results depend on the order of the
     rows, which may be non-deterministic after a shuffle.
 
+    Parameters
+    ----------
+    col : :class:`~pyspark.sql.Column` or str
+        column to fetch first value for.
+    ignorenulls : bool
+        if first value is null then look for first non-null value.
+
+    Returns
+    -------
+    :class:`~pyspark.sql.Column`
+        first value of the group.
+
     Examples
     --------
-    >>> df = spark.createDataFrame([("Alice", 2), ("Bob", 5)], ("name", "age"))
+    >>> df = spark.createDataFrame([("Alice", 2), ("Bob", 5), ("Alice", None)], ("name", "age"))
+    >>> df = df.orderBy(df.age)
     >>> df.groupby("name").agg(first("age")).orderBy("name").show()
     +-----+----------+
     | name|first(age)|
     +-----+----------+
+    |Alice|      null|
+    |  Bob|         5|
+    +-----+----------+
+
+    >>> df.groupby("name").agg(first("age", True)).orderBy("name").show()

Review Comment:
   Can we add a short description to this example of why we set `ignorenulls` to `True`??
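The crux for that description is ordering: with the ascending sort in the example, Spark places nulls first by default, so plain `first()` picks up Alice's null while `ignorenulls=True` skips ahead to her first non-null age. A plain-Python sketch of that behaviour (illustrative only, not Spark code; `first_value` is a made-up helper):

```python
def first_value(values, ignorenulls=False):
    # Mimic first(): return the first element in row order; with
    # ignorenulls=True, return the first non-null element instead.
    for v in values:
        if not ignorenulls or v is not None:
            return v
    return None

# Alice's ages as df.orderBy(df.age) would order them (nulls sort first,
# assuming the default nulls-first ascending order).
alice_ages = [None, 2]
print(first_value(alice_ages))        # the null comes first
print(first_value(alice_ages, True))  # 2: the first non-null value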



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

