Re: [PR] [SPARK-48728][SQL] Support ignoreNulls for collect_list and collect_set [spark]

via GitHub Mon, 23 Dec 2024 01:22:14 -0800


cloud-fan commented on code in PR #47149:
URL: https://github.com/apache/spark/pull/47149#discussion_r1895515932



##########
sql/api/src/main/scala/org/apache/spark/sql/functions.scala:
##########
@@ -354,7 +354,23 @@ object functions {
   def collect_list(columnName: String): Column = collect_list(Column(columnName))
 
   /**
-   * Aggregate function: returns a set of objects with duplicate elements eliminated.
+   * Aggregate function: returns a list of objects with duplicates.
+   *
+   * The parameter ignoreNulls controls if nulls should be excluded from the result.
+   *
+   * @note
+   *   The function is non-deterministic because the order of collected results depends on the
+   *   order of the rows which may be non-deterministic after a shuffle.
+   *
+   * @group agg_funcs
+   * @since 4.0.0
+   */
+  def collect_list(e: Column, ignoreNulls: Column): Column =

Review Comment:
   Since the SQL syntax requires users to explicitly specify IGNORE/RESPECT NULLS, this `ignoreNulls` flag must be a constant. Do we really want to allow `Column` as the type of the `ignoreNulls` parameter?
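
   To illustrate the alternative this comment hints at, here is a minimal sketch of a flag typed as a plain `Boolean`, so it can only ever be a constant at the call site. This mirrors the shape of the existing `first(e: Column, ignoreNulls: Boolean)` overload in `functions.scala`. The object `CollectListSketch` and the helper `collectList` are hypothetical names that model the semantics on an ordinary Scala `Seq`; they are not Spark code and not the signature proposed in this PR.

```scala
// Hypothetical sketch, not Spark's API: models ignoreNulls as a constant Boolean
// over a plain Scala collection instead of a Column expression.
object CollectListSketch {

  // Collects values in encounter order, keeping duplicates.
  // Nulls are dropped only when ignoreNulls is true.
  def collectList[T <: AnyRef](values: Seq[T], ignoreNulls: Boolean): Seq[T] =
    if (ignoreNulls) values.filter(_ != null) else values

  def main(args: Array[String]): Unit = {
    val rows: Seq[String] = Seq("a", null, "b", "a")
    println(collectList(rows, ignoreNulls = true))  // List(a, b, a)
    println(collectList(rows, ignoreNulls = false)) // List(a, null, b, a)
  }
}
```

   With a `Boolean` parameter the compiler rejects a per-row expression outright, whereas a `Column` parameter would presumably need a runtime check that the argument is a foldable literal.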



