[jira] [Commented] (SPARK-34464) `first` function is sorting the dataset while sometimes it is used to get "any value"

Pablo Langa Blanco (Jira) Tue, 06 Apr 2021 11:55:26 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-34464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315782#comment-17315782
 ]


Pablo Langa Blanco commented on SPARK-34464:
--------------------------------------------

Hi [~lfruleux] ,

Thank you for reporting it! I think it's a good point to try to improve. Yeah, 
Last and other functions are in the same situation, so if we go forward with 
this changes we could apply the same to this functions. All help is welcome! :D

> `first` function is sorting the dataset while sometimes it is used to get 
> "any value"
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-34464
>                 URL: https://issues.apache.org/jira/browse/SPARK-34464
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Louis Fruleux
>            Priority: Minor
>
> When one wants to groupBy and take any value (not necessarily the first), one 
> usually uses 
> [first|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L485]
>  aggregation function.
> Unfortunately, this method uses a `SortAggregate` for some data types, which 
> is not always necessary and might impact performances. Is this the desired 
> behavior?
>  
>  
> {code:java}
> Current behavior:
>  val df = Seq((0, "value")).toDF("key", "value")
> df.groupBy("key").agg(first("value")).explain()
>  /*
>  == Physical Plan ==
>  SortAggregate(key=key#342, functions=first(value#343, false))
>  +- *(2) Sort key#342 ASC NULLS FIRST, false, 0
>     +- Exchange hashpartitioning(key#342, 200)
>        +- SortAggregate(key=key#342, functions=partial_first(value#343, 
> false))
>           +- *(1) Sort key#342 ASC NULLS FIRST, false, 0
>              +- LocalTableScan key#342, value#343
>  */
> {code}
>  
> My understanding of the source code does not allow me to fully understand why 
> this is the current behavior.
> The solution might be to implement a new aggregate function. But the code 
> would be highly similar to the first one. And if I don't fully understand why 
> this 
> [createAggregate|https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala#L45]
>  method falls back to SortAggregate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SPARK-34464) `first` function is sorting the dataset while sometimes it is used to get "any value"

Reply via email to