[jira] [Commented] (SPARK-35622) DataFrame's count function do not need groupBy and avoid shuffle

Hyukjin Kwon (Jira) Sun, 13 Jun 2021 19:33:08 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-35622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17362683#comment-17362683
 ]


Hyukjin Kwon commented on SPARK-35622:
--------------------------------------

IIRC, it already works the same as, or similarly to, RDD's count. If 
df.groupby().count() does an extra shuffle, it would be better to add an 
optimizer rule for this pattern. DataFrame should keep using Spark SQL plans 
so that we can optimize further in more complicated patterns.
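
To make the comparison concrete, here is a toy model in plain Python (not 
Spark code) of the two counting strategies. It assumes the usual shape of a 
global aggregate: every task counts its own partition, and only one small 
partial count per partition moves afterwards. All function names are 
hypothetical and chosen for illustration only.

```python
from typing import List, Sequence, Tuple


def partition_counts(partitions: Sequence[Sequence[object]]) -> List[int]:
    """First stage in both strategies: each task counts only its own partition."""
    return [len(p) for p in partitions]


def rdd_style_count(partitions: Sequence[Sequence[object]]) -> int:
    # RDD-style (assumed model): the per-partition counts are returned to the
    # driver, which sums them.
    return sum(partition_counts(partitions))


def dataframe_style_count(partitions: Sequence[Sequence[object]]) -> Tuple[int, int]:
    # DataFrame-style global aggregate (assumed model): partial counts are
    # exchanged to a single reducer, which emits the final count.
    partial = partition_counts(partitions)
    exchanged_rows = len(partial)  # one small row per partition crosses the exchange
    return sum(partial), exchanged_rows


# Four hypothetical partitions holding nine rows in total.
partitions = [[1, 2, 3], [4, 5], [], [6, 7, 8, 9]]
```

In this model both strategies read every row exactly once, and the exchange 
carries one row per partition rather than the data itself, which is why a 
measurement is needed before concluding that the shuffle is the bottleneck.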

[~xiepengjie], it would be great if you could show the actual performance 
difference, if you observe one.
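
One possible way to collect such numbers is a small timing helper like the 
sketch below. The helper name `median_seconds` and its parameters are 
hypothetical; it uses only the Python standard library and times any 
zero-argument callable, so the stand-in workload here runs without a Spark 
session.

```python
import statistics
import time
from typing import Callable, List


def median_seconds(action: Callable[[], object],
                   repeats: int = 5,
                   warmup: int = 1) -> float:
    """Run `action` a few times and return the median wall-clock duration."""
    # Warm-up runs are discarded so one-off startup costs do not skew the result.
    for _ in range(warmup):
        action()
    samples: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload (an assumption, not a Spark job) so the sketch is runnable.
baseline = median_seconds(lambda: sum(range(100_000)))
```

With a live session, the same helper could be called with 
`lambda: df.count()` and `lambda: df.rdd.count()` on the same uncached 
DataFrame, and the two medians reported on this ticket.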

> DataFrame's count function do not need groupBy and avoid shuffle
> ----------------------------------------------------------------
>
>                 Key: SPARK-35622
>                 URL: https://issues.apache.org/jira/browse/SPARK-35622
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.2
>            Reporter: xiepengjie
>            Priority: Major
>
> Use `df.rdd.count()` to replace `df.count()`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
