[ https://issues.apache.org/jira/browse/SPARK-35622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17362683#comment-17362683 ]
Hyukjin Kwon commented on SPARK-35622:
--------------------------------------
IIRC, it already works the same as, or similarly to, RDD's count. If
df.groupby().count() does an extra shuffle, we should add an optimizer rule
for this pattern instead. DataFrame operations should keep using Spark SQL
plans so that we can optimize them further in more complicated patterns.
[~xiepengjie], if you observe a performance difference, it would be great to
show the actual measurements.
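One way to gather such measurements is a small timing helper. The sketch below
is only an illustration and is not part of Spark: `time_action` is a
hypothetical name, and the helper is written in plain Python so it runs without
a Spark session. It assumes each action is a zero-argument callable and reports
the median wall-clock time over several runs.

```python
import statistics
import time


def time_action(action, repeats=5, warmup=1):
    """Return (median_seconds, last_result) for a zero-argument callable.

    Hypothetical helper for illustration; not a Spark API.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    # Warm-up runs are discarded so one-time costs do not skew the numbers.
    for _ in range(warmup):
        action()
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = action()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations), result


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without a Spark session.
    rows = list(range(100_000))
    seconds, n = time_action(lambda: sum(1 for _ in rows))
    print(f"counted {n} rows, median {seconds:.6f}s")
```

With a real DataFrame `df` (assumed here, not defined above), the two calls
from this issue could be compared as `time_action(df.count)` and
`time_action(df.rdd.count)`. Both should be run against the same uncached
input, since caching or a warm file system can easily dominate the difference.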
> DataFrame's count function do not need groupBy and avoid shuffle
> ----------------------------------------------------------------
>
> Key: SPARK-35622
> URL: https://issues.apache.org/jira/browse/SPARK-35622
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.2
> Reporter: xiepengjie
> Priority: Major
>
> Use `df.rdd.count()` instead of `df.count()`.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)