Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1380#discussion_r14851298
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -353,9 +353,9 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
        * Group the values for each key in the RDD into a single sequence. Allows controlling the
        * partitioning of the resulting key-value pair RDD by passing a Partitioner.
        *
    -   * Note: If you are grouping in order to perform an aggregation (such as a sum or average) over
    -   * each key, using [[PairRDDFunctions.reduceByKey]] or [[PairRDDFunctions.combineByKey]]
    -   * will provide much better performance.
    +   * Note: This operation may be very expensive. If you are grouping in order to perform an
    --- End diff --
    
    It might actually be good to mention `aggregateByKey` first (and maybe drop the
    reference to `combineByKey`). I think `reduceByKey` and `aggregateByKey` together
    cover almost every use case.
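    
    To illustrate the point, a minimal sketch of the two recommended operations, assuming an existing `SparkContext` named `sc` (the data and variable names here are hypothetical, not from the patch):
    
    ```scala
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    
    // Per-key sum: reduceByKey combines values map-side before the shuffle,
    // so only partial sums cross the network.
    val sums = pairs.reduceByKey(_ + _)
    
    // Per-key average: aggregateByKey carries a (sum, count) accumulator per key,
    // again combining map-side; groupByKey would instead ship every value.
    val avgs = pairs
      .aggregateByKey((0, 0))(
        (acc, v) => (acc._1 + v, acc._2 + 1),       // fold one value into (sum, count)
        (a, b) => (a._1 + b._1, a._2 + b._2))       // merge two partial (sum, count)s
      .mapValues { case (sum, count) => sum.toDouble / count }
    ```
    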
