Repository: spark
Updated Branches:
  refs/heads/branch-1.0 2ec7d7ab7 -> 0e2727959


Add/increase severity of warning in documentation of groupBy()

groupBy()/groupByKey() is notorious for being a very convenient API that can 
lead to poor performance when used incorrectly.

This PR just makes it clear that users should be cautious not to rely on this 
API when they really want a different (more performant) one, such as 
reduceByKey().

(Note that one source of confusion is the name; this groupBy() is not the same 
as SQL's GROUP BY, which is used for aggregation and is closer in nature to 
Spark's reduceByKey().)
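To make the performance difference concrete, here is a minimal sketch using plain Scala collections (hypothetical data, no SparkContext or real RDD involved). The groupBy-style path materializes every value for a key before summing, while the reduceByKey-style path folds each value into a running total as it arrives, which is what lets Spark combine on the map side:

```scala
val pairs = Seq(("a", 1), ("a", 2), ("b", 3))  // hypothetical (key, value) data

// groupBy-style: materialize the full collection of values per key, then sum it.
val viaGroup: Map[String, Int] =
  pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

// reduceByKey-style: fold each value into a running sum, never holding
// all values for a key in memory at once.
val viaReduce: Map[String, Int] =
  pairs.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
    acc.updated(k, acc.getOrElse(k, 0) + v)
  }

// Both agree on the result; only the intermediate memory footprint differs.
assert(viaGroup == viaReduce)
```

On a real cluster the difference is larger than this local sketch suggests: groupByKey also shuffles every value across the network, whereas reduceByKey ships only one partial sum per key per partition.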

Author: Aaron Davidson <[email protected]>

Closes #1380 from aarondav/warning and squashes the following commits:

f60da39 [Aaron Davidson] Give better advice
d0afb68 [Aaron Davidson] Add/increase severity of warning in documentation of groupBy()
(cherry picked from commit a2aa7bebae31e1e7ec23d31aaa436283743b283b)

Signed-off-by: Patrick Wendell <[email protected]>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0e272795
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0e272795
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0e272795

Branch: refs/heads/branch-1.0
Commit: 0e2727959a4c2eac41bb6ec70054a1e467637099
Parents: 2ec7d7a
Author: Aaron Davidson <[email protected]>
Authored: Mon Jul 14 23:38:12 2014 -0700
Committer: Patrick Wendell <[email protected]>
Committed: Mon Jul 14 23:38:24 2014 -0700

----------------------------------------------------------------------
 .../org/apache/spark/rdd/PairRDDFunctions.scala   | 18 +++++++++---------
 .../src/main/scala/org/apache/spark/rdd/RDD.scala | 12 ++++++++++++
 2 files changed, 21 insertions(+), 9 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/0e272795/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
----------------------------------------------------------------------
diff --git a/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala b/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
index c405641..0d3793d 100644
--- a/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
+++ b/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
@@ -265,9 +265,9 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
   * Group the values for each key in the RDD into a single sequence. Allows controlling the
   * partitioning of the resulting key-value pair RDD by passing a Partitioner.
   *
-   * Note: If you are grouping in order to perform an aggregation (such as a sum or average) over
-   * each key, using [[PairRDDFunctions.reduceByKey]] or [[PairRDDFunctions.combineByKey]]
-   * will provide much better performance.
+   * Note: This operation may be very expensive. If you are grouping in order to perform an
+   * aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
+   * or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
   */
  def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = {
    // groupByKey shouldn't use map side combine because map side combine does not
@@ -285,9 +285,9 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
   * Group the values for each key in the RDD into a single sequence. Hash-partitions the
   * resulting RDD with into `numPartitions` partitions.
   *
-   * Note: If you are grouping in order to perform an aggregation (such as a sum or average) over
-   * each key, using [[PairRDDFunctions.reduceByKey]] or [[PairRDDFunctions.combineByKey]]
-   * will provide much better performance.
+   * Note: This operation may be very expensive. If you are grouping in order to perform an
+   * aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
+   * or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
   */
  def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])] = {
    groupByKey(new HashPartitioner(numPartitions))
@@ -374,9 +374,9 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
   * Group the values for each key in the RDD into a single sequence. Hash-partitions the
   * resulting RDD with the existing partitioner/parallelism level.
   *
-   * Note: If you are grouping in order to perform an aggregation (such as a sum or average) over
-   * each key, using [[PairRDDFunctions.reduceByKey]] or [[PairRDDFunctions.combineByKey]]
-   * will provide much better performance,
+   * Note: This operation may be very expensive. If you are grouping in order to perform an
+   * aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
+   * or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
   */
  def groupByKey(): RDD[(K, Iterable[V])] = {
    groupByKey(defaultPartitioner(self))
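
Since the new wording points users at aggregateByKey, a plain-Scala sketch of its seqOp/combOp contract may help (hypothetical data and partitioning; plain collections stand in for partitions). Computing a per-key average this way folds values into a small (sum, count) accumulator per partition and then merges accumulators across partitions, so the full value list for a key is never materialized:

```scala
val pairs = Seq(("a", 1.0), ("a", 3.0), ("b", 4.0))  // hypothetical (key, value) data

val zero = (0.0, 0L)  // accumulator: (running sum, count)
// seqOp folds one value into a partition-local accumulator.
val seqOp = (acc: (Double, Long), v: Double) => (acc._1 + v, acc._2 + 1)
// combOp merges two accumulators for the same key across partitions.
val combOp = (x: (Double, Long), y: (Double, Long)) => (x._1 + y._1, x._2 + y._2)

// Simulate two partitions, each folding its own values with seqOp.
val partitions = pairs.grouped(2).toSeq
val perPartition = partitions.map(_.foldLeft(Map.empty[String, (Double, Long)]) {
  case (m, (k, v)) => m.updated(k, seqOp(m.getOrElse(k, zero), v))
})

// Merge the per-partition maps with combOp.
val merged = perPartition.reduce { (m1, m2) =>
  m2.foldLeft(m1) { case (m, (k, acc)) =>
    m.updated(k, m.get(k).map(combOp(_, acc)).getOrElse(acc))
  }
}

val averages = merged.map { case (k, (s, n)) => (k, s / n) }
```

In real Spark the equivalent call would be along the lines of `rdd.aggregateByKey(zero)(seqOp, combOp)`; the point of the doc change above is that the accumulators stay constant-size, unlike the per-key Iterables that groupByKey builds.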

http://git-wip-us.apache.org/repos/asf/spark/blob/0e272795/core/src/main/scala/org/apache/spark/rdd/RDD.scala
----------------------------------------------------------------------
diff --git a/core/src/main/scala/org/apache/spark/rdd/RDD.scala b/core/src/main/scala/org/apache/spark/rdd/RDD.scala
index a2868e9..e036c53 100644
--- a/core/src/main/scala/org/apache/spark/rdd/RDD.scala
+++ b/core/src/main/scala/org/apache/spark/rdd/RDD.scala
@@ -488,6 +488,10 @@ abstract class RDD[T: ClassTag](
  /**
   * Return an RDD of grouped items. Each group consists of a key and a sequence of elements
   * mapping to that key.
+   *
+   * Note: This operation may be very expensive. If you are grouping in order to perform an
+   * aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
+   * or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
   */
  def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] =
    groupBy[K](f, defaultPartitioner(this))
@@ -495,6 +499,10 @@ abstract class RDD[T: ClassTag](
  /**
   * Return an RDD of grouped elements. Each group consists of a key and a sequence of elements
   * mapping to that key.
+   *
+   * Note: This operation may be very expensive. If you are grouping in order to perform an
+   * aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
+   * or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
   */
  def groupBy[K](f: T => K, numPartitions: Int)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] =
    groupBy(f, new HashPartitioner(numPartitions))
@@ -502,6 +510,10 @@ abstract class RDD[T: ClassTag](
  /**
   * Return an RDD of grouped items. Each group consists of a key and a sequence of elements
   * mapping to that key.
+   *
+   * Note: This operation may be very expensive. If you are grouping in order to perform an
+   * aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
+   * or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
   */
  def groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null)
      : RDD[(K, Iterable[T])] = {
