Repository: spark
Updated Branches:
  refs/heads/master 0760787da -> 538f22162


Document that groupByKey will OOM for large keys

This pull request is my own work and I license it under Spark's open-source license.

This contribution is an improvement to the documentation. I documented that the maximum number of values per key that groupByKey can handle is limited by available RAM (see [Databricks][datablox link] and [the Spark mailing list][list link]).

Just saying that better performance is available elsewhere is not sufficient: sometimes you need to do a group-by because your operation requires all of a key's values at once in order to complete (a sketch contrasting the two APIs appears after the links below). This warning explains the problem.

[datablox link]: http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
[list link]: http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-RDD-GroupBy-OutOfMemory-Exceptions-tp11427p11466.html
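
To make the trade-off concrete, here is a minimal sketch contrasting the two
APIs on a word count (the object name and input data are hypothetical, not part
of this commit). groupByKey ships every value for a key to a single executor
and must hold them all in memory; reduceByKey combines values map-side first,
so no key's complete value list is ever materialized:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    object GroupVsReduce {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("group-vs-reduce").setMaster("local[*]")
        val sc = new SparkContext(conf)
        val counts: RDD[(String, Int)] =
          sc.parallelize(Seq("a", "b", "a", "c", "a")).map(w => (w, 1))

        // Shuffles every (word, 1) pair; all values for one key must fit in a
        // single executor's memory, which is what the new doc note warns about.
        val viaGroup = counts.groupByKey().mapValues(_.sum)

        // Pre-aggregates on the map side, so far less data crosses the shuffle
        // and no key's full value list is held in memory at once.
        val viaReduce = counts.reduceByKey(_ + _)

        println(viaGroup.collect().toMap)  // Map(a -> 3, b -> 1, c -> 1)
        println(viaReduce.collect().toMap) // Map(a -> 3, b -> 1, c -> 1)
        sc.stop()
      }
    }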

Author: Eric Moyer <[email protected]>

Closes #3936 from RadixSeven/better-group-by-docs and squashes the following commits:

5b6f4e9 [Eric Moyer] groupByKey docs naming updates
238e81b [Eric Moyer] Doc that groupByKey will OOM for large keys


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/538f2216
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/538f2216
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/538f2216

Branch: refs/heads/master
Commit: 538f221627930c8f8a138c0d21d9fa09bc789e67
Parents: 0760787
Author: Eric Moyer <[email protected]>
Authored: Thu Jan 8 11:55:23 2015 -0800
Committer: Andrew Or <[email protected]>
Committed: Thu Jan 8 11:55:23 2015 -0800

----------------------------------------------------------------------
 .../src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala | 6 ++++++
 1 file changed, 6 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/538f2216/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
----------------------------------------------------------------------
diff --git a/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala b/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
index f8df5b2..38f8f36 100644
--- a/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
+++ b/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
@@ -437,6 +437,9 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
    * Note: This operation may be very expensive. If you are grouping in order to perform an
    * aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
    * or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
+   *
+   * Note: As currently implemented, groupByKey must be able to hold all the key-value pairs for any
+   * key in memory. If a key has too many values, it can result in an [[OutOfMemoryError]].
    */
   def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = {
     // groupByKey shouldn't use map side combine because map side combine does not
@@ -458,6 +461,9 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
    * Note: This operation may be very expensive. If you are grouping in order to perform an
    * aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
    * or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
+   *
+   * Note: As currently implemented, groupByKey must be able to hold all the key-value pairs for any
+   * key in memory. If a key has too many values, it can result in an [[OutOfMemoryError]].
    */
   def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])] = {
     groupByKey(new HashPartitioner(numPartitions))
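
For the case the description calls out, where an operation genuinely needs all
of a key's values at once, here is a hedged sketch (the medianPerKey helper and
scores RDD are hypothetical, not part of this commit) of a per-key median,
which reduceByKey cannot express directly. The caveat added in this commit
still applies: a key with too many values can exhaust executor memory.

    import org.apache.spark.rdd.RDD

    // A median needs every value for its key at once, so the grouping is
    // unavoidable here. Per the doc note above, if any single key has more
    // values than fit in memory, this can throw OutOfMemoryError.
    def medianPerKey(scores: RDD[(String, Double)]): RDD[(String, Double)] =
      scores.groupByKey().mapValues { vs =>
        val sorted = vs.toIndexedSeq.sorted
        sorted(sorted.size / 2)
      }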

