Repository: spark
Updated Branches:
  refs/heads/master c8e934ef3 -> 406f6d307


SPARK-5458. Refer to aggregateByKey instead of combineByKey in docs

Author: Sandy Ryza <[email protected]>

Closes #4251 from sryza/sandy-spark-5458 and squashes the following commits:

460827a [Sandy Ryza] Python too
d2dc160 [Sandy Ryza] SPARK-5458. Refer to aggregateByKey instead of combineByKey in docs


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/406f6d30
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/406f6d30
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/406f6d30

Branch: refs/heads/master
Commit: 406f6d3070441962222f6a25449ea2c48f52ce88
Parents: c8e934e
Author: Sandy Ryza <[email protected]>
Authored: Wed Jan 28 12:41:23 2015 -0800
Committer: Patrick Wendell <[email protected]>
Committed: Wed Jan 28 12:41:23 2015 -0800

----------------------------------------------------------------------
 docs/programming-guide.md | 2 +-
 python/pyspark/rdd.py     | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/406f6d30/docs/programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index 2443fc2..6486614 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -886,7 +886,7 @@ for details.
   <td> <b>groupByKey</b>([<i>numTasks</i>]) </td>
   <td> When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. <br />
     <b>Note:</b> If you are grouping in order to perform an aggregation (such as a sum or
-    average) over each key, using <code>reduceByKey</code> or <code>combineByKey</code> will yield much better
+    average) over each key, using <code>reduceByKey</code> or <code>aggregateByKey</code> will yield much better
     performance.
   <br />
     <b>Note:</b> By default, the level of parallelism in the output depends on the number of
     partitions of the parent RDD.


http://git-wip-us.apache.org/repos/asf/spark/blob/406f6d30/python/pyspark/rdd.py
----------------------------------------------------------------------
diff --git a/python/pyspark/rdd.py b/python/pyspark/rdd.py
index f4cfe48..efd2f35 100644
--- a/python/pyspark/rdd.py
+++ b/python/pyspark/rdd.py
@@ -1634,8 +1634,8 @@ class RDD(object):
         Hash-partitions the resulting RDD with into numPartitions partitions.
 
         Note: If you are grouping in order to perform an aggregation (such as a
-        sum or average) over each key, using reduceByKey will provide much
-        better performance.
+        sum or average) over each key, using reduceByKey or aggregateByKey will
+        provide much better performance.
 
         >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
         >>> map((lambda (x,y): (x, list(y))), sorted(x.groupByKey().collect()))

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
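The docs change above hinges on why aggregateByKey beats groupByKey for aggregations: values are folded into one small accumulator per key on the map side, so only accumulators, not every value, cross the shuffle. The pure-Python sketch below mimics that seqFunc/combFunc contract over explicit "partitions" (pyspark is not assumed installed; `aggregate_by_key` and its inputs are illustrative names, not Spark API):

```python
def aggregate_by_key(partitions, zero, seq_func, comb_func):
    """Mimic RDD.aggregateByKey over a list of partitions of (key, value) pairs."""
    # Map side: fold each value into a per-key accumulator within its partition.
    per_partition = []
    for part in partitions:
        accs = {}
        for k, v in part:
            accs[k] = seq_func(accs.get(k, zero), v)
        per_partition.append(accs)
    # "Shuffle" side: only one accumulator per key per partition is merged,
    # instead of shipping every individual value as groupByKey would.
    merged = {}
    for accs in per_partition:
        for k, acc in accs.items():
            merged[k] = comb_func(merged[k], acc) if k in merged else acc
    return merged

# Per-key average, the use case the docs mention: accumulate (sum, count).
parts = [[("a", 1), ("b", 1)], [("a", 1)]]
sums = aggregate_by_key(
    parts, (0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),      # seqFunc: fold a value in
    lambda a, b: (a[0] + b[0], a[1] + b[1]),      # combFunc: merge accumulators
)
print(sums)  # {'a': (2, 2), 'b': (1, 1)}
```

With the real API the call would be `rdd.aggregateByKey((0, 0), seq_func, comb_func)`; the point is the same: the zero value and two functions let Spark combine before shuffling, which groupByKey cannot do.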
