Repository: spark
Updated Branches:
  refs/heads/master c8e934ef3 -> 406f6d307


SPARK-5458. Refer to aggregateByKey instead of combineByKey in docs

Author: Sandy Ryza <[email protected]>

Closes #4251 from sryza/sandy-spark-5458 and squashes the following commits:

460827a [Sandy Ryza] Python too
d2dc160 [Sandy Ryza] SPARK-5458. Refer to aggregateByKey instead of combineByKey in docs


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/406f6d30
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/406f6d30
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/406f6d30

Branch: refs/heads/master
Commit: 406f6d3070441962222f6a25449ea2c48f52ce88
Parents: c8e934e
Author: Sandy Ryza <[email protected]>
Authored: Wed Jan 28 12:41:23 2015 -0800
Committer: Patrick Wendell <[email protected]>
Committed: Wed Jan 28 12:41:23 2015 -0800

----------------------------------------------------------------------
 docs/programming-guide.md | 2 +-
 python/pyspark/rdd.py     | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/406f6d30/docs/programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index 2443fc2..6486614 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -886,7 +886,7 @@ for details.
   <td> <b>groupByKey</b>([<i>numTasks</i>]) </td>
   <td> When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. <br />
     <b>Note:</b> If you are grouping in order to perform an aggregation (such as a sum or
-    average) over each key, using <code>reduceByKey</code> or <code>combineByKey</code> will yield much better
+    average) over each key, using <code>reduceByKey</code> or <code>aggregateByKey</code> will yield much better
     performance.
   <br />
     <b>Note:</b> By default, the level of parallelism in the output depends on the number of
     partitions of the parent RDD.


http://git-wip-us.apache.org/repos/asf/spark/blob/406f6d30/python/pyspark/rdd.py
----------------------------------------------------------------------
diff --git a/python/pyspark/rdd.py b/python/pyspark/rdd.py
index f4cfe48..efd2f35 100644
--- a/python/pyspark/rdd.py
+++ b/python/pyspark/rdd.py
@@ -1634,8 +1634,8 @@ class RDD(object):
         Hash-partitions the resulting RDD with into numPartitions partitions.
 
         Note: If you are grouping in order to perform an aggregation (such as a
-        sum or average) over each key, using reduceByKey will provide much
-        better performance.
+        sum or average) over each key, using reduceByKey or aggregateByKey will
+        provide much better performance.
 
         >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
         >>> map((lambda (x,y): (x, list(y))), sorted(x.groupByKey().collect()))

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
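The docs change above hinges on why aggregateByKey beats groupByKey for aggregations: values are folded into one small accumulator per key on the map side, so only accumulators, not every value, cross the shuffle. The pure-Python sketch below mimics that seqFunc/combFunc contract over explicit "partitions" (pyspark is not assumed installed; `aggregate_by_key` and its inputs are illustrative names, not Spark API):

```python
def aggregate_by_key(partitions, zero, seq_func, comb_func):
    """Mimic RDD.aggregateByKey over a list of partitions of (key, value) pairs."""
    # Map side: fold each value into a per-key accumulator within its partition.
    per_partition = []
    for part in partitions:
        accs = {}
        for k, v in part:
            accs[k] = seq_func(accs.get(k, zero), v)
        per_partition.append(accs)
    # "Shuffle" side: only one accumulator per key per partition is merged,
    # instead of shipping every individual value as groupByKey would.
    merged = {}
    for accs in per_partition:
        for k, acc in accs.items():
            merged[k] = comb_func(merged[k], acc) if k in merged else acc
    return merged

# Per-key average, the use case the docs mention: accumulate (sum, count).
parts = [[("a", 1), ("b", 1)], [("a", 1)]]
sums = aggregate_by_key(
    parts, (0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),      # seqFunc: fold a value in
    lambda a, b: (a[0] + b[0], a[1] + b[1]),      # combFunc: merge accumulators
)
print(sums)  # {'a': (2, 2), 'b': (1, 1)}
```

With the real API the call would be `rdd.aggregateByKey((0, 0), seq_func, comb_func)`; the point is the same: the zero value and two functions let Spark combine before shuffling, which groupByKey cannot do.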
