Repository: spark
Updated Branches:
  refs/heads/master 38877ccf3 -> 9c40b9ead


misleading task number of groupByKey

"By default, this uses only 8 parallel tasks to do the grouping." is a bit
misleading. Please refer to https://github.com/apache/spark/pull/389

The detail is shown in the following code:

  def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
    // Order the RDDs by partition count, largest first.
    val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.size).reverse
    // Reuse the partitioner of the largest RDD that already has one.
    for (r <- bySize if r.partitioner.isDefined) {
      return r.partitioner.get
    }
    // Otherwise fall back to spark.default.parallelism if it is set,
    // or else to the partition count of the largest RDD.
    if (rdd.context.conf.contains("spark.default.parallelism")) {
      new HashPartitioner(rdd.context.defaultParallelism)
    } else {
      new HashPartitioner(bySize.head.partitions.size)
    }
  }
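The fallback order above can be sketched without a Spark dependency. This is a minimal standalone illustration of the decision order only; `SimpleRDD` and `pickPartitions` are hypothetical stand-ins, not Spark APIs:

```scala
// Hypothetical stand-in for an RDD: its partition count, and the
// partition count of its partitioner if it has one.
case class SimpleRDD(numPartitions: Int, partitionerPartitions: Option[Int] = None)

// Mimics defaultPartitioner's decision order:
// 1) reuse an existing partitioner from the largest RDD that has one,
// 2) else use spark.default.parallelism if set,
// 3) else use the partition count of the largest RDD.
def pickPartitions(defaultParallelism: Option[Int], rdds: SimpleRDD*): Int = {
  val bySize = rdds.sortBy(_.numPartitions).reverse
  bySize.collectFirst {
    case r if r.partitionerPartitions.isDefined => r.partitionerPartitions.get
  }.getOrElse {
    defaultParallelism.getOrElse(bySize.head.numPartitions)
  }
}
```

Note that an existing partitioner wins even when spark.default.parallelism is set, which is why the old "8 parallel tasks" wording did not describe the actual behavior.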

Author: Chen Chao <crazy...@gmail.com>

Closes #403 from CrazyJvm/patch-4 and squashes the following commits:

42f6c9e [Chen Chao] fix format
829a995 [Chen Chao] fix format
1568336 [Chen Chao] misleading task number of groupByKey


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9c40b9ea
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9c40b9ea
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9c40b9ea

Branch: refs/heads/master
Commit: 9c40b9ead0d17ad836b3507c701198645c33d878
Parents: 38877cc
Author: Chen Chao <crazy...@gmail.com>
Authored: Wed Apr 16 17:58:42 2014 -0700
Committer: Reynold Xin <r...@apache.org>
Committed: Wed Apr 16 17:58:42 2014 -0700

----------------------------------------------------------------------
 docs/scala-programming-guide.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/9c40b9ea/docs/scala-programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/scala-programming-guide.md b/docs/scala-programming-guide.md
index a07cd2e..2b0a51e 100644
--- a/docs/scala-programming-guide.md
+++ b/docs/scala-programming-guide.md
@@ -189,8 +189,8 @@ The following tables list the transformations and actions currently supported (s
 <tr>
   <td> <b>groupByKey</b>([<i>numTasks</i>]) </td>
  <td> When called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs. <br />
-<b>Note:</b> By default, this uses only 8 parallel tasks to do the grouping. You can pass an optional <code>numTasks</code> argument to set a different number of tasks.
-</td>
+<b>Note:</b> By default, if the RDD already has a partitioner, the task number is decided by the partition number of the partitioner, or else relies on the value of <code>spark.default.parallelism</code> if the property is set , otherwise depends on the partition number of the RDD. You can pass an optional <code>numTasks</code> argument to set a different number of tasks.
+  </td>
 </tr>
 <tr>
   <td> <b>reduceByKey</b>(<i>func</i>, [<i>numTasks</i>]) </td>
