For those numbers of partitions, I don't think you'll actually use tree aggregation. The number of partitions needs to be over a certain threshold (>= 7) before treeAggregate really operates on a tree structure: https://github.com/apache/spark/blob/9808052b5adfed7dafd6c1b3971b998e45b2799a/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1100
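Below is a minimal standalone sketch of how that cutoff check looks as I read the linked line; the scale formula and the comparison are my paraphrase rather than a verified copy of the Spark source, so the exact cutoff it prints may be off by one. Consult RDD.scala for the authoritative check.

  // Standalone sketch (not copied from Spark) of the cutoff check; the formula
  // below is a paraphrase of the linked line, so verify against RDD.scala.
  object TreeAggregateCutoff {
    // Returns true if, with the given depth, treeAggregate would insert at least
    // one extra aggregation level instead of reducing directly on the driver.
    def addsTreeLevel(numPartitions: Int, depth: Int = 2): Boolean = {
      // Rough branching factor: the depth-th root of the partition count, at least 2.
      val scale = math.max(math.ceil(math.pow(numPartitions, 1.0 / depth)).toInt, 2)
      // An extra level only pays off if it leaves the driver with fewer partial
      // results to merge than it started with.
      numPartitions > scale + math.ceil(numPartitions.toDouble / scale).toInt
    }

    def main(args: Array[String]): Unit = {
      (1 to 10).foreach(n => println(s"numPartitions = $n -> extra level: ${addsTreeLevel(n)}"))
    }
  }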
Do you see the running time increase more slowly once you use more partitions? For 5 partitions, do you find things improve if you tell treeAggregate to use depth > 2? (A sketch of passing depth explicitly is at the end of this message.)

Joseph

On Wed, Oct 14, 2015 at 1:18 PM, Ulanov, Alexander <alexander.ula...@hpe.com> wrote:

> Dear Spark developers,
>
> I have noticed that Gradient Descent in Spark MLlib takes a long time if the
> model is large. It is implemented with treeAggregate. I've extracted the
> code from GradientDescent.scala to perform the benchmark. It allocates an
> Array of a given size and then aggregates it:
>
>   val dataSize = 12000000
>   val n = 5
>   val maxIterations = 3
>   val rdd = sc.parallelize(0 until n, n).cache()
>   rdd.count()
>   var avgTime = 0.0
>   for (i <- 1 to maxIterations) {
>     val start = System.nanoTime()
>     val result = rdd.treeAggregate((new Array[Double](dataSize), 0.0, 0L))(
>       seqOp = (c, v) => {
>         // c: (grad, loss, count)
>         val l = 0.0
>         (c._1, c._2 + l, c._3 + 1)
>       },
>       combOp = (c1, c2) => {
>         // c: (grad, loss, count)
>         (c1._1, c1._2 + c2._2, c1._3 + c2._3)
>       })
>     avgTime += (System.nanoTime() - start) / 1e9
>     assert(result._1.length == dataSize)
>   }
>   println("Avg time: " + avgTime / maxIterations)
>
> If I run this on my cluster of 1 master and 5 workers, I get the following
> results (given the array size = 12M):
>
>   n = 1: Avg time: 4.555709667333333
>   n = 2: Avg time: 7.059724584666667
>   n = 3: Avg time: 9.937117377666667
>   n = 4: Avg time: 12.687526233
>   n = 5: Avg time: 12.939526129666667
>
> Could you explain why the time becomes so large? Transferring a 12M-element
> array of doubles should take ~1 second on a 1 Gbit network. There may be
> other overheads, but not as large as what I observe.
>
> Best regards, Alexander
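As a concrete illustration of the depth suggestion above, here is a hedged sketch of the same benchmark aggregation with an explicit depth argument. It assumes the Spark 1.x signature treeAggregate(zeroValue)(seqOp, combOp, depth: Int = 2) and an existing SparkContext named sc; note that each partial result carries the full gradient array, roughly 12,000,000 doubles * 8 bytes, about 96 MB.

  // Sketch only: same shape as the benchmark quoted above, but asking for depth > 2.
  // Assumes treeAggregate's second parameter list accepts depth (default 2).
  val dataSize = 12000000            // 12M doubles * 8 bytes ~= 96 MB per partial result
  val n = 5
  val rdd = sc.parallelize(0 until n, n).cache()

  val result = rdd.treeAggregate((new Array[Double](dataSize), 0.0, 0L))(
    seqOp = (c, v) => (c._1, c._2, c._3 + 1L),                  // dummy loss, as in the benchmark
    combOp = (c1, c2) => (c1._1, c1._2 + c2._2, c1._3 + c2._3),
    depth = 3)                                                   // default is 2
  assert(result._1.length == dataSize)

With only 5 partitions a deeper tree may not change much, but the comparison should show whether adding an intermediate aggregation level helps or just adds shuffle overhead.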