Dear Spark developers,
I have noticed that Gradient Descent in Spark MLlib takes a long time if the
model is large. It is implemented with treeAggregate. I've extracted the code
from GradientDescent.scala to perform a benchmark. It allocates an Array of a
given size and then aggregates it:
val dataSize = 12000000
val n = 5
val maxIterations = 3
// RDD with n elements in n partitions; cache and materialize it up front
// so the caching cost is excluded from the timed loop.
val rdd = sc.parallelize(0 until n, n).cache()
rdd.count()
var avgTime = 0.0
for (i <- 1 to maxIterations) {
  val start = System.nanoTime()
  val result = rdd.treeAggregate((new Array[Double](dataSize), 0.0, 0L))(
    seqOp = (c, v) => {
      // c: (grad, loss, count); the gradient array is left untouched,
      // only a dummy loss and the element count are accumulated
      val l = 0.0
      (c._1, c._2 + l, c._3 + 1)
    },
    combOp = (c1, c2) => {
      // c: (grad, loss, count)
      (c1._1, c1._2 + c2._2, c1._3 + c2._3)
    })
  avgTime += (System.nanoTime() - start) / 1e9
  assert(result._1.length == dataSize)
}
println("Avg time: " + avgTime / maxIterations)
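In case the shape of the aggregation tree matters here, treeAggregate also
accepts an optional depth parameter (default 2). A sketch of the same call
with an explicit depth; depth = 3 is only an illustrative value, not
something I have benchmarked:

  // Sketch only: identical aggregation, but with treeAggregate's optional
  // depth parameter passed explicitly (depth = 3 is an arbitrary example).
  val resultDepth = rdd.treeAggregate((new Array[Double](dataSize), 0.0, 0L))(
    seqOp = (c, v) => (c._1, c._2, c._3 + 1),
    combOp = (c1, c2) => (c1._1, c1._2 + c2._2, c1._3 + c2._3),
    depth = 3)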
If I run it on my cluster of 1 master and 5 workers, I get the following
results (with array size = 12M):
n = 1: Avg time: 4.555709667333333
n = 2: Avg time: 7.059724584666667
n = 3: Avg time: 9.937117377666667
n = 4: Avg time: 12.687526233
n = 5: Avg time: 12.939526129666667
Could you explain why the time grows so large? Transferring a 12M-element
array of doubles (~96 MB) should take about 1 second on a 1 Gbit network.
There are surely other overheads, but they should not be as big as what I
observe.
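For reference, the back-of-the-envelope arithmetic behind that ~1 second
estimate (illustrative numbers only):

  // Rough estimate of raw transfer time for one copy of the gradient array:
  val arrayBytes = 12000000L * 8L                 // 12M doubles = 96 MB
  val linkBytesPerSec = 1e9 / 8.0                 // 1 Gbit/s ~= 125 MB/s
  val transferSec = arrayBytes / linkBytesPerSec  // ~= 0.77 s per copy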
Best regards, Alexander