Hi,
I am working on artificial neural networks for Spark. Training is done with
gradient descent, so on each step the data is read, a sum of gradients is
computed for each data partition (on each worker), aggregated (on the driver),
and the result is broadcast back. I noticed that the gradient computation takes
a few times less time than the whole step. To narrow down my observation, I ran
the gradient on a single machine with a single partition of data of size 100MB
that I persist (data.persist). This should at least minimize the aggregation
overhead, but the gradient computation still takes much less time than the
whole step. Just in case: the data is loaded with MLUtils.loadLibSVMFile into
an RDD[LabeledPoint]. This is my code:
val conf = new SparkConf().setAppName("myApp").setMaster("local[2]")
val train = MLUtils.loadLibSVMFile(new SparkContext(conf),
  "/data/mnist/mnist.scale").repartition(1).persist()
// training data, batch size, hidden layer sizes, iterations, LBFGS tolerance
val model = ANN2Classifier.train(train, 1000, Array[Int](32), 10, 1e-4)
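For context, the per-step pattern is essentially the following (a simplified
sketch, not my actual ANN code; `grad` stands in for the real per-example
gradient function, and the step size is omitted):

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import breeze.linalg.{DenseVector => BDV}

// One optimization step: each partition sums its gradients, the driver
// aggregates the partial sums, and the updated weights are broadcast
// back for the next step.
def step(data: RDD[LabeledPoint], weights: BDV[Double],
         grad: (LabeledPoint, BDV[Double]) => BDV[Double]): BDV[Double] = {
  val bcWeights = data.context.broadcast(weights)
  val (gradSum, count) =
    data.treeAggregate((BDV.zeros[Double](weights.length), 0L))(
      seqOp  = { case ((g, n), p)            => (g + grad(p, bcWeights.value), n + 1) },
      combOp = { case ((g1, n1), (g2, n2))   => (g1 + g2, n1 + n2) })
  weights - gradSum / count.toDouble
}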
The profiler shows two threads: one is doing the Gradient, and I don't know
what the other is doing. The Gradient takes only 10% of its thread's time;
almost all of the remaining time is spent in MemoryStore. Below is the
screenshot (first thread):
https://drive.google.com/file/d/0BzYMzvDiCep5bGp2S2F6eE9TRlk/view?usp=sharing
Second thread:
https://drive.google.com/file/d/0BzYMzvDiCep5OHA0WUtQbXd3WmM/view?usp=sharing
Could Spark developers please elaborate on what is going on in MemoryStore? It
seems to do some string operations (parsing the libsvm file? why on every
step?) and a lot of InputStream reading. The overall time seems to depend on
the size of the data batch (or the size of the vectors) I am processing;
however, it does not look linear to me.
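In case it helps to reproduce, a minimal way to check whether the RDD is
actually held in the cache between iterations (standard Spark RDD/SparkContext
API; run after the first action has materialized the data) would be:

// Storage level the RDD was persisted with, e.g.
// StorageLevel(memory, deserialized, 1 replicas)
println(train.getStorageLevel)
// Per-RDD cache statistics: how many bytes are actually in memory
train.context.getRDDStorageInfo.foreach { info =>
  println(s"RDD ${info.id}: ${info.memSize} bytes in memory")
}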
Also, I would like to know how to speed up these operations.
Best regards, Alexander