Hi, I am working on artificial neural networks for Spark. Training is done with gradient descent: at each step the data is read, the sum of gradients is computed for each data partition (on each worker), aggregated (on the driver), and broadcast back. I noticed that the gradient computation accounts for only a small fraction of the total time of each step. To narrow down my observation, I ran the gradient on a single machine with a single partition of data of size 100MB that I persist (data.persist()). This should at least minimize the aggregation overhead, but the gradient computation still takes much less time than the whole step. Just in case: the data is loaded by MLUtils.loadLibSVMFile into an RDD[LabeledPoint]. This is my code:
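For context, the per-step computation I am describing follows this pattern (a minimal single-machine sketch in plain Python with a toy linear model; the function names are mine, not Spark APIs — in the real code the partial sums come from RDD partitions on the workers, and the updated weights are broadcast back):

```python
# Sketch of one gradient-descent step as described above:
# each partition computes a partial gradient sum, the "driver"
# aggregates them and updates the weights. Illustrative only.

def partial_gradient(partition, w):
    """Sum of per-example gradients for one data partition.

    Toy model: 1-D linear regression on (x, y) pairs, so the
    gradient of the squared error w.r.t. w is 2 * x * (w*x - y).
    """
    g = 0.0
    for x, y in partition:
        g += 2.0 * x * (w * x - y)
    return g

def gradient_descent_step(partitions, w, lr):
    # "Workers": one partial gradient sum per partition.
    partials = [partial_gradient(p, w) for p in partitions]
    # "Driver": aggregate the partial sums and update the weights.
    total = sum(partials)
    n = sum(len(p) for p in partitions)
    return w - lr * total / n

# Two partitions of y = 3x data; w should converge toward 3.
parts = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0)]]
w = 0.0
for _ in range(200):
    w = gradient_descent_step(parts, w, lr=0.05)
```

In my setup the gradient computation inside `partial_gradient` is the part the profiler attributes only ~10% of the time to; the question below is about where the rest goes.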
    val conf = new SparkConf().setAppName("myApp").setMaster("local[2]")
    val train = MLUtils.loadLibSVMFile(new SparkContext(conf), "/data/mnist/mnist.scale")
      .repartition(1).persist()
    // training data, batch size, hidden layer size, iterations, LBFGS tolerance
    val model = ANN2Classifier.train(train, 1000, Array[Int](32), 10, 1e-4)

The profiler shows two threads: one is computing the gradient, and I don't know what the other is doing. The gradient takes 10% of its thread's time; almost all of the remaining time is spent in MemoryStore. Below are the screenshots.

First thread: https://drive.google.com/file/d/0BzYMzvDiCep5bGp2S2F6eE9TRlk/view?usp=sharing
Second thread: https://drive.google.com/file/d/0BzYMzvDiCep5OHA0WUtQbXd3WmM/view?usp=sharing

Could the Spark developers please elaborate on what is going on in MemoryStore? It seems to perform some string operations (parsing the libsvm file? why on every step?) and a lot of InputStream reads. The overall time seems to depend on the size of the data batch (or the size of the vectors) I am processing, but the dependence does not look linear to me. I would also like to know how to speed these operations up.

Best regards,
Alexander