Hi all,

I'm benchmarking Logistic Regression in MLlib using the newly added LBFGS optimizer and GD. I'm using the same dataset and the same methodology as in this paper: http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf
I want to know how Spark scales as workers are added, and how the optimizer and the input format (sparse or dense) impact performance. The benchmark code can be found here: https://github.com/dbtsai/spark-lbfgs-benchmark

The first dataset I benchmarked is a9a, which is only 2.2MB. I duplicated it to 762MB, giving 11M rows. The dataset has 123 features, and 11% of the entries are non-zero. In this benchmark, the whole dataset is cached in memory.

As expected, LBFGS converges faster than GD, and beyond a certain point, no matter how hard we push GD, it converges more and more slowly. However, it's surprising that the sparse format runs slower than the dense format. I did see that the sparse format takes a significantly smaller amount of memory when caching the RDD, but sparse is 40% slower than dense. I would expect sparse to be faster, since when we compute x w^T the sparsity of x should let us skip the zero entries (see the sketch at the end of this mail). I wonder if there is anything I'm doing wrong.

The attachment is the benchmark result.

Thanks.

Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
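
P.S. To make the x w^T point concrete, here is a minimal sketch in plain Scala (not the actual MLlib implementation; the helper names are made up for illustration) of why I'd expect the sparse format to be cheaper: with a sparse x we only touch its non-zero entries (~11% of the 123 features here) instead of all of them.

    // Dense: always walks over every feature of x.
    def denseDot(x: Array[Double], w: Array[Double]): Double = {
      var sum = 0.0
      var i = 0
      while (i < x.length) { sum += x(i) * w(i); i += 1 }
      sum
    }

    // Sparse: indices/values hold only the non-zero entries of x,
    // so we iterate over roughly 11% of the features instead of all 123.
    def sparseDot(indices: Array[Int], values: Array[Double], w: Array[Double]): Double = {
      var sum = 0.0
      var k = 0
      while (k < indices.length) { sum += values(k) * w(indices(k)); k += 1 }
      sum
    }

Per dot product the sparse path should do far less arithmetic, which is why the 40% slowdown I'm seeing surprises me.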