Hi all,

I'm benchmarking Logistic Regression in MLlib using the newly added LBFGS
optimizer and GD. I'm using the same dataset and the same methodology as in
this paper, http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf

I want to know how Spark scales as workers are added, and how the optimizer
and the input format (sparse or dense) impact performance.

The benchmark code can be found here,
https://github.com/dbtsai/spark-lbfgs-benchmark

The first dataset I benchmarked is a9a, which is only 2.2MB. I duplicated
the dataset to 762MB so that it has 11M rows. The dataset has 123 features,
and about 11% of the entries are non-zero.

In this benchmark, the entire dataset is cached in memory.
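
For reference, here is roughly how the data is loaded and cached (a minimal
sketch; the path "data/a9a" and the duplication factor 350 are placeholders,
not the exact values in the benchmark repo):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.util.MLUtils

val sc = new SparkContext(new SparkConf().setAppName("a9a-benchmark"))

// Load the original 2.2MB a9a dataset in LIBSVM format (123 features).
val original = MLUtils.loadLibSVMFile(sc, "data/a9a")

// Duplicate it many times to get ~11M rows; the factor 350 is illustrative.
val duplicated = sc.union(Seq.fill(350)(original))

// Cache the whole dataset in memory so every optimizer iteration
// reads from RAM instead of re-parsing the input.
val cached = duplicated.cache()
cached.count()  // materialize the cache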

As expected, LBFGS converges faster than GD, and past a certain point, no
matter how we tune GD, it converges more and more slowly.
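
For context, both optimizers are driven through the same low-level MLlib API,
along these lines (a rough sketch, not the exact benchmark code; the step
size, iteration counts, and tolerance are placeholders, and `cached` is the
RDD from the loading snippet above):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{GradientDescent, LBFGS, LogisticGradient, SquaredL2Updater}

// Both optimizers operate on RDD[(label, features)].
val data = cached.map(lp => (lp.label, lp.features)).cache()
val initialWeights = Vectors.dense(new Array[Double](123))

// L-BFGS: uses the last few gradient/step pairs to approximate curvature,
// so each iteration makes much more progress than plain GD.
val (lbfgsWeights, lbfgsLoss) = LBFGS.runLBFGS(
  data, new LogisticGradient(), new SquaredL2Updater(),
  10,    // numCorrections
  1e-9,  // convergence tolerance
  50,    // max iterations
  0.0,   // regParam
  initialWeights)

// GD: full-batch gradient descent (miniBatchFraction = 1.0); the step size
// has to be hand-tuned, and progress per iteration keeps shrinking.
val (gdWeights, gdLoss) = GradientDescent.runMiniBatchSGD(
  data, new LogisticGradient(), new SquaredL2Updater(),
  1.0,   // stepSize
  50,    // numIterations
  0.0,   // regParam
  1.0,   // miniBatchFraction
  initialWeights)

println("LBFGS loss history: " + lbfgsLoss.mkString(", "))
println("GD loss history:    " + gdLoss.mkString(", "))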

However, it's surprising that the sparse format runs slower than the dense
format. I did see that the sparse format takes a significantly smaller amount
of memory when caching the RDD, but sparse is 40% slower than dense. I would
expect sparse to be faster: when we compute w^T x, since x is sparse, the dot
product only needs to touch the non-zero elements. I wonder if there is
anything I'm doing wrong.
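
To spell out why I expected sparse to win, the dot product only has to walk
the non-zero entries, roughly like this (just an illustration of the idea,
not the actual MLlib code path):

import org.apache.spark.mllib.linalg.{DenseVector, SparseVector}

// Dense: w dot x touches all 123 components, zeros included.
def dotDense(w: Array[Double], x: DenseVector): Double = {
  var sum = 0.0
  var i = 0
  while (i < x.size) {
    sum += w(i) * x.values(i)
    i += 1
  }
  sum
}

// Sparse: only the ~11% non-zero components are touched, at the cost of
// one extra indirection through the index array per element.
def dotSparse(w: Array[Double], x: SparseVector): Double = {
  var sum = 0.0
  var k = 0
  while (k < x.indices.length) {
    sum += w(x.indices(k)) * x.values(k)
    k += 1
  }
  sum
}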

The attachment is the benchmark result.

Thanks.

Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai