In MLlib, the weights and the gradient are dense; only the features are sparse.
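For illustration, here is a minimal Breeze sketch of that layout (the dimensions and non-zero positions below are made up to match the a9a shape, not taken from the benchmark): the margin w^T x only has to visit the non-zero entries of the sparse feature vector.

    import breeze.linalg.{DenseVector, SparseVector}

    // Weights (and the accumulated gradient) are kept dense.
    val w = DenseVector.fill(123)(0.01)

    // A feature row is sparse: a9a has 123 features, ~11% non-zero.
    // (Indices and values here are hypothetical, for illustration only.)
    val x = SparseVector(123)((0, 1.0), (37, 1.0), (105, 1.0))

    // The dot product iterates only the active entries of x,
    // costing O(nnz(x)) instead of O(123).
    val margin = x dot w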
Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Wed, Apr 23, 2014 at 10:16 PM, David Hall <d...@cs.berkeley.edu> wrote:
> Was the weight vector sparse? The gradients? Or just the feature vectors?
>
>
> On Wed, Apr 23, 2014 at 10:08 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>>
>> The figure showing the log-likelihood vs. time can be found here:
>>
>> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/fd703303fb1c16ef5714901739154728550becf4/result/a9a11M.pdf
>>
>> Let me know if you cannot open it.
>>
>> Sincerely,
>>
>> DB Tsai
>> -------------------------------------------------------
>> My Blog: https://www.dbtsai.com
>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>
>>
>> On Wed, Apr 23, 2014 at 9:34 PM, Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>>
>> > I don't think the attachment came through on the list. Could you upload
>> > the results somewhere and link to them?
>> >
>> >
>> > On Wed, Apr 23, 2014 at 9:32 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>> >
>> >> 123 features per row, and on average 89% are zeros.
>> >> On Apr 23, 2014 9:31 PM, "Evan Sparks" <evan.spa...@gmail.com> wrote:
>> >>
>> >> > What is the number of non-zeros per row (and the number of features)
>> >> > in the sparse case? We've hit some issues with Breeze sparse support
>> >> > in the past, but for sufficiently sparse data it's still pretty good.
>> >> >
>> >> > > On Apr 23, 2014, at 9:21 PM, DB Tsai <dbt...@stanford.edu> wrote:
>> >> > >
>> >> > > Hi all,
>> >> > >
>> >> > > I'm benchmarking Logistic Regression in MLlib using the newly added
>> >> > > optimizers LBFGS and GD. I'm using the same dataset and the same
>> >> > > methodology as in this paper: http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf
>> >> > >
>> >> > > I want to know how Spark scales as workers are added, and how the
>> >> > > optimizers and the input format (sparse or dense) impact performance.
>> >> > >
>> >> > > The benchmark code can be found here:
>> >> > > https://github.com/dbtsai/spark-lbfgs-benchmark
>> >> > >
>> >> > > The first dataset I benchmarked is a9a, which is only 2.2MB. I
>> >> > > duplicated the dataset to 762MB so that it has 11M rows. This
>> >> > > dataset has 123 features, and 11% of the entries are non-zero.
>> >> > >
>> >> > > In this benchmark, the entire dataset is cached in memory.
>> >> > >
>> >> > > As we expect, LBFGS converges faster than GD, and at some point, no
>> >> > > matter how we push GD, it converges more and more slowly.
>> >> > >
>> >> > > However, it's surprising that the sparse format runs slower than
>> >> > > the dense format. I did see that the sparse format takes a
>> >> > > significantly smaller amount of memory when caching the RDD, but
>> >> > > sparse is 40% slower than dense. I would expect sparse to be fast:
>> >> > > when we compute the margin w^T x, since x is sparse, we only need
>> >> > > to touch its non-zero entries. I wonder if there is anything I'm
>> >> > > doing wrong.
>> >> > >
>> >> > > The attachment is the benchmark result.
>> >> > >
>> >> > > Thanks.
>> >> > >
>> >> > > Sincerely,
>> >> > >
>> >> > > DB Tsai
>> >> > > -------------------------------------------------------
>> >> > > My Blog: https://www.dbtsai.com
>> >> > > LinkedIn: https://www.linkedin.com/in/dbtsai
>> >> >
>> >>
>> >
>>
>
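For anyone reproducing the comparison, the benchmark exercises MLlib's optimization API roughly as below. This is a minimal sketch against the Spark 1.0-era API, not the exact code in the linked repository; the input path, the SparkContext sc, and the parameter values are assumptions.

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}
    import org.apache.spark.mllib.util.MLUtils

    // Load a9a in LibSVM format; features are parsed as sparse vectors.
    // (Path and SparkContext `sc` are assumed.)
    val data = MLUtils.loadLibSVMFile(sc, "a9a")
      .map(p => (p.label, p.features))
      .cache()

    val initialWeights = Vectors.dense(new Array[Double](123))

    // Run L-BFGS on the logistic loss. lossHistory holds the loss at each
    // iteration, which is what a log-likelihood vs. time plot is built from.
    val (weights, lossHistory) = LBFGS.runLBFGS(
      data,
      new LogisticGradient(),
      new SquaredL2Updater(),
      10,    // numCorrections: L-BFGS history size
      1e-4,  // convergenceTol
      100,   // maxNumIterations
      0.0,   // regParam
      initialWeights)

The GD side of the comparison goes through the analogous GradientDescent.runMiniBatchSGD entry point with the same gradient and updater.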