I don't think it is easy to make sparse faster than dense at this
sparsity and feature dimension. You can try rcv1.binary, which should
show the difference clearly.
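
To make that concrete, here is a rough sketch (my own illustration, not
the MLlib or breeze code) of the two inner loops. With 123 features and
~11% non-zeros, the sparse dot saves arithmetic (~13 multiplies instead
of 123), but every iteration pays for an indirect load into the weight
array, so the dense loop's locality and vectorization usually win at
this scale:

    // Dense dot product: one tight loop over 123 contiguous doubles.
    def denseDot(x: Array[Double], w: Array[Double]): Double = {
      var s = 0.0
      var i = 0
      while (i < x.length) {
        s += x(i) * w(i)
        i += 1
      }
      s
    }

    // Sparse dot product: only ~13 iterations for a9a, but each one does
    // an indirect load w(indices(k)), which hurts locality and defeats
    // vectorization.
    def sparseDot(indices: Array[Int], values: Array[Double],
                  w: Array[Double]): Double = {
      var s = 0.0
      var k = 0
      while (k < values.length) {
        s += values(k) * w(indices(k))
        k += 1
      }
      s
    }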

David, the breeze operators used here are

1. DenseVector dot SparseVector
2. axpy DenseVector SparseVector

However, the SparseVector is passed in as Vector[Double] rather than
SparseVector[Double]. It might therefore pick the axpy impl for
[DenseVector, Vector] and fall back to activeIterator. I didn't check
whether you use multimethods for axpy.
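
A minimal sketch of the concern (my own illustration; how it actually
resolves depends on how breeze registers these implementations, and the
same question applies to DenseVector dot SparseVector):

    import breeze.linalg.{axpy, DenseVector, SparseVector, Vector => BV}

    val w = DenseVector.zeros[Double](5)
    val xSparse: SparseVector[Double] = SparseVector(1.0, 0.0, 0.0, 2.0, 0.0)
    val xGeneric: BV[Double] = xSparse  // same data, weaker static type

    // Statically typed as SparseVector: can select the sparse-specialized impl.
    axpy(0.1, xSparse, w)

    // Statically typed as Vector[Double]: if selection is purely static, this
    // resolves to the generic [DenseVector, Vector] impl (activeIterator path);
    // if axpy dispatches on the runtime type, it may still hit the sparse one.
    axpy(0.1, xGeneric, w)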

Best,
Xiangrui

On Wed, Apr 23, 2014 at 10:35 PM, DB Tsai <dbt...@stanford.edu> wrote:
> The figure showing the Log-Likelihood vs Time can be found here.
>
> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/fd703303fb1c16ef5714901739154728550becf4/result/a9a11M.pdf
>
> Let me know if you cannot open it. Thanks.
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Wed, Apr 23, 2014 at 9:34 PM, Shivaram Venkataraman
> <shiva...@eecs.berkeley.edu> wrote:
>> I don't think the attachment came through on the list. Could you upload the
>> results somewhere and link to them?
>>
>>
>> On Wed, Apr 23, 2014 at 9:32 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>>>
>>> 123 features per row, and on average 89% of the entries are zeros.
>>> On Apr 23, 2014 9:31 PM, "Evan Sparks" <evan.spa...@gmail.com> wrote:
>>>
>>> > What is the number of non-zeros per row (and the number of features) in
>>> > the sparse case? We've hit some issues with breeze sparse support in the
>>> > past, but for sufficiently sparse data it's still pretty good.
>>> >
>>> > > On Apr 23, 2014, at 9:21 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>> > >
>>> > > Hi all,
>>> > >
>>> > > I'm benchmarking Logistic Regression in MLlib using the newly added
>>> > > optimizers LBFGS and GD. I'm using the same dataset and the same
>>> > > methodology as in this paper: http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf
>>> > >
>>> > > I want to know how Spark scales as workers are added, and how the
>>> > > optimizers and the input format (sparse or dense) impact performance.
>>> > >
>>> > > The benchmark code can be found here:
>>> > > https://github.com/dbtsai/spark-lbfgs-benchmark
>>> > >
>>> > > The first dataset I benchmarked is a9a, which is only 2.2MB. I
>>> > > duplicated the dataset to 762MB so that it has 11M rows. This dataset
>>> > > has 123 features, and 11% of the entries are non-zero.
>>> > >
>>> > > In this benchmark, the entire dataset is cached in memory.
>>> > >
>>> > > As we expected, LBFGS converges faster than GD, and at some point, no
>>> > > matter how we push GD, it converges more and more slowly.
>>> > >
>>> > > However, it's surprising that the sparse format runs slower than the
>>> > > dense format. I did see that the sparse format takes significantly less
>>> > > memory when caching the RDD, but sparse is 40% slower than dense. I
>>> > > thought sparse should be faster: when we compute x w^T, since x is
>>> > > sparse, we only need to touch the non-zero elements. I wonder if there
>>> > > is anything I'm doing wrong.
>>> > >
>>> > > The attachment is the benchmark result.
>>> > >
>>> > > Thanks.
>>> > >
>>> > > Sincerely,
>>> > >
>>> > > DB Tsai
>>> > > -------------------------------------------------------
>>> > > My Blog: https://www.dbtsai.com
>>> > > LinkedIn: https://www.linkedin.com/in/dbtsai
>>> >
>>
>>
