Hi DB,

I saw you are using yarn-cluster mode for the benchmark. I tested the
yarn-cluster mode and found that YARN does not always give you the
exact number of executors requested. Just want to confirm that you've
checked the number of executors.

The second thing to check is that in the benchmark code, after you
call cache(), you should also call count() to materialize the RDD.
Otherwise the first iteration also pays for computing and caching the
data. In the results, the real difference is actually at the first
step. Adding the intercept is not a cheap operation for sparse vectors.
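
To illustrate the intercept cost, here is a rough sketch in plain
Python. It is not the MLlib or breeze code: the SparseVec class and the
append_bias and dot helpers are hypothetical names, and it assumes a
sparse vector is stored as parallel index and value arrays. Under that
assumption, appending a constant 1.0 feature means copying both arrays
for every row.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SparseVec:
    """Hypothetical sparse vector: sorted indices plus matching values."""
    size: int
    indices: List[int]
    values: List[float]


def append_bias(v: SparseVec) -> SparseVec:
    """Return a copy of v with a constant 1.0 feature appended at the end."""
    # Both arrays are copied and grow by one entry, so each row pays an
    # allocation and a copy proportional to its number of non-zeros.
    return SparseVec(v.size + 1, v.indices + [v.size], v.values + [1.0])


def dot(weights: List[float], v: SparseVec) -> float:
    """Dense weights dot sparse vector, touching only the stored entries."""
    return sum(weights[i] * x for i, x in zip(v.indices, v.values))


# A made-up row with 123 features, matching the a9a feature count.
row = SparseVec(size=123, indices=[3, 40, 99], values=[1.0, 1.0, 1.0])
with_bias = append_bias(row)
```

If this is done inside the timed loop rather than once before caching,
the per-row copies are counted against the sparse runs.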

Best,
Xiangrui

On Thu, Apr 24, 2014 at 12:53 AM, Xiangrui Meng <men...@gmail.com> wrote:
> I don't think it is easy to make sparse faster than dense with this
> sparsity and feature dimension. You can try rcv1.binary, which should
> show the difference easily.
>
> David, the breeze operators used here are
>
> 1. DenseVector dot SparseVector
> 2. axpy DenseVector SparseVector
>
> However, the SparseVector is passed in as Vector[Double] instead of
> SparseVector[Double]. It might use the axpy impl of [DenseVector,
> Vector] and call activeIterator. I didn't check whether you used
> multimethods on axpy.
>
> Best,
> Xiangrui
>
> On Wed, Apr 23, 2014 at 10:35 PM, DB Tsai <dbt...@stanford.edu> wrote:
>> The figure showing the Log-Likelihood vs Time can be found here.
>>
>> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/fd703303fb1c16ef5714901739154728550becf4/result/a9a11M.pdf
>>
>> Let me know if you cannot open it. Thanks.
>>
>> Sincerely,
>>
>> DB Tsai
>> -------------------------------------------------------
>> My Blog: https://www.dbtsai.com
>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>
>>
>> On Wed, Apr 23, 2014 at 9:34 PM, Shivaram Venkataraman
>> <shiva...@eecs.berkeley.edu> wrote:
>>> I don't think the attachment came through on the list. Could you upload the
>>> results somewhere and link to them?
>>>
>>>
>>> On Wed, Apr 23, 2014 at 9:32 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>>>>
>>>> 123 features per row, and on average, 89% are zeros.
>>>> On Apr 23, 2014 9:31 PM, "Evan Sparks" <evan.spa...@gmail.com> wrote:
>>>>
>>>> > What is the number of non-zeros per row (and the number of features) in
>>>> > the sparse case? We've hit some issues with breeze sparse support in the
>>>> > past, but for sufficiently sparse data it's still pretty good.
>>>> >
>>>> > > On Apr 23, 2014, at 9:21 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>> > >
>>>> > > Hi all,
>>>> > >
>>>> > > I'm benchmarking Logistic Regression in MLlib using the newly added
>>>> > > optimizers LBFGS and GD. I'm using the same dataset and the same
>>>> > > methodology as in this paper, http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf
>>>> > >
>>>> > > I want to know how Spark scales while adding workers, and how the
>>>> > > optimizers and input format (sparse or dense) impact performance.
>>>> > >
>>>> > > The benchmark code can be found here:
>>>> > > https://github.com/dbtsai/spark-lbfgs-benchmark
>>>> > >
>>>> > > The first dataset I benchmarked is a9a, which is only 2.2MB. I
>>>> > > duplicated the dataset to make it 762MB with 11M rows. This dataset
>>>> > > has 123 features, and 11% of the elements are non-zero.
>>>> > >
>>>> > > In this benchmark, the entire dataset is cached in memory.
>>>> > >
>>>> > > As expected, LBFGS converges faster than GD, and at some point, no
>>>> > > matter how we push GD, it converges slower and slower.
>>>> > >
>>>> > > However, it's surprising that the sparse format runs slower than the
>>>> > > dense format. I did see that the sparse format takes significantly less
>>>> > > memory when caching the RDD, but sparse is 40% slower than dense. I
>>>> > > expected sparse to be faster: when we compute x w^T and x is sparse,
>>>> > > we can do it faster. I wonder if there is anything I'm doing wrong.
>>>> > >
>>>> > > The attachment is the benchmark result.
>>>> > >
>>>> > > Thanks.
>>>> > >
>>>> > > Sincerely,
>>>> > >
>>>> > > DB Tsai
>>>> > > -------------------------------------------------------
>>>> > > My Blog: https://www.dbtsai.com
>>>> > > LinkedIn: https://www.linkedin.com/in/dbtsai
>>>> >
>>>
>>>
