I don't understand why sparse falls behind dense so much at the very
first iteration. I don't see count() being called anywhere in
https://github.com/dbtsai/spark-lbfgs-benchmark/blob/master/src/main/scala/org/apache/spark/mllib/benchmark/BinaryLogisticRegression.scala
Maybe you have local uncommitted changes.
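
For reference, a minimal sketch of what I mean: force materialization with
count() before starting the timer. This assumes an existing SparkContext sc;
the path and variable names are placeholders, not the benchmark code.

    import org.apache.spark.mllib.util.MLUtils

    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/a9a").cache()
    data.count()  // materialize the cached RDD so caching cost is excluded

    val start = System.nanoTime()
    // ... run the optimizer ...
    val elapsedSec = (System.nanoTime() - start) / 1e9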

Best,
Xiangrui

On Thu, Apr 24, 2014 at 11:26 AM, DB Tsai <dbt...@stanford.edu> wrote:
> Hi Xiangrui,
>
> Yes, I'm using yarn-cluster mode, and I did check that the number of
> executors I specified matches the number of executors actually running.
>
> For caching and materialization, I start the timer in the optimizer after
> calling count(); as a result, the time to materialize the cache isn't
> included in the benchmark.
>
> The difference you saw actually comes from using dense versus sparse
> feature vectors. With dense features, you can see that the first iteration
> takes the same time for both LBFGS and GD.
>
> I'm going to run rcv1.binary, which only has 0.15% non-zero elements, to
> verify the hypothesis.
>
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Thu, Apr 24, 2014 at 1:09 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>> Hi DB,
>>
>> I saw you are using yarn-cluster mode for the benchmark. I tested the
>> yarn-cluster mode and found that YARN does not always give you the
>> exact number of executors requested. Just want to confirm that you've
>> checked the number of executors.
>>
>> The second thing to check is that in the benchmark code, after you
>> call cache(), you should also call count() to materialize the RDD. I saw
>> in the result that the real difference is actually at the first step.
>> Adding an intercept is not a cheap operation for sparse vectors.
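>>
>> To illustrate (a rough sketch of the idea, not the actual MLlib code):
>> appending a bias term of 1.0 to a sparse vector allocates and copies new
>> index and value arrays for every single row.
>>
>>     import org.apache.spark.mllib.linalg.SparseVector
>>
>>     // Per row: two array allocations plus copies of all nnz entries.
>>     def appendIntercept(v: SparseVector): SparseVector = {
>>       val indices = java.util.Arrays.copyOf(v.indices, v.indices.length + 1)
>>       val values = java.util.Arrays.copyOf(v.values, v.values.length + 1)
>>       indices(v.indices.length) = v.size
>>       values(v.values.length) = 1.0
>>       new SparseVector(v.size + 1, indices, values)
>>     }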
>>
>> Best,
>> Xiangrui
>>
>> On Thu, Apr 24, 2014 at 12:53 AM, Xiangrui Meng <men...@gmail.com> wrote:
>> > I don't think it is easy to make sparse faster than dense with this
>> > sparsity and feature dimension. You can try rcv1.binary, which should
>> > show the difference easily.
>> >
>> > David, the breeze operators used here are
>> >
>> > 1. DenseVector dot SparseVector
>> > 2. axpy DenseVector SparseVector
>> >
>> > However, the SparseVector is passed in as Vector[Double] instead of
>> > SparseVector[Double]. It might therefore use the axpy implementation for
>> > [DenseVector, Vector] and fall back to activeIterator. I didn't check
>> > whether you used multimethods on axpy.
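>> >
>> > A sketch of what I suspect happens (assuming Breeze's axpy UFunc; the
>> > vectors here are made up for illustration): the static type at the call
>> > site decides which implementation gets resolved.
>> >
>> >     import breeze.linalg.{axpy, DenseVector, SparseVector, Vector => BV}
>> >
>> >     val w = DenseVector.zeros[Double](123)
>> >     val x = SparseVector(123)(0 -> 1.0, 5 -> 2.0)
>> >
>> >     // Statically typed as the generic Vector, as in the optimizer code:
>> >     val wGeneric: BV[Double] = w
>> >     val xGeneric: BV[Double] = x
>> >
>> >     axpy(1.0, x, w)                // specialized dense/sparse impl
>> >     axpy(1.0, xGeneric, wGeneric)  // generic impl via activeIterator
>> >
>> >     // Matching on the runtime type should recover the fast path:
>> >     xGeneric match {
>> >       case sv: SparseVector[Double @unchecked] => axpy(1.0, sv, w)
>> >       case other => axpy(1.0, other, wGeneric)
>> >     }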
>> >
>> > Best,
>> > Xiangrui
>> >
>> > On Wed, Apr 23, 2014 at 10:35 PM, DB Tsai <dbt...@stanford.edu> wrote:
>> >> The figure showing the Log-Likelihood vs Time can be found here.
>> >>
>> >>
>> >> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/fd703303fb1c16ef5714901739154728550becf4/result/a9a11M.pdf
>> >>
>> >> Let me know if you cannot open it. Thanks.
>> >>
>> >> Sincerely,
>> >>
>> >> DB Tsai
>> >> -------------------------------------------------------
>> >> My Blog: https://www.dbtsai.com
>> >> LinkedIn: https://www.linkedin.com/in/dbtsai
>> >>
>> >>
>> >> On Wed, Apr 23, 2014 at 9:34 PM, Shivaram Venkataraman
>> >> <shiva...@eecs.berkeley.edu> wrote:
>> >>> I don't think the attachment came through on the list. Could you
>> >>> upload the results somewhere and link to them?
>> >>>
>> >>>
>> >>> On Wed, Apr 23, 2014 at 9:32 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>> >>>>
>> >>>> 123 features per row, and on average, 89% are zeros.
>> >>>> On Apr 23, 2014 9:31 PM, "Evan Sparks" <evan.spa...@gmail.com> wrote:
>> >>>>
>> >>>> > What is the number of non-zeros per row (and the number of features)
>> >>>> > in the sparse case? We've hit some issues with breeze sparse support
>> >>>> > in the past, but for sufficiently sparse data it's still pretty good.
>> >>>> >
>> >>>> > > On Apr 23, 2014, at 9:21 PM, DB Tsai <dbt...@stanford.edu> wrote:
>> >>>> > >
>> >>>> > > Hi all,
>> >>>> > >
>> >>>> > > I'm benchmarking Logistic Regression in MLlib using the newly
>> >>>> > > added LBFGS optimizer and GD. I'm using the same dataset and the
>> >>>> > > same methodology as in this paper:
>> >>>> > > http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf
>> >>>> > >
>> >>>> > > I want to know how Spark scales as workers are added, and how the
>> >>>> > > optimizers and input format (sparse or dense) impact performance.
>> >>>> > >
>> >>>> > > The benchmark code can be found here:
>> >>>> > > https://github.com/dbtsai/spark-lbfgs-benchmark
>> >>>> > >
>> >>>> > > The first dataset I benchmarked is a9a, which is only 2.2MB. I
>> >>>> > > duplicated the dataset to 762MB so that it has 11M rows. This
>> >>>> > > dataset has 123 features, and 11% of the entries are non-zero.
>> >>>> > >
>> >>>> > > In this benchmark, the entire dataset is cached in memory.
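>> >>>> > >
>> >>>> > > Roughly, the setup looks like the sketch below (the path and the
>> >>>> > > optimizer parameters are placeholders, not the exact benchmark code):
>> >>>> > >
>> >>>> > >     import org.apache.spark.mllib.linalg.Vectors
>> >>>> > >     import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}
>> >>>> > >     import org.apache.spark.mllib.util.MLUtils
>> >>>> > >
>> >>>> > >     // Load, cache, and materialize the data before timing.
>> >>>> > >     val training = MLUtils.loadLibSVMFile(sc, "hdfs:///data/a9a-11M")
>> >>>> > >       .map(p => (p.label, p.features)).cache()
>> >>>> > >     training.count()
>> >>>> > >
>> >>>> > >     val (weights, lossHistory) = LBFGS.runLBFGS(
>> >>>> > >       training, new LogisticGradient(), new SquaredL2Updater(),
>> >>>> > >       numCorrections = 10, convergenceTol = 1e-9,
>> >>>> > >       maxNumIterations = 100, regParam = 0.0,
>> >>>> > >       initialWeights = Vectors.dense(new Array[Double](123)))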
>> >>>> > >
>> >>>> > > As expected, LBFGS converges faster than GD, and past some point,
>> >>>> > > no matter how we push GD, it converges more and more slowly.
>> >>>> > >
>> >>>> > > However, it's surprising that the sparse format runs slower than
>> >>>> > > the dense format. I did see that the sparse format takes
>> >>>> > > significantly less memory when caching the RDD, but sparse is 40%
>> >>>> > > slower than dense. I think sparse should be faster: when we compute
>> >>>> > > w^T x, since x is sparse, we only need to touch its non-zero
>> >>>> > > elements. I wonder if there is anything I'm doing wrong.
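>> >>>> > >
>> >>>> > > For instance, a sparse dot product only touches the active entries,
>> >>>> > > so it costs O(nnz) rather than O(d). A minimal sketch (hypothetical
>> >>>> > > helper, not the MLlib code):
>> >>>> > >
>> >>>> > >     // w^T x using only the non-zero entries of sparse x.
>> >>>> > >     def sparseDot(indices: Array[Int], values: Array[Double],
>> >>>> > >                   w: Array[Double]): Double = {
>> >>>> > >       var sum = 0.0
>> >>>> > >       var i = 0
>> >>>> > >       while (i < indices.length) {
>> >>>> > >         sum += values(i) * w(indices(i))
>> >>>> > >         i += 1
>> >>>> > >       }
>> >>>> > >       sum
>> >>>> > >     }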
>> >>>> > >
>> >>>> > > The attachment is the benchmark result.
>> >>>> > >
>> >>>> > > Thanks.
>> >>>> > >
>> >>>> > > Sincerely,
>> >>>> > >
>> >>>> > > DB Tsai
>> >>>> > > -------------------------------------------------------
>> >>>> > > My Blog: https://www.dbtsai.com
>> >>>> > > LinkedIn: https://www.linkedin.com/in/dbtsai
>> >>>> >
>> >>>
>> >>>
>
>
