That's right. FWIW, caching should be automatic now, but it might be that the version of Breeze you're using doesn't do that yet.
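To illustrate the idea (a toy sketch with made-up names, not Breeze's actual CachedDiffFunction): the point of that kind of caching is to remember the last evaluated point, so repeated evaluations at the same point reuse the stored result instead of recomputing the objective.

```scala
// Hypothetical sketch of the caching idea (NOT Breeze's CachedDiffFunction):
// remember the last evaluated point so a repeated evaluation at the same
// point reuses the stored (value, gradient) pair instead of recomputing.
object CachedFnDemo {
  var evals = 0 // counts how many times the underlying function really runs

  // a toy "objective": returns (value, gradient) at x
  def expensive(x: Double): (Double, Double) = { evals += 1; (x * x, 2 * x) }

  class Cached(f: Double => (Double, Double)) extends (Double => (Double, Double)) {
    private var last: Option[(Double, (Double, Double))] = None
    def apply(x: Double): (Double, Double) = last match {
      case Some((lx, res)) if lx == x => res // cache hit: same point as last time
      case _ =>
        val res = f(x)
        last = Some((x, res))
        res
    }
  }

  def main(args: Array[String]): Unit = {
    val g = new Cached(expensive)
    g(3.0); g(3.0); g(4.0) // the repeated evaluation at 3.0 is served from cache
    println(evals) // prints 2: only the two distinct points were computed
  }
}
```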
Also, in breeze.util._ there's an implicit that adds a tee method to iterators, and also a last method. Both are useful for things like this.

-- David

On Sun, Apr 27, 2014 at 11:53 PM, DB Tsai <dbt...@stanford.edu> wrote:
> I think I figured it out. Instead of calling minimize and recording the
> loss in the DiffFunction, I should do the following.
>
>     val states = lbfgs.iterations(new CachedDiffFunction(costFun),
>       initialWeights.toBreeze.toDenseVector)
>     states.foreach(state => lossHistory.append(state.value))
>
> All the losses in states should be decreasing now. Am I right?
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
> On Sun, Apr 27, 2014 at 11:31 PM, DB Tsai <dbt...@stanford.edu> wrote:
>> Also, how many rejected steps will terminate the optimization process?
>> How is that related to "numberOfImprovementFailures"?
>>
>> Thanks.
>>
>> DB Tsai
>>
>> On Sun, Apr 27, 2014 at 11:28 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>> Hi David,
>>>
>>> I'm recording the loss history in the DiffFunction implementation, and
>>> that's why the rejected steps are also recorded in my loss history.
>>>
>>> Is there any API in Breeze LBFGS to get the history that already
>>> excludes the rejected steps? Or should I just call the "iterations"
>>> method and check "iteratingShouldStop" instead?
>>>
>>> Thanks.
>>>
>>> DB Tsai
>>>
>>> On Fri, Apr 25, 2014 at 3:10 PM, David Hall <d...@cs.berkeley.edu> wrote:
>>>> LBFGS will not take a step that sends the objective value up.
>>>> It might try a step that is "too big" and reject it, so if you're just
>>>> logging everything that gets tried by LBFGS, you could see that. The
>>>> "iterations" method of the minimizer should never return an increasing
>>>> objective value. If you're regularizing, are you including the
>>>> regularizer in the objective value computation?
>>>>
>>>> GD is almost never worth your time.
>>>>
>>>> -- David
>>>>
>>>> On Fri, Apr 25, 2014 at 2:57 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>> Another interesting benchmark.
>>>>>
>>>>> *News20 dataset - 0.14M rows, 1,355,191 features, 0.034% non-zero elements.*
>>>>>
>>>>> LBFGS converges in 70 seconds, while GD seems to make no progress.
>>>>>
>>>>> A dense feature vector would be too big to fit in memory, so I only
>>>>> ran the sparse benchmark.
>>>>>
>>>>> I saw the loss bump up sometimes, which is weird to me. Since the cost
>>>>> function of logistic regression is convex, it should be monotonically
>>>>> decreasing. David, any suggestion?
>>>>>
>>>>> The detailed figure:
>>>>> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/news20.pdf
>>>>>
>>>>> *Rcv1 dataset - 6.8M rows, 677,399 features, 0.15% non-zero elements.*
>>>>>
>>>>> LBFGS converges in 25 seconds, while GD again seems to make no progress.
>>>>>
>>>>> I only ran the sparse benchmark for the same reason. I also saw the
>>>>> loss bump up, for unknown reasons.
>>>>>
>>>>> The detailed figure:
>>>>> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/rcv1.pdf
>>>>>
>>>>> Sincerely,
>>>>>
>>>>> DB Tsai
>>>>>
>>>>> On Thu, Apr 24, 2014 at 2:36 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>>> rcv1.binary is too sparse (0.15% non-zero elements), so the dense
>>>>>> format will not run due to running out of memory. But the sparse
>>>>>> format runs really well.
>>>>>>
>>>>>> DB Tsai
>>>>>>
>>>>>> On Thu, Apr 24, 2014 at 1:54 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>>>> I start the timer in runMiniBatchSGD after
>>>>>>> val numExamples = data.count()
>>>>>>>
>>>>>>> See the following. Running the rcv1 dataset now, and will update soon.
>>>>>>>
>>>>>>>     val startTime = System.nanoTime()
>>>>>>>     for (i <- 1 to numIterations) {
>>>>>>>       // Sample a subset (fraction miniBatchFraction) of the total data;
>>>>>>>       // compute and sum up the subgradients on this subset
>>>>>>>       // (this is one map-reduce).
>>>>>>>       val (gradientSum, lossSum) = data.sample(false, miniBatchFraction, 42 + i)
>>>>>>>         .aggregate((BDV.zeros[Double](weights.size), 0.0))(
>>>>>>>           seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, features)) =>
>>>>>>>             val l = gradient.compute(features, label, weights, Vectors.fromBreeze(grad))
>>>>>>>             (grad, loss + l)
>>>>>>>           },
>>>>>>>           combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
>>>>>>>             (grad1 += grad2, loss1 + loss2)
>>>>>>>           })
>>>>>>>
>>>>>>>       /**
>>>>>>>        * NOTE(Xinghao): lossSum is computed using the weights from the
>>>>>>>        * previous iteration, and regVal is the regularization value
>>>>>>>        * computed in the previous iteration as well.
>>>>>>>        */
>>>>>>>       stochasticLossHistory.append(lossSum / miniBatchSize + regVal)
>>>>>>>       val update = updater.compute(
>>>>>>>         weights, Vectors.fromBreeze(gradientSum / miniBatchSize),
>>>>>>>         stepSize, i, regParam)
>>>>>>>       weights = update._1
>>>>>>>       regVal = update._2
>>>>>>>       timeStamp.append(System.nanoTime() - startTime)
>>>>>>>     }
>>>>>>>
>>>>>>> Sincerely,
>>>>>>>
>>>>>>> DB Tsai
>>>>>>>
>>>>>>> On Thu, Apr 24, 2014 at 1:44 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>>>>>>> I don't understand why sparse falls behind dense so much at the very
>>>>>>>> first iteration. I didn't see count() called in
>>>>>>>> https://github.com/dbtsai/spark-lbfgs-benchmark/blob/master/src/main/scala/org/apache/spark/mllib/benchmark/BinaryLogisticRegression.scala.
>>>>>>>> Maybe you have local uncommitted changes.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Xiangrui
>>>>>>>>
>>>>>>>> On Thu, Apr 24, 2014 at 11:26 AM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>>>>> > Hi Xiangrui,
>>>>>>>> >
>>>>>>>> > Yes, I'm using yarn-cluster mode, and I did check that the number of
>>>>>>>> > executors I specified is the same as the number actually running.
>>>>>>>> >
>>>>>>>> > For caching and materialization, I start the timer in the optimizer
>>>>>>>> > after calling count(); as a result, the time for materializing the
>>>>>>>> > cache isn't in the benchmark.
>>>>>>>> >
>>>>>>>> > The difference you saw is actually from dense versus sparse feature
>>>>>>>> > vectors. For LBFGS with dense features, you can see the first
>>>>>>>> > iteration takes the same time; the same is true for GD.
>>>>>>>> >
>>>>>>>> > I'm going to run rcv1.binary, which only has 0.15% non-zero
>>>>>>>> > elements, to verify the hypothesis.
>>>>>>>> >
>>>>>>>> > DB Tsai
>>>>>>>> >
>>>>>>>> > On Thu, Apr 24, 2014 at 1:09 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>>>>>>> >> Hi DB,
>>>>>>>> >>
>>>>>>>> >> I saw you are using yarn-cluster mode for the benchmark. I tested
>>>>>>>> >> yarn-cluster mode and found that YARN does not always give you the
>>>>>>>> >> exact number of executors requested. Just want to confirm that
>>>>>>>> >> you've checked the number of executors.
>>>>>>>> >>
>>>>>>>> >> The second thing to check is that in the benchmark code, after you
>>>>>>>> >> call cache(), you should also call count() to materialize the RDD.
>>>>>>>> >> I saw in the result that the real difference is actually at the
>>>>>>>> >> first step.
>>>>>>>> >> Adding the intercept is not a cheap operation for sparse vectors.
>>>>>>>> >>
>>>>>>>> >> Best,
>>>>>>>> >> Xiangrui
>>>>>>>> >>
>>>>>>>> >> On Thu, Apr 24, 2014 at 12:53 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>>>>>>> >> > I don't think it is easy to make sparse faster than dense with
>>>>>>>> >> > this sparsity and feature dimension. You can try rcv1.binary,
>>>>>>>> >> > which should show the difference easily.
>>>>>>>> >> >
>>>>>>>> >> > David, the breeze operators used here are
>>>>>>>> >> >
>>>>>>>> >> > 1. DenseVector dot SparseVector
>>>>>>>> >> > 2. axpy DenseVector SparseVector
>>>>>>>> >> >
>>>>>>>> >> > However, the SparseVector is passed in as Vector[Double] instead
>>>>>>>> >> > of SparseVector[Double]. It might use the axpy impl for
>>>>>>>> >> > [DenseVector, Vector] and call activeIterator. I didn't check
>>>>>>>> >> > whether you used multimethods on axpy.
>>>>>>>> >> >
>>>>>>>> >> > Best,
>>>>>>>> >> > Xiangrui
>>>>>>>> >> >
>>>>>>>> >> > On Wed, Apr 23, 2014 at 10:35 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>>>>> >> >> The figure showing log-likelihood vs. time can be found here:
>>>>>>>> >> >>
>>>>>>>> >> >> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/fd703303fb1c16ef5714901739154728550becf4/result/a9a11M.pdf
>>>>>>>> >> >>
>>>>>>>> >> >> Let me know if you cannot open it. Thanks.
>>>>>>>> >> >>
>>>>>>>> >> >> DB Tsai
>>>>>>>> >> >>
>>>>>>>> >> >> On Wed, Apr 23, 2014 at 9:34 PM, Shivaram Venkataraman
>>>>>>>> >> >> <shiva...@eecs.berkeley.edu> wrote:
>>>>>>>> >> >>> I don't think the attachment came through on the list.
>>>>>>>> >> >>> Could you upload the results somewhere and link to them?
>>>>>>>> >> >>>
>>>>>>>> >> >>> On Wed, Apr 23, 2014 at 9:32 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>>>>>>>> >> >>>> 123 features per row, and on average 89% are zeros.
>>>>>>>> >> >>>>
>>>>>>>> >> >>>> On Apr 23, 2014 9:31 PM, "Evan Sparks" <evan.spa...@gmail.com> wrote:
>>>>>>>> >> >>>> > What is the number of non-zeros per row (and the number of
>>>>>>>> >> >>>> > features) in the sparse case? We've hit some issues with
>>>>>>>> >> >>>> > breeze sparse support in the past, but for sufficiently
>>>>>>>> >> >>>> > sparse data it's still pretty good.
>>>>>>>> >> >>>> >
>>>>>>>> >> >>>> > > On Apr 23, 2014, at 9:21 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>>>>> >> >>>> > >
>>>>>>>> >> >>>> > > Hi all,
>>>>>>>> >> >>>> > >
>>>>>>>> >> >>>> > > I'm benchmarking logistic regression in MLlib using the
>>>>>>>> >> >>>> > > newly added optimizers LBFGS and GD. I'm using the same
>>>>>>>> >> >>>> > > dataset and the same methodology as in this paper,
>>>>>>>> >> >>>> > > http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf
>>>>>>>> >> >>>> > >
>>>>>>>> >> >>>> > > I want to know how Spark scales while adding workers, and
>>>>>>>> >> >>>> > > how the optimizers and input format (sparse or dense)
>>>>>>>> >> >>>> > > impact performance.
>>>>>>>> >> >>>> > >
>>>>>>>> >> >>>> > > The benchmark code can be found here:
>>>>>>>> >> >>>> > > https://github.com/dbtsai/spark-lbfgs-benchmark
>>>>>>>> >> >>>> > >
>>>>>>>> >> >>>> > > The first dataset I benchmarked is a9a, which is only
>>>>>>>> >> >>>> > > 2.2MB. I duplicated the dataset to make it 762MB with 11M
>>>>>>>> >> >>>> > > rows. This dataset has 123 features, and 11% of the data
>>>>>>>> >> >>>> > > are non-zero elements.
>>>>>>>> >> >>>> > >
>>>>>>>> >> >>>> > > In this benchmark, all the data is cached in memory.
>>>>>>>> >> >>>> > >
>>>>>>>> >> >>>> > > As we expect, LBFGS converges faster than GD, and at some
>>>>>>>> >> >>>> > > point, no matter how hard we push GD, it converges slower
>>>>>>>> >> >>>> > > and slower.
>>>>>>>> >> >>>> > >
>>>>>>>> >> >>>> > > However, it's surprising that the sparse format runs
>>>>>>>> >> >>>> > > slower than the dense format. I did see that the sparse
>>>>>>>> >> >>>> > > format takes a significantly smaller amount of memory when
>>>>>>>> >> >>>> > > caching the RDD, but sparse is 40% slower than dense. I
>>>>>>>> >> >>>> > > think sparse should be fast: when we compute x dot w,
>>>>>>>> >> >>>> > > since x is sparse, we can do it faster. I wonder if there
>>>>>>>> >> >>>> > > is anything I'm doing wrong.
>>>>>>>> >> >>>> > >
>>>>>>>> >> >>>> > > The attachment is the benchmark result.
>>>>>>>> >> >>>> > >
>>>>>>>> >> >>>> > > Thanks.
>>>>>>>> >> >>>> > >
>>>>>>>> >> >>>> > > Sincerely,
>>>>>>>> >> >>>> > >
>>>>>>>> >> >>>> > > DB Tsai
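P.S. On the "sparse should be fast" point above: the expected win is that a dot product over a sparse vector only touches the stored non-zeros, so its cost is proportional to nnz(x) rather than the full dimension. A plain-Scala sketch of that (the SparseVec type and object name here are made up for illustration, not the Breeze or MLlib implementation; as Xiangrui notes, dispatching through activeIterator when the static type is Vector[Double] can eat this advantage):

```scala
// Plain-Scala sketch (hypothetical SparseVec type, NOT Breeze/MLlib code):
// a sparse dot product that iterates only the stored non-zero entries,
// so its cost is proportional to nnz(x), not to the full dimension.
object SparseDotDemo {
  // sparse vector as parallel arrays: indices of non-zeros and their values
  final case class SparseVec(size: Int, indices: Array[Int], values: Array[Double])

  def dot(x: SparseVec, w: Array[Double]): Double = {
    var s = 0.0
    var i = 0
    while (i < x.indices.length) { // loops nnz times, not x.size times
      s += x.values(i) * w(x.indices(i))
      i += 1
    }
    s
  }

  def main(args: Array[String]): Unit = {
    // x = (0, 2, 0, 4, 0), w = (1, 10, 1, 0.5, 1)
    val x = SparseVec(5, Array(1, 3), Array(2.0, 4.0))
    val w = Array(1.0, 10.0, 1.0, 0.5, 1.0)
    println(dot(x, w)) // prints 22.0 (= 2*10 + 4*0.5)
  }
}
```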