Yeah, that's probably the easiest, though obviously pretty hacky.

I'm surprised that the Hessian approximation isn't worse than it is. (As
in, I'd expect error messages.) It's obviously line searching much more, so
the approximation must be worse. You might be interested in this online
BFGS paper:
http://jmlr.org/proceedings/papers/v2/schraudolph07a/schraudolph07a.pdf

-- David


On Tue, Apr 29, 2014 at 3:30 PM, DB Tsai <dbt...@stanford.edu> wrote:

> I have a quick hack to understand the behavior of SLBFGS
> (stochastic LBFGS): I overwrite the Breeze iterations method to get the
> current LBFGS step and ensure that the objective function stays the same
> during the line search step. David, the following is my code; is there a
> better way to inject into it?
>
> https://github.com/dbtsai/spark/tree/dbtsai-lbfgshack
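>
> For reference, here is a self-contained toy sketch (plain Breeze, no Spark,
> and not the code in the branch above) of the idea: hold one mini-batch fixed
> for the whole of an outer LBFGS step, so every evaluation made by the line
> search sees the same objective, and only draw a new mini-batch between outer
> steps. The least-squares objective, names, and sizes below are made up for
> illustration.
>
> import breeze.linalg.DenseVector
> import breeze.optimize.{DiffFunction, LBFGS}
> import scala.util.Random
>
> object SlbfgsSketch {
>   def main(args: Array[String]): Unit = {
>     val rnd = new Random(42)
>     val n = 2000  // examples
>     val d = 20    // features
>     val xs = IndexedSeq.fill(n)(DenseVector.rand(d))
>     val trueW = DenseVector.rand(d)
>     val ys = xs.map(xi => xi dot trueW)  // synthetic targets
>
>     val batchSize = 200
>     var batch: IndexedSeq[Int] = IndexedSeq.tabulate(batchSize)(identity)
>
>     // Least-squares loss over the *current* mini-batch only. As long as
>     // `batch` is untouched while a line search runs, every evaluation in
>     // that search sees the same objective function.
>     val costFun = new DiffFunction[DenseVector[Double]] {
>       def calculate(w: DenseVector[Double]): (Double, DenseVector[Double]) = {
>         var loss = 0.0
>         val grad = DenseVector.zeros[Double](d)
>         for (i <- batch) {
>           val err = (xs(i) dot w) - ys(i)
>           loss += 0.5 * err * err
>           grad += xs(i) * err
>         }
>         (loss / batchSize, grad / batchSize.toDouble)
>       }
>     }
>
>     val lbfgs = new LBFGS[DenseVector[Double]](50, 7, 1e-6)  // maxIter, m, tolerance
>     var lastIter = -1
>     for (state <- lbfgs.iterations(costFun, DenseVector.zeros[Double](d))) {
>       println(s"outer step ${state.iter}: loss = ${state.value}")
>       if (state.iter != lastIter) {
>         // Resample only between outer steps; the next step's line search
>         // then sees a new, but internally consistent, objective.
>         batch = IndexedSeq.fill(batchSize)(rnd.nextInt(n))
>         lastIter = state.iter
>       }
>     }
>   }
> }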
>
> A couple of findings:
>
> 1) miniBatch (using the RDD sample API) for each iteration is slower than
> full-data training when the full data is cached, probably because sample is
> not efficient in Spark.
>
> 2) Since the line search steps use the same sample of data (the same
> objective function), SLBFGS actually converges well.
>
> 3) For the news20 dataset with a 0.05 miniBatch fraction, it takes 14 SLBFGS
> steps (29 data iterations, 74.5 seconds) to converge to loss < 0.10. For
> LBFGS with full-data training, it takes 9 LBFGS steps (12 data iterations,
> 37.6 seconds) to converge to loss < 0.10.
>
> It seems that as long as the noisy gradients happen in different SLBFGS
> steps, it still works.
>
> (PS: I also tried using a different sample of data within the line search
> step, and it just doesn't work as we expect.)
>
>
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Mon, Apr 28, 2014 at 8:55 AM, David Hall <d...@cs.berkeley.edu> wrote:
>
>> That's right.
>>
>> FWIW, caching should be automatic now, but it might be that the version of
>> Breeze you're using doesn't do that yet.
>>
>> Also, in breeze.util._ there's an implicit that adds a tee method to
>> iterators, and also a last method. Both are useful for things like this.
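>>
>> A rough sketch of how that might look (the exact signatures of tee and last
>> may differ across Breeze versions, and lbfgs/costFun/initialWeights/lossHistory
>> are assumed from your code below):
>>
>> import breeze.util._  // implicit enrichment that adds tee/last to iterators
>>
>> val states = lbfgs.iterations(new CachedDiffFunction(costFun),
>>   initialWeights.toBreeze.toDenseVector)
>> // tee runs a side effect per accepted state; last drains the iterator and
>> // returns the final state.
>> val finalState = states.tee(state => lossHistory.append(state.value)).last
>> val weights = finalState.x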
>>
>> -- David
>>
>>
>> On Sun, Apr 27, 2014 at 11:53 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>
>>> I think I figured it out. Instead of calling minimize and recording the
>>> loss in the DiffFunction, I should do the following.
>>>
>>> val states = lbfgs.iterations(new CachedDiffFunction(costFun),
>>>   initialWeights.toBreeze.toDenseVector)
>>> states.foreach(state => lossHistory.append(state.value))
>>>
>>> All the losses in states should be decreasing now. Am I right?
>>>
>>>
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> -------------------------------------------------------
>>> My Blog: https://www.dbtsai.com
>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>
>>>
>>> On Sun, Apr 27, 2014 at 11:31 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>
>>>> Also, how many rejection failures will terminate the optimization
>>>> process? How is this related to "numberOfImprovementFailures"?
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> Sincerely,
>>>>
>>>> DB Tsai
>>>> -------------------------------------------------------
>>>> My Blog: https://www.dbtsai.com
>>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>>
>>>>
>>>> On Sun, Apr 27, 2014 at 11:28 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>
>>>>> Hi David,
>>>>>
>>>>> I'm recording the loss history in the DiffFunction implementation, and
>>>>> that's why the rejected steps are also recorded in my loss history.
>>>>>
>>>>> Is there any API in Breeze LBFGS to get the history with the rejected
>>>>> steps already excluded? Or should I just call the "iterations" method and
>>>>> check "iteratingShouldStop" instead?
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>> Sincerely,
>>>>>
>>>>> DB Tsai
>>>>> -------------------------------------------------------
>>>>> My Blog: https://www.dbtsai.com
>>>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>>>
>>>>>
>>>>> On Fri, Apr 25, 2014 at 3:10 PM, David Hall <d...@cs.berkeley.edu> wrote:
>>>>>
>>>>>> LBFGS will not take a step that sends the objective value up. It
>>>>>> might try a step that is "too big" and reject it, so if you're just
>>>>>> logging everything that gets tried by LBFGS, you could see that. The
>>>>>> "iterations" method of the minimizer should never return an increasing
>>>>>> objective value. If you're regularizing, are you including the
>>>>>> regularizer in the objective value computation?
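>>>>>>
>>>>>> For example, a minimal sketch (dataLossAndGradient is a hypothetical
>>>>>> helper, not MLlib's actual code) of folding an L2 regularizer into the
>>>>>> DiffFunction so that the values LBFGS sees are the true regularized loss:
>>>>>>
>>>>>> import breeze.linalg.{DenseVector => BDV}
>>>>>> import breeze.optimize.DiffFunction
>>>>>>
>>>>>> val costFun = new DiffFunction[BDV[Double]] {
>>>>>>   def calculate(w: BDV[Double]): (Double, BDV[Double]) = {
>>>>>>     // hypothetical helper: unregularized loss and gradient over the data
>>>>>>     val (dataLoss, dataGrad) = dataLossAndGradient(w)
>>>>>>     val regLoss = 0.5 * regParam * (w dot w)  // L2 penalty in the value
>>>>>>     val regGrad = w * regParam                // and in the gradient
>>>>>>     (dataLoss + regLoss, dataGrad + regGrad)
>>>>>>   }
>>>>>> }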
>>>>>>
>>>>>> GD is almost never worth your time.
>>>>>>
>>>>>> -- David
>>>>>>
>>>>>> On Fri, Apr 25, 2014 at 2:57 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>>>
>>>>>>> Another interesting benchmark.
>>>>>>>
>>>>>>> *News20 dataset - 0.14M rows, 1,355,191 features, 0.034% non-zero
>>>>>>> elements.*
>>>>>>>
>>>>>>> LBFGS converges in 70 seconds, while GD seems not to be progressing.
>>>>>>>
>>>>>>> The dense feature vectors would be too big to fit in memory, so I only
>>>>>>> conducted the sparse benchmark.
>>>>>>>
>>>>>>> I saw that sometimes the loss bumps up, which is weird to me. Since
>>>>>>> the cost function of logistic regression is convex, it should be
>>>>>>> monotonically decreasing. David, any suggestions?
>>>>>>>
>>>>>>> The detailed figure:
>>>>>>>
>>>>>>> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/news20.pdf
>>>>>>>
>>>>>>>
>>>>>>> *Rcv1 dataset - 6.8M rows, 677,399 features, 0.15% non-zero elements.*
>>>>>>>
>>>>>>> LBFGS converges in 25 seconds, while GD also seems not to be
>>>>>>> progressing.
>>>>>>>
>>>>>>> I only conducted the sparse benchmark for the same reason. I also saw
>>>>>>> the loss bump up for an unknown reason.
>>>>>>>
>>>>>>> The detailed figure:
>>>>>>>
>>>>>>> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/rcv1.pdf
>>>>>>>
>>>>>>>
>>>>>>> Sincerely,
>>>>>>>
>>>>>>> DB Tsai
>>>>>>> -------------------------------------------------------
>>>>>>> My Blog: https://www.dbtsai.com
>>>>>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Apr 24, 2014 at 2:36 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>>>>
>>>>>>>> rcv1.binary is too sparse (0.15% non-zero elements), so the dense
>>>>>>>> format will not run because it runs out of memory. But the sparse
>>>>>>>> format runs really well.
>>>>>>>>
>>>>>>>>
>>>>>>>> Sincerely,
>>>>>>>>
>>>>>>>> DB Tsai
>>>>>>>> -------------------------------------------------------
>>>>>>>> My Blog: https://www.dbtsai.com
>>>>>>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Apr 24, 2014 at 1:54 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>>>>>
>>>>>>>>> I'm starting the timer in runMiniBatchSGD right after val numExamples =
>>>>>>>>> data.count()
>>>>>>>>>
>>>>>>>>> See the following. I'm running the rcv1 dataset now and will update soon.
>>>>>>>>>
>>>>>>>>>     val startTime = System.nanoTime()
>>>>>>>>>     for (i <- 1 to numIterations) {
>>>>>>>>>       // Sample a subset (fraction miniBatchFraction) of the total data
>>>>>>>>>       // compute and sum up the subgradients on this subset (this is one map-reduce)
>>>>>>>>>       val (gradientSum, lossSum) = data.sample(false, miniBatchFraction, 42 + i)
>>>>>>>>>         .aggregate((BDV.zeros[Double](weights.size), 0.0))(
>>>>>>>>>           seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, features)) =>
>>>>>>>>>             val l = gradient.compute(features, label, weights, Vectors.fromBreeze(grad))
>>>>>>>>>             (grad, loss + l)
>>>>>>>>>           },
>>>>>>>>>           combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
>>>>>>>>>             (grad1 += grad2, loss1 + loss2)
>>>>>>>>>           })
>>>>>>>>>
>>>>>>>>>       /**
>>>>>>>>>        * NOTE(Xinghao): lossSum is computed using the weights from the previous iteration
>>>>>>>>>        * and regVal is the regularization value computed in the previous iteration as well.
>>>>>>>>>        */
>>>>>>>>>       stochasticLossHistory.append(lossSum / miniBatchSize + regVal)
>>>>>>>>>       val update = updater.compute(
>>>>>>>>>         weights, Vectors.fromBreeze(gradientSum / miniBatchSize), stepSize, i, regParam)
>>>>>>>>>       weights = update._1
>>>>>>>>>       regVal = update._2
>>>>>>>>>       timeStamp.append(System.nanoTime() - startTime)
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Sincerely,
>>>>>>>>>
>>>>>>>>> DB Tsai
>>>>>>>>> -------------------------------------------------------
>>>>>>>>> My Blog: https://www.dbtsai.com
>>>>>>>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Apr 24, 2014 at 1:44 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I don't understand why sparse falls behind dense so much at the
>>>>>>>>>> very first iteration. I didn't see count() being called in
>>>>>>>>>>
>>>>>>>>>> https://github.com/dbtsai/spark-lbfgs-benchmark/blob/master/src/main/scala/org/apache/spark/mllib/benchmark/BinaryLogisticRegression.scala
>>>>>>>>>> Maybe you have local uncommitted changes.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Xiangrui
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 24, 2014 at 11:26 AM, DB Tsai <dbt...@stanford.edu>
>>>>>>>>>> wrote:
>>>>>>>>>> > Hi Xiangrui,
>>>>>>>>>> >
>>>>>>>>>> > Yes, I'm using yarn-cluster mode, and I did check that the # of
>>>>>>>>>> > executors I specified matches the number of executors actually
>>>>>>>>>> > running.
>>>>>>>>>> >
>>>>>>>>>> > For caching and materialization, I start the timer in the optimizer
>>>>>>>>>> > after calling count(); as a result, the time to materialize the
>>>>>>>>>> > cache isn't included in the benchmark.
>>>>>>>>>> >
>>>>>>>>>> > The difference you saw actually comes from dense versus sparse
>>>>>>>>>> > feature vectors. For LBFGS with dense features, you can see that
>>>>>>>>>> > the first iteration takes the same time; the same is true for GD.
>>>>>>>>>> >
>>>>>>>>>> > I'm going to run rcv1.binary, which has only 0.15% non-zero
>>>>>>>>>> > elements, to verify the hypothesis.
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>> > Sincerely,
>>>>>>>>>> >
>>>>>>>>>> > DB Tsai
>>>>>>>>>> > -------------------------------------------------------
>>>>>>>>>> > My Blog: https://www.dbtsai.com
>>>>>>>>>> > LinkedIn: https://www.linkedin.com/in/dbtsai
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>> > On Thu, Apr 24, 2014 at 1:09 AM, Xiangrui Meng <
>>>>>>>>>> men...@gmail.com> wrote:
>>>>>>>>>> >>
>>>>>>>>>> >> Hi DB,
>>>>>>>>>> >>
>>>>>>>>>> >> I saw you are using yarn-cluster mode for the benchmark. I tested
>>>>>>>>>> >> the yarn-cluster mode and found that YARN does not always give you
>>>>>>>>>> >> the exact number of executors requested. Just want to confirm that
>>>>>>>>>> >> you've checked the number of executors.
>>>>>>>>>> >>
>>>>>>>>>> >> The second thing to check is that in the benchmark code, after you
>>>>>>>>>> >> call cache, you should also call count() to materialize the RDD. I
>>>>>>>>>> >> saw in the results that the real difference is actually at the
>>>>>>>>>> >> first step. Adding an intercept is not a cheap operation for sparse
>>>>>>>>>> >> vectors.
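>>>>>>>>>> >>
>>>>>>>>>> >> A minimal sketch of that ordering (data is assumed to be the
>>>>>>>>>> >> benchmark's input RDD):
>>>>>>>>>> >>
>>>>>>>>>> >> val trainingData = data.cache()
>>>>>>>>>> >> trainingData.count()  // force evaluation so the cache is populated
>>>>>>>>>> >> val startTime = System.nanoTime()  // time only after materialization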
>>>>>>>>>> >>
>>>>>>>>>> >> Best,
>>>>>>>>>> >> Xiangrui
>>>>>>>>>> >>
>>>>>>>>>> >> On Thu, Apr 24, 2014 at 12:53 AM, Xiangrui Meng <
>>>>>>>>>> men...@gmail.com> wrote:
>>>>>>>>>> >> > I don't think it is easy to make sparse faster than dense with
>>>>>>>>>> >> > this sparsity and feature dimension. You can try rcv1.binary,
>>>>>>>>>> >> > which should show the difference easily.
>>>>>>>>>> >> >
>>>>>>>>>> >> > David, the breeze operators used here are
>>>>>>>>>> >> >
>>>>>>>>>> >> > 1. DenseVector dot SparseVector
>>>>>>>>>> >> > 2. axpy DenseVector SparseVector
>>>>>>>>>> >> >
>>>>>>>>>> >> > However, the SparseVector is passed in as Vector[Double] instead
>>>>>>>>>> >> > of SparseVector[Double]. It might use the axpy impl of
>>>>>>>>>> >> > [DenseVector, Vector] and call activeIterator. I didn't check
>>>>>>>>>> >> > whether you used multimethods on axpy.
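>>>>>>>>>> >> >
>>>>>>>>>> >> > A rough, self-contained illustration (plain Breeze, not the MLlib
>>>>>>>>>> >> > code) of why the static type matters: a value typed as
>>>>>>>>>> >> > SparseVector can dispatch to a kernel that only touches the
>>>>>>>>>> >> > stored (index, value) pairs, while a generic Vector[Double] code
>>>>>>>>>> >> > path roughly amounts to walking activeIterator and indexing into
>>>>>>>>>> >> > the dense side one element at a time.
>>>>>>>>>> >> >
>>>>>>>>>> >> > import breeze.linalg.{DenseVector, SparseVector}
>>>>>>>>>> >> >
>>>>>>>>>> >> > val dv = DenseVector.rand(1000000)
>>>>>>>>>> >> > val sv = SparseVector.zeros[Double](1000000)
>>>>>>>>>> >> > sv(3) = 1.0
>>>>>>>>>> >> > sv(999999) = 2.0
>>>>>>>>>> >> >
>>>>>>>>>> >> > // statically typed SparseVector: the dot can use the sparse kernel
>>>>>>>>>> >> > val specialized = dv dot sv
>>>>>>>>>> >> >
>>>>>>>>>> >> > // roughly what a generic fallback does: iterate the active pairs
>>>>>>>>>> >> > // and index into the dense vector, paying per-element overhead
>>>>>>>>>> >> > var generic = 0.0
>>>>>>>>>> >> > sv.activeIterator.foreach { case (i, v) => generic += dv(i) * v }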
>>>>>>>>>> >> >
>>>>>>>>>> >> > Best,
>>>>>>>>>> >> > Xiangrui
>>>>>>>>>> >> >
>>>>>>>>>> >> > On Wed, Apr 23, 2014 at 10:35 PM, DB Tsai <
>>>>>>>>>> dbt...@stanford.edu> wrote:
>>>>>>>>>> >> >> The figure showing the log-likelihood vs. time can be found here:
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/fd703303fb1c16ef5714901739154728550becf4/result/a9a11M.pdf
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> Let me know if you cannot open it. Thanks.
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> Sincerely,
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> DB Tsai
>>>>>>>>>> >> >> -------------------------------------------------------
>>>>>>>>>> >> >> My Blog: https://www.dbtsai.com
>>>>>>>>>> >> >> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>>>>>>>> >> >>
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> On Wed, Apr 23, 2014 at 9:34 PM, Shivaram Venkataraman
>>>>>>>>>> >> >> <shiva...@eecs.berkeley.edu> wrote:
>>>>>>>>>> >> >>> I don't think the attachment came through on the list. Could
>>>>>>>>>> >> >>> you upload the results somewhere and link to them?
>>>>>>>>>> >> >>>
>>>>>>>>>> >> >>>
>>>>>>>>>> >> >>> On Wed, Apr 23, 2014 at 9:32 PM, DB Tsai <
>>>>>>>>>> dbt...@dbtsai.com> wrote:
>>>>>>>>>> >> >>>>
>>>>>>>>>> >> >>>> 123 features per row, and on average, 89% are zeros.
>>>>>>>>>> >> >>>> On Apr 23, 2014 9:31 PM, "Evan Sparks" <
>>>>>>>>>> evan.spa...@gmail.com> wrote:
>>>>>>>>>> >> >>>>
>>>>>>>>>> >> >>>> > What is the number of non-zeros per row (and number of
>>>>>>>>>> >> >>>> > features) in the sparse case? We've hit some issues with
>>>>>>>>>> >> >>>> > breeze sparse support in the past, but for sufficiently
>>>>>>>>>> >> >>>> > sparse data it's still pretty good.
>>>>>>>>>> >> >>>> >
>>>>>>>>>> >> >>>> > > On Apr 23, 2014, at 9:21 PM, DB Tsai <
>>>>>>>>>> dbt...@stanford.edu> wrote:
>>>>>>>>>> >> >>>> > >
>>>>>>>>>> >> >>>> > > Hi all,
>>>>>>>>>> >> >>>> > >
>>>>>>>>>> >> >>>> > > I'm benchmarking Logistic Regression in MLlib using the
>>>>>>>>>> >> >>>> > > newly added optimizers LBFGS and GD. I'm using the same
>>>>>>>>>> >> >>>> > > dataset and the same methodology as in this paper:
>>>>>>>>>> >> >>>> > > http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf
>>>>>>>>>> >> >>>> > >
>>>>>>>>>> >> >>>> > > I want to know how Spark scales as workers are added, and
>>>>>>>>>> >> >>>> > > how the optimizers and input format (sparse or dense)
>>>>>>>>>> >> >>>> > > impact performance.
>>>>>>>>>> >> >>>> > >
>>>>>>>>>> >> >>>> > > The benchmark code can be found here:
>>>>>>>>>> >> >>>> > > https://github.com/dbtsai/spark-lbfgs-benchmark
>>>>>>>>>> >> >>>> > >
>>>>>>>>>> >> >>>> > > The first dataset I benchmarked is a9a, which is only
>>>>>>>>>> >> >>>> > > 2.2MB. I duplicated the dataset to make it 762MB with 11M
>>>>>>>>>> >> >>>> > > rows. This dataset has 123 features, and 11% of the data
>>>>>>>>>> >> >>>> > > are non-zero elements.
>>>>>>>>>> >> >>>> > >
>>>>>>>>>> >> >>>> > > In this benchmark, the entire dataset is cached in
>>>>>>>>>> >> >>>> > > memory.
>>>>>>>>>> >> >>>> > >
>>>>>>>>>> >> >>>> > > As we expected, LBFGS converges faster than GD, and at
>>>>>>>>>> >> >>>> > > some point, no matter how we push GD, it converges more
>>>>>>>>>> >> >>>> > > and more slowly.
>>>>>>>>>> >> >>>> > >
>>>>>>>>>> >> >>>> > > However, it's surprising that the sparse format runs
>>>>>>>>>> >> >>>> > > slower than the dense format. I did see that the sparse
>>>>>>>>>> >> >>>> > > format takes a significantly smaller amount of memory when
>>>>>>>>>> >> >>>> > > caching the RDD, but sparse is 40% slower than dense. I
>>>>>>>>>> >> >>>> > > think sparse should be fast: when we compute x * wT, since
>>>>>>>>>> >> >>>> > > x is sparse, we can do it faster. I wonder if there is
>>>>>>>>>> >> >>>> > > anything I'm doing wrong.
>>>>>>>>>> >> >>>> > >
>>>>>>>>>> >> >>>> > > The attachment is the benchmark result.
>>>>>>>>>> >> >>>> > >
>>>>>>>>>> >> >>>> > > Thanks.
>>>>>>>>>> >> >>>> > >
>>>>>>>>>> >> >>>> > > Sincerely,
>>>>>>>>>> >> >>>> > >
>>>>>>>>>> >> >>>> > > DB Tsai
>>>>>>>>>> >> >>>> > >
>>>>>>>>>> -------------------------------------------------------
>>>>>>>>>> >> >>>> > > My Blog: https://www.dbtsai.com
>>>>>>>>>> >> >>>> > > LinkedIn: https://www.linkedin.com/in/dbtsai
>>>>>>>>>> >> >>>> >
>>>>>>>>>> >> >>>
>>>>>>>>>> >> >>>
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
