That's right. FWIW, caching should be automatic now, but it might be that the version of Breeze you're using doesn't do that yet.
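To illustrate the idea (a toy sketch with made-up names, not Breeze's actual CachedDiffFunction): the point of that kind of caching is to remember the last evaluated point, so repeated evaluations at the same point reuse the stored result instead of recomputing the objective.

```scala
// Hypothetical sketch of the caching idea (NOT Breeze's CachedDiffFunction):
// remember the last evaluated point so a repeated evaluation at the same
// point reuses the stored (value, gradient) pair instead of recomputing.
object CachedFnDemo {
  var evals = 0 // counts how many times the underlying function really runs

  // a toy "objective": returns (value, gradient) at x
  def expensive(x: Double): (Double, Double) = { evals += 1; (x * x, 2 * x) }

  class Cached(f: Double => (Double, Double)) extends (Double => (Double, Double)) {
    private var last: Option[(Double, (Double, Double))] = None
    def apply(x: Double): (Double, Double) = last match {
      case Some((lx, res)) if lx == x => res // cache hit: same point as last time
      case _ =>
        val res = f(x)
        last = Some((x, res))
        res
    }
  }

  def main(args: Array[String]): Unit = {
    val g = new Cached(expensive)
    g(3.0); g(3.0); g(4.0) // the repeated evaluation at 3.0 is served from cache
    println(evals) // prints 2: only the two distinct points were computed
  }
}
```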
Also, in breeze.util._ there's an implicit that adds a tee method to iterators, and also a last method. Both are useful for things like this.

-- David

On Sun, Apr 27, 2014 at 11:53 PM, DB Tsai <dbt...@stanford.edu> wrote:
> I think I figured it out. Instead of calling minimize and recording the
> loss in the DiffFunction, I should do the following.
>
>     val states = lbfgs.iterations(new CachedDiffFunction(costFun),
>       initialWeights.toBreeze.toDenseVector)
>     states.foreach(state => lossHistory.append(state.value))
>
> All the losses in states should be decreasing now. Am I right?
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
> On Sun, Apr 27, 2014 at 11:31 PM, DB Tsai <dbt...@stanford.edu> wrote:
>> Also, how many rejected steps will terminate the optimization process?
>> How is that related to "numberOfImprovementFailures"?
>>
>> Thanks.
>>
>> DB Tsai
>>
>> On Sun, Apr 27, 2014 at 11:28 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>> Hi David,
>>>
>>> I'm recording the loss history in the DiffFunction implementation, and
>>> that's why the rejected steps are also recorded in my loss history.
>>>
>>> Is there any API in Breeze LBFGS to get the history that already
>>> excludes the rejected steps? Or should I just call the "iterations"
>>> method and check "iteratingShouldStop" instead?
>>>
>>> Thanks.
>>>
>>> DB Tsai
>>>
>>> On Fri, Apr 25, 2014 at 3:10 PM, David Hall <d...@cs.berkeley.edu> wrote:
>>>> LBFGS will not take a step that sends the objective value up.
>>>> It might try a step that is "too big" and reject it, so if you're just
>>>> logging everything that gets tried by LBFGS, you could see that. The
>>>> "iterations" method of the minimizer should never return an increasing
>>>> objective value. If you're regularizing, are you including the
>>>> regularizer in the objective value computation?
>>>>
>>>> GD is almost never worth your time.
>>>>
>>>> -- David
>>>>
>>>> On Fri, Apr 25, 2014 at 2:57 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>> Another interesting benchmark.
>>>>>
>>>>> *News20 dataset - 0.14M rows, 1,355,191 features, 0.034% non-zero elements.*
>>>>>
>>>>> LBFGS converges in 70 seconds, while GD seems to make no progress.
>>>>>
>>>>> A dense feature vector would be too big to fit in memory, so I only
>>>>> ran the sparse benchmark.
>>>>>
>>>>> I saw the loss bump up sometimes, which is weird to me. Since the cost
>>>>> function of logistic regression is convex, it should be monotonically
>>>>> decreasing. David, any suggestion?
>>>>>
>>>>> The detailed figure:
>>>>> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/news20.pdf
>>>>>
>>>>> *Rcv1 dataset - 6.8M rows, 677,399 features, 0.15% non-zero elements.*
>>>>>
>>>>> LBFGS converges in 25 seconds, while GD again seems to make no progress.
>>>>>
>>>>> I only ran the sparse benchmark for the same reason. I also saw the
>>>>> loss bump up, for unknown reasons.
>>>>>
>>>>> The detailed figure:
>>>>> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/rcv1.pdf
>>>>>
>>>>> Sincerely,
>>>>>
>>>>> DB Tsai
>>>>>
>>>>> On Thu, Apr 24, 2014 at 2:36 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>>> rcv1.binary is too sparse (0.15% non-zero elements), so the dense
>>>>>> format will not run due to running out of memory. But the sparse
>>>>>> format runs really well.
>>>>>>
>>>>>> DB Tsai
>>>>>>
>>>>>> On Thu, Apr 24, 2014 at 1:54 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>>>> I start the timer in runMiniBatchSGD after
>>>>>>> val numExamples = data.count()
>>>>>>>
>>>>>>> See the following. Running the rcv1 dataset now, and will update soon.
>>>>>>>
>>>>>>>     val startTime = System.nanoTime()
>>>>>>>     for (i <- 1 to numIterations) {
>>>>>>>       // Sample a subset (fraction miniBatchFraction) of the total data;
>>>>>>>       // compute and sum up the subgradients on this subset
>>>>>>>       // (this is one map-reduce).
>>>>>>>       val (gradientSum, lossSum) = data.sample(false, miniBatchFraction, 42 + i)
>>>>>>>         .aggregate((BDV.zeros[Double](weights.size), 0.0))(
>>>>>>>           seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, features)) =>
>>>>>>>             val l = gradient.compute(features, label, weights, Vectors.fromBreeze(grad))
>>>>>>>             (grad, loss + l)
>>>>>>>           },
>>>>>>>           combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
>>>>>>>             (grad1 += grad2, loss1 + loss2)
>>>>>>>           })
>>>>>>>
>>>>>>>       /**
>>>>>>>        * NOTE(Xinghao): lossSum is computed using the weights from the
>>>>>>>        * previous iteration, and regVal is the regularization value
>>>>>>>        * computed in the previous iteration as well.
>>>>>>>        */
>>>>>>>       stochasticLossHistory.append(lossSum / miniBatchSize + regVal)
>>>>>>>       val update = updater.compute(
>>>>>>>         weights, Vectors.fromBreeze(gradientSum / miniBatchSize),
>>>>>>>         stepSize, i, regParam)
>>>>>>>       weights = update._1
>>>>>>>       regVal = update._2
>>>>>>>       timeStamp.append(System.nanoTime() - startTime)
>>>>>>>     }
>>>>>>>
>>>>>>> Sincerely,
>>>>>>>
>>>>>>> DB Tsai
>>>>>>>
>>>>>>> On Thu, Apr 24, 2014 at 1:44 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>>>>>>> I don't understand why sparse falls behind dense so much at the very
>>>>>>>> first iteration. I didn't see count() called in
>>>>>>>> https://github.com/dbtsai/spark-lbfgs-benchmark/blob/master/src/main/scala/org/apache/spark/mllib/benchmark/BinaryLogisticRegression.scala.
>>>>>>>> Maybe you have local uncommitted changes.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Xiangrui
>>>>>>>>
>>>>>>>> On Thu, Apr 24, 2014 at 11:26 AM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>>>>> > Hi Xiangrui,
>>>>>>>> >
>>>>>>>> > Yes, I'm using yarn-cluster mode, and I did check that the number of
>>>>>>>> > executors I specified is the same as the number actually running.
>>>>>>>> >
>>>>>>>> > For caching and materialization, I start the timer in the optimizer
>>>>>>>> > after calling count(); as a result, the time for materializing the
>>>>>>>> > cache isn't in the benchmark.
>>>>>>>> >
>>>>>>>> > The difference you saw is actually from dense versus sparse feature
>>>>>>>> > vectors. For LBFGS with dense features, you can see the first
>>>>>>>> > iteration takes the same time; the same is true for GD.
>>>>>>>> >
>>>>>>>> > I'm going to run rcv1.binary, which only has 0.15% non-zero
>>>>>>>> > elements, to verify the hypothesis.
>>>>>>>> >
>>>>>>>> > DB Tsai
>>>>>>>> >
>>>>>>>> > On Thu, Apr 24, 2014 at 1:09 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>>>>>>> >> Hi DB,
>>>>>>>> >>
>>>>>>>> >> I saw you are using yarn-cluster mode for the benchmark. I tested
>>>>>>>> >> yarn-cluster mode and found that YARN does not always give you the
>>>>>>>> >> exact number of executors requested. Just want to confirm that
>>>>>>>> >> you've checked the number of executors.
>>>>>>>> >>
>>>>>>>> >> The second thing to check is that in the benchmark code, after you
>>>>>>>> >> call cache(), you should also call count() to materialize the RDD.
>>>>>>>> >> I saw in the result that the real difference is actually at the
>>>>>>>> >> first step.
>>>>>>>> >> Adding the intercept is not a cheap operation for sparse vectors.
>>>>>>>> >>
>>>>>>>> >> Best,
>>>>>>>> >> Xiangrui
>>>>>>>> >>
>>>>>>>> >> On Thu, Apr 24, 2014 at 12:53 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>>>>>>> >> > I don't think it is easy to make sparse faster than dense with
>>>>>>>> >> > this sparsity and feature dimension. You can try rcv1.binary,
>>>>>>>> >> > which should show the difference easily.
>>>>>>>> >> >
>>>>>>>> >> > David, the breeze operators used here are
>>>>>>>> >> >
>>>>>>>> >> > 1. DenseVector dot SparseVector
>>>>>>>> >> > 2. axpy DenseVector SparseVector
>>>>>>>> >> >
>>>>>>>> >> > However, the SparseVector is passed in as Vector[Double] instead
>>>>>>>> >> > of SparseVector[Double]. It might use the axpy impl for
>>>>>>>> >> > [DenseVector, Vector] and call activeIterator. I didn't check
>>>>>>>> >> > whether you used multimethods on axpy.
>>>>>>>> >> >
>>>>>>>> >> > Best,
>>>>>>>> >> > Xiangrui
>>>>>>>> >> >
>>>>>>>> >> > On Wed, Apr 23, 2014 at 10:35 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>>>>> >> >> The figure showing log-likelihood vs. time can be found here:
>>>>>>>> >> >>
>>>>>>>> >> >> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/fd703303fb1c16ef5714901739154728550becf4/result/a9a11M.pdf
>>>>>>>> >> >>
>>>>>>>> >> >> Let me know if you cannot open it. Thanks.
>>>>>>>> >> >>
>>>>>>>> >> >> DB Tsai
>>>>>>>> >> >>
>>>>>>>> >> >> On Wed, Apr 23, 2014 at 9:34 PM, Shivaram Venkataraman
>>>>>>>> >> >> <shiva...@eecs.berkeley.edu> wrote:
>>>>>>>> >> >>> I don't think the attachment came through on the list.
>>>>>>>> >> >>> Could you upload the results somewhere and link to them?
>>>>>>>> >> >>>
>>>>>>>> >> >>> On Wed, Apr 23, 2014 at 9:32 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>>>>>>>> >> >>>> 123 features per row, and on average 89% are zeros.
>>>>>>>> >> >>>>
>>>>>>>> >> >>>> On Apr 23, 2014 9:31 PM, "Evan Sparks" <evan.spa...@gmail.com> wrote:
>>>>>>>> >> >>>> > What is the number of non-zeros per row (and the number of
>>>>>>>> >> >>>> > features) in the sparse case? We've hit some issues with
>>>>>>>> >> >>>> > breeze sparse support in the past, but for sufficiently
>>>>>>>> >> >>>> > sparse data it's still pretty good.
>>>>>>>> >> >>>> >
>>>>>>>> >> >>>> > > On Apr 23, 2014, at 9:21 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>>>>> >> >>>> > >
>>>>>>>> >> >>>> > > Hi all,
>>>>>>>> >> >>>> > >
>>>>>>>> >> >>>> > > I'm benchmarking logistic regression in MLlib using the
>>>>>>>> >> >>>> > > newly added optimizers LBFGS and GD. I'm using the same
>>>>>>>> >> >>>> > > dataset and the same methodology as in this paper,
>>>>>>>> >> >>>> > > http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf
>>>>>>>> >> >>>> > >
>>>>>>>> >> >>>> > > I want to know how Spark scales while adding workers, and
>>>>>>>> >> >>>> > > how the optimizers and input format (sparse or dense)
>>>>>>>> >> >>>> > > impact performance.
>>>>>>>> >> >>>> > >
>>>>>>>> >> >>>> > > The benchmark code can be found here:
>>>>>>>> >> >>>> > > https://github.com/dbtsai/spark-lbfgs-benchmark
>>>>>>>> >> >>>> > >
>>>>>>>> >> >>>> > > The first dataset I benchmarked is a9a, which is only
>>>>>>>> >> >>>> > > 2.2MB. I duplicated the dataset to make it 762MB with 11M
>>>>>>>> >> >>>> > > rows. This dataset has 123 features, and 11% of the data
>>>>>>>> >> >>>> > > are non-zero elements.
>>>>>>>> >> >>>> > >
>>>>>>>> >> >>>> > > In this benchmark, all the data is cached in memory.
>>>>>>>> >> >>>> > >
>>>>>>>> >> >>>> > > As we expect, LBFGS converges faster than GD, and at some
>>>>>>>> >> >>>> > > point, no matter how hard we push GD, it converges slower
>>>>>>>> >> >>>> > > and slower.
>>>>>>>> >> >>>> > >
>>>>>>>> >> >>>> > > However, it's surprising that the sparse format runs
>>>>>>>> >> >>>> > > slower than the dense format. I did see that the sparse
>>>>>>>> >> >>>> > > format takes a significantly smaller amount of memory when
>>>>>>>> >> >>>> > > caching the RDD, but sparse is 40% slower than dense. I
>>>>>>>> >> >>>> > > think sparse should be fast: when we compute x dot w,
>>>>>>>> >> >>>> > > since x is sparse, we can do it faster. I wonder if there
>>>>>>>> >> >>>> > > is anything I'm doing wrong.
>>>>>>>> >> >>>> > >
>>>>>>>> >> >>>> > > The attachment is the benchmark result.
>>>>>>>> >> >>>> > >
>>>>>>>> >> >>>> > > Thanks.
>>>>>>>> >> >>>> > >
>>>>>>>> >> >>>> > > Sincerely,
>>>>>>>> >> >>>> > >
>>>>>>>> >> >>>> > > DB Tsai
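P.S. On the "sparse should be fast" point above: the expected win is that a dot product over a sparse vector only touches the stored non-zeros, so its cost is proportional to nnz(x) rather than the full dimension. A plain-Scala sketch of that (the SparseVec type and object name here are made up for illustration, not the Breeze or MLlib implementation; as Xiangrui notes, dispatching through activeIterator when the static type is Vector[Double] can eat this advantage):

```scala
// Plain-Scala sketch (hypothetical SparseVec type, NOT Breeze/MLlib code):
// a sparse dot product that iterates only the stored non-zero entries,
// so its cost is proportional to nnz(x), not to the full dimension.
object SparseDotDemo {
  // sparse vector as parallel arrays: indices of non-zeros and their values
  final case class SparseVec(size: Int, indices: Array[Int], values: Array[Double])

  def dot(x: SparseVec, w: Array[Double]): Double = {
    var s = 0.0
    var i = 0
    while (i < x.indices.length) { // loops nnz times, not x.size times
      s += x.values(i) * w(x.indices(i))
      i += 1
    }
    s
  }

  def main(args: Array[String]): Unit = {
    // x = (0, 2, 0, 4, 0), w = (1, 10, 1, 0.5, 1)
    val x = SparseVec(5, Array(1, 3), Array(2.0, 4.0))
    val w = Array(1.0, 10.0, 1.0, 0.5, 1.0)
    println(dot(x, w)) // prints 22.0 (= 2*10 + 4*0.5)
  }
}
```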