Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-28 Thread DB Tsai
Hi David,

I got most of it working, and the loss is monotonically decreasing when I
collect the history from the iterator of states.

However, in the costFun I need to know which mini-batch iteration it is
currently in: if the optimizer calls costFun several times during a line
search, all of those calls within one iteration should see the same
iteration number. So I pass the lbfgs optimizer into costFun, as in the
following code, and try to read the current iteration from the lbfgs
object. Unfortunately, the current iteration does not seem to be exposed
there.

Any idea how to get this inside costFun? Originally I had a counter inside
costFun that counts the number of evaluations, but that is not what I want
now since it also counts the line-search calls.

val lbfgs = new BreezeLBFGS[BDV[Double]](maxNumIterations, numCorrections,
  convergenceTol)

val costFun = new CostFun(data, gradient, updater, miniBatchFraction, lbfgs,
  miniBatchSize)

val states = lbfgs.iterations(new CachedDiffFunction(costFun),
  initialWeights.toBreeze.toDenseVector)
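One possible workaround, sketched below with toy in-memory data and illustrative names
such as MiniBatchCost (this is not the MLlib CostFun), is to keep the counter outside
costFun and advance it only as the states iterator yields accepted steps, so every
line-search evaluation within one iteration samples with the same seed:

import breeze.linalg.{DenseVector => BDV}
import breeze.optimize.{CachedDiffFunction, DiffFunction, LBFGS}
import scala.util.Random

object MiniBatchSketch {
  def main(args: Array[String]): Unit = {
    // Toy 1-D regression data: y is roughly 2 * x.
    val n = 1000
    val rnd = new Random(0)
    val xs = Array.fill(n)(rnd.nextGaussian())
    val ys = xs.map(x => 2.0 * x + 0.1 * rnd.nextGaussian())

    class MiniBatchCost(miniBatchFraction: Double) extends DiffFunction[BDV[Double]] {
      // Advanced by the driver loop only, never by calculate(), so repeated
      // line-search evaluations within one iteration see the same mini-batch.
      var iteration = 0

      def calculate(w: BDV[Double]): (Double, BDV[Double]) = {
        val sampler = new Random(42 + iteration)  // seed fixed for this iteration
        val idx = (0 until n).filter(_ => sampler.nextDouble() < miniBatchFraction)
        val m = math.max(idx.size, 1)
        var loss = 0.0
        var grad = 0.0
        for (i <- idx) {
          val err = w(0) * xs(i) - ys(i)
          loss += 0.5 * err * err
          grad += err * xs(i)
        }
        (loss / m, BDV(grad / m))
      }
    }

    val cost = new MiniBatchCost(0.3)
    val lbfgs = new LBFGS[BDV[Double]](20, 7, 1e-6)
    val states = lbfgs.iterations(new CachedDiffFunction(cost), BDV.zeros[Double](1))
    states.foreach { state =>
      println(s"iteration ${cost.iteration}: loss = ${state.value}")
      cost.iteration += 1  // the next accepted step gets a fresh mini-batch
    }
  }
}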


Thanks.


Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Mon, Apr 28, 2014 at 8:55 AM, David Hall  wrote:

> That's right.
>
> FWIW, caching should be automatic now, but it might be that the version of
> Breeze you're using doesn't do that yet.
>
> Also, in breeze.util._ there's an implicit that adds a tee method to
> Iterator, and also a last method. Both are useful for things like this.
>
> -- David
>
>
> On Sun, Apr 27, 2014 at 11:53 PM, DB Tsai  wrote:
>
>> I think I figured it out. Instead of calling minimize and recording the loss
>> in the DiffFunction, I should do the following.
>>
>> val states = lbfgs.iterations(new CachedDiffFunction(costFun),
>> initialWeights.toBreeze.toDenseVector)
>> states.foreach(state => lossHistory.append(state.value))
>>
>> All the losses in states should be decreasing now. Am I right?
>>
>>
>>
>> Sincerely,
>>
>> DB Tsai
>> ---
>> My Blog: https://www.dbtsai.com
>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>
>>
>> On Sun, Apr 27, 2014 at 11:31 PM, DB Tsai  wrote:
>>
>>> Also, how many rejected steps will terminate the optimization
>>> process? How is that related to "numberOfImprovementFailures"?
>>>
>>> Thanks.
>>>
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> ---
>>> My Blog: https://www.dbtsai.com
>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>
>>>
>>> On Sun, Apr 27, 2014 at 11:28 PM, DB Tsai  wrote:
>>>
 Hi David,

 I'm recording the loss history in the DiffFunction implementation, and
 that's why the rejected steps also end up in my loss history.

 Is there any API in Breeze LBFGS to get a history that already
 excludes the rejected steps? Or should I just call the "iterations" method
 and check "iteratingShouldStop" instead?

 Thanks.


 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Fri, Apr 25, 2014 at 3:10 PM, David Hall wrote:

> LBFGS will not take a step that sends the objective value up. It might
> try a step that is "too big" and reject it, so if you're just logging
> everything that gets tried by LBFGS, you could see that. The "iterations"
> method of the minimizer should never return an increasing objective value.
> If you're regularizing, are you including the regularizer in the objective
> value computation?
>
> GD is almost never worth your time.
>
> -- David
>
> On Fri, Apr 25, 2014 at 2:57 PM, DB Tsai  wrote:
>
>> Another interesting benchmark.
>>
>> *News20 dataset - 0.14M rows, 1,355,191 features, 0.034% non-zero
>> elements.*
>>
>> LBFGS converges in 70 seconds, while GD does not seem to make progress.
>>
>> The dense feature vectors would be too big to fit in memory, so I only
>> ran the sparse benchmark.
>>
>> I saw that the loss sometimes bumps up, which is weird to me. Since
>> the cost function of logistic regression is convex, it should be
>> monotonically decreasing. David, any suggestions?
>>
>> The detail figure:
>>
>> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/news20.pdf
>>
>>
>> *Rcv1 dataset - 6.8M rows, 677,399 features, 0.15% non-zero elements.*
>>
>> LBFGS converges in 25 seconds, while GD also does not seem to make
>> progress.
>>
>> I only ran the sparse benchmark for the same reason. I also saw the
>> loss bump up for an unknown reason.
>>
>> The detail figure:
>>
>> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/rcv1.pdf
>>
>>

Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-28 Thread David Hall
That's right.

FWIW, caching should be automatic now, but it might be that the version of
Breeze you're using doesn't do that yet.

Also, in breeze.util._ there's an implicit that adds a tee method to
Iterator, and also a last method. Both are useful for things like this.
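A minimal, self-contained sketch of that pattern against a toy objective; the exact
signatures of tee and last are assumptions based on the description above, and the
import simply follows it:

import breeze.linalg.{DenseVector => BDV}
import breeze.optimize.{DiffFunction, LBFGS}
import breeze.util._  // assumed home of the tee/last Iterator enrichments
import scala.collection.mutable.ArrayBuffer

object StatesIteratorSketch {
  def main(args: Array[String]): Unit = {
    // Toy convex objective f(w) = ||w - 3||^2 keeps the example self-contained.
    val costFun = new DiffFunction[BDV[Double]] {
      def calculate(w: BDV[Double]): (Double, BDV[Double]) = {
        val diff = w - 3.0
        (diff.dot(diff), diff * 2.0)
      }
    }

    val lbfgs = new LBFGS[BDV[Double]](100, 7, 1e-9)
    val lossHistory = ArrayBuffer.empty[Double]

    // tee (assumed: runs a side effect per element and returns the iterator) records
    // every accepted step; last (assumed) drains the iterator and keeps the final state.
    val finalState = lbfgs.iterations(costFun, BDV.zeros[Double](5))
      .tee(state => lossHistory += state.value)
      .last

    println(s"final loss = ${finalState.value}, ${lossHistory.size} accepted steps")
  }
}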

-- David

On Sun, Apr 27, 2014 at 11:53 PM, DB Tsai  wrote:

> I think I figured it out. Instead of calling minimize and recording the loss
> in the DiffFunction, I should do the following.
>
> val states = lbfgs.iterations(new CachedDiffFunction(costFun),
> initialWeights.toBreeze.toDenseVector)
> states.foreach(state => lossHistory.append(state.value))
>
> All the losses in states should be decreasing now. Am I right?
>
>
>
> Sincerely,
>
> DB Tsai
> ---
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Sun, Apr 27, 2014 at 11:31 PM, DB Tsai  wrote:
>
>> Also, how many rejected steps will terminate the optimization
>> process? How is that related to "numberOfImprovementFailures"?
>>
>> Thanks.
>>
>>
>> Sincerely,
>>
>> DB Tsai
>> ---
>> My Blog: https://www.dbtsai.com
>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>
>>
>> On Sun, Apr 27, 2014 at 11:28 PM, DB Tsai  wrote:
>>
>>> Hi David,
>>>
>>> I'm recording the loss history in the DiffFunction implementation, and
>>> that's why the rejected steps also end up in my loss history.
>>>
>>> Is there any API in Breeze LBFGS to get a history that already
>>> excludes the rejected steps? Or should I just call the "iterations" method
>>> and check "iteratingShouldStop" instead?
>>>
>>> Thanks.
>>>
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> ---
>>> My Blog: https://www.dbtsai.com
>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>
>>>
>>> On Fri, Apr 25, 2014 at 3:10 PM, David Hall wrote:
>>>
 LBFGS will not take a step that sends the objective value up. It might
 try a step that is "too big" and reject it, so if you're just logging
 everything that gets tried by LBFGS, you could see that. The "iterations"
 method of the minimizer should never return an increasing objective value.
 If you're regularizing, are you including the regularizer in the objective
 value computation?
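
For instance, with an L2 penalty the objective reported back to the optimizer
should include the penalty term, not just the data loss. A minimal sketch; the
0.5 * regParam * ||w||^2 convention here is an assumption, not necessarily the
one MLlib's updater uses:

import breeze.linalg.{DenseVector => BDV}

// Report the regularized objective so the plotted values match what the
// optimizer actually minimizes.
def regularizedLoss(dataLoss: Double, weights: BDV[Double], regParam: Double): Double =
  dataLoss + 0.5 * regParam * weights.dot(weights)  // data loss + 0.5 * regParam * ||w||^2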

 GD is almost never worth your time.

 -- David

 On Fri, Apr 25, 2014 at 2:57 PM, DB Tsai  wrote:

> Another interesting benchmark.
>
> *News20 dataset - 0.14M rows, 1,355,191 features, 0.034% non-zero
> elements.*
>
> LBFGS converges in 70 seconds, while GD does not seem to make progress.
>
> The dense feature vectors would be too big to fit in memory, so I only
> ran the sparse benchmark.
>
> I saw that the loss sometimes bumps up, which is weird to me. Since
> the cost function of logistic regression is convex, it should be
> monotonically decreasing. David, any suggestions?
>
> The detail figure:
>
> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/news20.pdf
>
>
> *Rcv1 dataset - 6.8M rows, 677,399 features, 0.15% non-zero elements.*
>
> LBFGS converges in 25 seconds, while GD also does not seem to make
> progress.
>
> I only ran the sparse benchmark for the same reason. I also saw the loss
> bump up for an unknown reason.
>
> The detail figure:
>
> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/rcv1.pdf
>
>
> Sincerely,
>
> DB Tsai
> ---
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Thu, Apr 24, 2014 at 2:36 PM, DB Tsai  wrote:
>
>> rcv1.binary is too sparse (0.15% non-zero elements), so the dense format
>> will not run because it runs out of memory. But the sparse format runs
>> really well.
>>
>>
>> Sincerely,
>>
>> DB Tsai
>> ---
>> My Blog: https://www.dbtsai.com
>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>
>>
>> On Thu, Apr 24, 2014 at 1:54 PM, DB Tsai  wrote:
>>
>>> I'm starting the timer in runMiniBatchSGD right after val numExamples =
>>> data.count()
>>>
>>> See the following. I'm running the rcv1 dataset now and will update soon.
>>>
>>> val startTime = System.nanoTime()
>>> for (i <- 1 to numIterations) {
>>>   // Sample a subset (fraction miniBatchFraction) of the total data
>>>   // compute and sum up the subgradients on this subset (this is one map-reduce)
>>>   val (gradientSum, lossSum) = data.sample(false, miniBatchFraction, 42 + i)
>>>     .aggregate((BDV.zeros[Double](weights.size), 0.0))(
>>>       seqOp = (c, v) => (c, v) match { c

Re: thoughts on spark_ec2.py?

2014-04-28 Thread Art Peel
Thanks for the info and good luck with 1.0.

Regards,
Art



On Fri, Apr 25, 2014 at 9:48 AM, Andrew Or  wrote:

> Hi Art,
>
> First of all, thanks a lot for your PRs. We are currently in the middle of
> the Spark 1.0 release, so most of us are swamped with the more core
> features. To answer your questions:
>
> 1. Neither. We welcome changes from developers for all components of Spark,
> including the EC2 scripts. Once the release is out we will have more time
> to review the many PRs that we missed along the way.
>
> 2. We prefer to keep the EC2 scripts within Spark, at least for now.
>
> Cheers,
> Andrew
>
> On Friday, April 25, 2014, Art Peel  wrote:
>
> > I've been setting up a Spark cluster on EC2 using the provided
> > ec2/spark_ec2.py script and am very happy I didn't have to write it from
> > scratch. Thanks for providing it.
> >
> > There have been some issues, though, and I have had to make some
> > additions. So far, they are all additions of command-line options. For
> > example, the original script allows access from anywhere to the various
> > ports. I've added an option to specify what net/mask should be allowed to
> > access those ports.
> >
> > I've filed a couple of pull requests, but they are not going anywhere.
> > Given what I've seen of the traffic on this list, I don't feel that a lot
> > of the developers are thinking about EC2 setup. I totally agree that it is
> > not as important as improving the guts of Spark itself; nevertheless, I
> > feel that being able to run Spark on EC2 smartly and easily is valuable.
> >
> > So, I have 2 questions for the committers:
> >
> > 1. Is ec2/spark_ec2.py something the committers
> > a. are not thinking about?
> > b. are planning to replace?
> > c. other
> >
> > 2. Should I just start a new project based on ec2/spark_ec2.py, but
> > without all the other stuff, and make (and share) my changes there?
> >
> > Regards,
> >
> > Art
> >
>


Re: Parsing wikipedia xml data in Spark

2014-04-28 Thread Geoffroy Fouquier


We did it using scala.xml with Spark.

We start by creating an RDD in which each page is stored as a single line:
  - split the XML dump with xml_split
  - process each split with a shell script that removes the "xml_split" tags
    and the siteinfo section, and puts each page on a single line
  - copy the resulting files to HDFS

Then the dataset can be loaded as a text file and processed:

 val rawDataset = sparkContext.textFile(input)
 val allDocuments = rawDataset.map { document =>
   val page = scala.xml.XML.loadString(document)
   val pageTitle = (page \ "title").text
   [...]
 }
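
A slightly fuller sketch along the same lines, building on the rawDataset above.
The title/revision/text element names follow the MediaWiki dump format, and the
one-page-per-line preprocessing is assumed to have already been done:

 case class WikiPage(title: String, text: String)

 val pages = rawDataset.flatMap { line =>
   try {
     val page  = scala.xml.XML.loadString(line)
     val title = (page \ "title").text
     val body  = (page \ "revision" \ "text").text
     Some(WikiPage(title, body))
   } catch {
     // Skip lines that are not well-formed XML instead of failing the job.
     case _: org.xml.sax.SAXParseException => None
   }
 }

 pages.map(_.title).take(10).foreach(println)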

We create a demo using the dataset here: http://wikinsights.org

On 26/04/2014 at 23:20, Ajay Nair wrote:

Is there a way in Spark to parse the Wikipedia XML dump? It seems like the
freebase dump is no longer available. Also, does the Spark shell support the
XML loadFile SAX parser that is present in Scala?

Thanks
AJ



Geoffroy Fouquier
http://eXenSa.com