Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-24 Thread Xiangrui Meng
I don't think it is easy to make sparse faster than dense with this
sparsity and feature dimension. You can try rcv1.binary, which should
show the difference easily.

David, the breeze operators used here are

1. DenseVector dot SparseVector
2. axpy DenseVector SparseVector

However, the SparseVector is passed in as Vector[Double] instead of
SparseVector[Double]. It might use the axpy impl of [DenseVector,
Vector] and call activeIterator. I didn't check whether you used
multimethods on axpy.
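For anyone following along, here is a plain-Scala sketch (no Breeze dependency; the types and helper below are made up for illustration) of why the static type matters: overloads are resolved at compile time, so a SparseVector held as Vector[Double] selects the generic code path unless runtime multimethod dispatch kicks in.

```scala
// Hypothetical stand-ins for Breeze's vector types, just to show dispatch.
sealed trait Vec
final case class Dense(values: Array[Double]) extends Vec
final case class Sparse(indices: Array[Int], values: Array[Double], size: Int) extends Vec

// Specialized overload: chosen only when the argument is *statically* Sparse.
def axpy(a: Double, x: Sparse, y: Array[Double]): String = {
  var i = 0
  while (i < x.indices.length) { y(x.indices(i)) += a * x.values(i); i += 1 }
  "sparse fast path"
}

// Generic overload: what a Vector[Double]-typed argument falls back to,
// analogous to iterating activeIterator in the [DenseVector, Vector] impl.
def axpy(a: Double, x: Vec, y: Array[Double]): String = {
  x match {
    case Sparse(idx, vs, _) =>
      var i = 0
      while (i < idx.length) { y(idx(i)) += a * vs(i); i += 1 }
    case Dense(vs) =>
      var i = 0
      while (i < vs.length) { y(i) += a * vs(i); i += 1 }
  }
  "generic Vec path"
}

val s = Sparse(Array(1), Array(5.0), 3)
val upcast: Vec = s                      // static type Vec, like Vector[Double]
val y1 = new Array[Double](3)
val y2 = new Array[Double](3)
val direct = axpy(2.0, s, y1)            // specialized overload chosen
val viaBase = axpy(2.0, upcast, y2)      // generic overload chosen statically
```

Both calls compute the same result; the point is only which overload the compiler picks for the upcast argument.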

Best,
Xiangrui

On Wed, Apr 23, 2014 at 10:35 PM, DB Tsai dbt...@stanford.edu wrote:
 The figure showing the Log-Likelihood vs Time can be found here.

 https://github.com/dbtsai/spark-lbfgs-benchmark/raw/fd703303fb1c16ef5714901739154728550becf4/result/a9a11M.pdf

 Let me know if you cannot open it. Thanks.

 Sincerely,

 DB Tsai
 ---
 My Blog: https://www.dbtsai.com
 LinkedIn: https://www.linkedin.com/in/dbtsai


 On Wed, Apr 23, 2014 at 9:34 PM, Shivaram Venkataraman
 shiva...@eecs.berkeley.edu wrote:
 I don't think the attachment came through in the list. Could you upload the
 results somewhere and link to them ?


 On Wed, Apr 23, 2014 at 9:32 PM, DB Tsai dbt...@dbtsai.com wrote:

  123 features per row, and on average, 89% are zeros.
 On Apr 23, 2014 9:31 PM, Evan Sparks evan.spa...@gmail.com wrote:

  What is the number of non zeroes per row (and number of features) in the
  sparse case? We've hit some issues with breeze sparse support in the
  past
  but for sufficiently sparse data it's still pretty good.
 
   On Apr 23, 2014, at 9:21 PM, DB Tsai dbt...@stanford.edu wrote:
  
   Hi all,
  
    I'm benchmarking Logistic Regression in MLlib using the newly added
   LBFGS optimizer and GD. I'm using the same dataset and the same methodology
   as in this paper: http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf
  
    I want to know how Spark scales as workers are added, and how the
   optimizers and input format (sparse or dense) impact performance.
  
   The benchmark code can be found here,
  https://github.com/dbtsai/spark-lbfgs-benchmark
  
    The first dataset I benchmarked is a9a, which is only 2.2MB. I
   duplicated the dataset to 762MB so it has 11M rows. This dataset has
   123 features, and 11% of the entries are non-zero.
  
    In this benchmark, the entire dataset is cached in memory.
  
    As we expect, LBFGS converges faster than GD, and beyond some point, no
   matter how we push GD, it converges more and more slowly.
  
    However, it's surprising that the sparse format runs slower than the
   dense format. I did see that the sparse format takes significantly less
   memory when caching the RDD, but sparse is 40% slower than dense. I would
   expect sparse to be faster: when we compute x^T w, since x is sparse, we
   can do it faster. I wonder if there is anything I'm doing wrong.
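A minimal sketch of that intuition (plain Scala; the helper names are hypothetical): a sparse dot product does O(nnz) work instead of O(n), but each stored entry pays an extra index lookup, which is less cache-friendly and can eat much of the savings at moderate densities like 11%.

```scala
// Dense dot: touches every coordinate.
def denseDot(x: Array[Double], w: Array[Double]): Double = {
  var s = 0.0; var i = 0
  while (i < x.length) { s += x(i) * w(i); i += 1 }
  s
}

// Sparse dot: touches only the nnz entries, but each costs an indirect
// lookup w(indices(i)) instead of a sequential read.
def sparseDot(indices: Array[Int], values: Array[Double], w: Array[Double]): Double = {
  var s = 0.0; var i = 0
  while (i < indices.length) { s += values(i) * w(indices(i)); i += 1 }
  s
}

val w = Array(0.5, 1.0, 2.0, 4.0)
val xDense = Array(0.0, 3.0, 0.0, 1.0)
val (xIdx, xVal) = (Array(1, 3), Array(3.0, 1.0)) // same vector, sparse form
```

Both forms compute the same dot product; the trade-off is arithmetic saved versus per-entry indexing overhead.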
  
   The attachment is the benchmark result.
  
   Thanks.
  
   Sincerely,
  
   DB Tsai
   ---
   My Blog: https://www.dbtsai.com
   LinkedIn: https://www.linkedin.com/in/dbtsai
 




Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-24 Thread Xiangrui Meng
Hi DB,

I saw you are using yarn-cluster mode for the benchmark. I tested the
yarn-cluster mode and found that YARN does not always give you the
exact number of executors requested. Just want to confirm that you've
checked the number of executors.

The second thing to check is that in the benchmark code, after you
call cache(), you should also call count() to materialize the RDD. I saw
in the result that the real difference is actually at the first step.
Adding an intercept is not a cheap operation for sparse vectors.
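To make the intercept remark concrete, here is a sketch (plain Scala; appendBias is a hypothetical stand-in, not MLlib's actual helper) of what appending an intercept column to one sparse row entails: both the index and value arrays must be reallocated and copied, per row.

```scala
// Appending a bias/intercept term of 1.0 as a new last dimension requires
// copying both backing arrays of the sparse row.
def appendBias(indices: Array[Int], values: Array[Double], size: Int): (Array[Int], Array[Double], Int) = {
  val newIndices = java.util.Arrays.copyOf(indices, indices.length + 1)
  val newValues  = java.util.Arrays.copyOf(values, values.length + 1)
  newIndices(indices.length) = size // bias occupies a new last dimension
  newValues(values.length)   = 1.0
  (newIndices, newValues, size + 1)
}

val (idx, vals, n) = appendBias(Array(0, 2), Array(1.5, -2.0), 4)
```

A dense row pays one array copy for the same operation; a sparse row pays two plus the index bookkeeping.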

Best,
Xiangrui


Re: Fw: Is there any way to make a quick test on some pre-commit code?

2014-04-24 Thread Prashant Sharma
Not sure, but I use sbt/sbt ~compile instead of package. Is there any reason
we use package instead of compile (which is slightly faster, of course)?


Prashant Sharma


On Thu, Apr 24, 2014 at 1:32 PM, Patrick Wendell pwend...@gmail.com wrote:

 This is already on the wiki:

 https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools



 On Wed, Apr 23, 2014 at 6:52 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

  I was just asked the same question by others.
 
  I think Reynold gave a pretty helpful tip on this.
 
  Shall we put this on the Contributing to Spark wiki?
 
  --
  Nan Zhu
 
 
  Forwarded message:
 
   From: Reynold Xin r...@databricks.com
   Reply To: d...@spark.incubator.apache.org
   To: d...@spark.incubator.apache.org d...@spark.incubator.apache.org
   Date: Thursday, February 6, 2014 at 7:50:57 PM
   Subject: Re: Is there any way to make a quick test on some pre-commit
  code?
  
   You can do
  
   sbt/sbt assemble-deps
  
  
   and then just run
  
   sbt/sbt package
  
   each time.
  
  
   You can even do
  
   sbt/sbt ~package
  
   for automatic incremental compilation.
  
  
  
   On Thu, Feb 6, 2014 at 4:46 PM, Nan Zhu zhunanmcg...@gmail.com wrote:
  
Hi, all
   
 Is it always necessary to run sbt assembly when you want to test some
   code?
    
 Sometimes you just repeatedly change one or two lines for a failing
   test case, and it is really time-consuming to run sbt assembly every time.
    
 Any faster way?
   
Best,
   
--
Nan Zhu
   
  
  
  
  
 
 
 



Problem creating objects through reflection

2014-04-24 Thread Piotr Kołaczkowski
Hi,

I'm working on Cassandra-Spark integration, and I hit a pretty severe
problem. One of the provided features is mapping Cassandra rows into
objects of user-defined classes, e.g. like this:

class MyRow(val key: String, val data: Int)
sc.cassandraTable("keyspace", "table").select("key", "data").as[MyRow]  //
returns CassandraRDD[MyRow]

In this example CassandraRDD creates MyRow instances by reflection, i.e.
matches selected fields from Cassandra table and passes them to the
constructor.

Unfortunately, this does not work in the Spark REPL.
It turns out any class declared in the REPL is an inner class, and to be
successfully created, it needs a reference to the outer object, even though
it doesn't really use anything from the outer context.

scala> class SomeClass
defined class SomeClass

scala> classOf[SomeClass].getConstructors()(0)
res11: java.lang.reflect.Constructor[_] = public
$iwC$$iwC$SomeClass($iwC$$iwC)

I tried passing null as a temporary workaround, and it also doesn't work:
I get an NPE.
How can I get a reference to the current outer object representing the
context of the current line?
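For reference, the same shape of problem can be reproduced outside the REPL with an ordinary inner class (a sketch of the symptom, not a fix): the compiler adds the outer instance as a hidden first constructor parameter, which is why reflective construction needs a real outer object.

```scala
class Outer {
  class Inner(val x: Int)
}

val outer = new Outer
val probe = new outer.Inner(0)
val ctor = probe.getClass.getConstructors()(0)

// The outer reference is the first (hidden) constructor parameter.
val paramNames = ctor.getParameterTypes.map(_.getSimpleName).toList

// Reflective construction works with a real outer instance...
val inner = ctor.newInstance(outer, Int.box(42)).asInstanceOf[Outer#Inner]

// ...while null for the outer reference fails, much like in the REPL.
val nullFails =
  try { ctor.newInstance(null, Int.box(1)); false }
  catch { case _: Throwable => true }
```

For REPL-defined classes the outer object is the wrapper instance for that line ($iwC$$iwC above), which user code has no straightforward handle on.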

Also, the plain non-Spark Scala REPL doesn't exhibit this behaviour:
classes declared in the REPL are proper top-level classes, not inner ones.
Why?

Thanks,
Piotr







-- 
Piotr Kolaczkowski, Lead Software Engineer
pkola...@datastax.com

777 Mariners Island Blvd., Suite 510
San Mateo, CA 94404


Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-24 Thread DB Tsai
Hi Xiangrui,

Yes, I'm using yarn-cluster mode, and I did check that the number of
executors I specified matches the number actually running.

For caching and materialization, I start the timer in the optimizer after
calling count(); as a result, the time for materialization during caching
isn't included in the benchmark.

The difference you saw is actually between dense and sparse feature
vectors. For LBFGS and GD with dense features, you can see the first
iteration takes the same time; the same is true for GD.

I'm going to run rcv1.binary, which has only 0.15% non-zero elements, to
verify the hypothesis.


Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai




Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-24 Thread Xiangrui Meng
I don't understand why sparse falls behind dense so much at the very
first iteration. I didn't see count() called in
https://github.com/dbtsai/spark-lbfgs-benchmark/blob/master/src/main/scala/org/apache/spark/mllib/benchmark/BinaryLogisticRegression.scala
Maybe you have local uncommitted changes.

Best,
Xiangrui


Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-24 Thread DB Tsai
I start the timer in runMiniBatchSGD after val numExamples = data.count().

See the following. I'm running the rcv1 dataset now and will update soon.

val startTime = System.nanoTime()
for (i <- 1 to numIterations) {
  // Sample a subset (fraction miniBatchFraction) of the total data
  // compute and sum up the subgradients on this subset (this is one map-reduce)
  val (gradientSum, lossSum) = data.sample(false, miniBatchFraction, 42 + i)
    .aggregate((BDV.zeros[Double](weights.size), 0.0))(
      seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, features)) =>
        val l = gradient.compute(features, label, weights,
          Vectors.fromBreeze(grad))
        (grad, loss + l)
      },
      combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) =>
        (grad1 += grad2, loss1 + loss2)
      })

  /**
   * NOTE(Xinghao): lossSum is computed using the weights from the
   * previous iteration and regVal is the regularization value computed
   * in the previous iteration as well.
   */
  stochasticLossHistory.append(lossSum / miniBatchSize + regVal)
  val update = updater.compute(
    weights, Vectors.fromBreeze(gradientSum / miniBatchSize), stepSize,
    i, regParam)
  weights = update._1
  regVal = update._2
  timeStamp.append(System.nanoTime() - startTime)
}






Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai



Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-24 Thread DB Tsai
rcv1.binary is too sparse (0.15% non-zero elements), so the dense format
will not run due to out-of-memory errors, but the sparse format runs really well.
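A back-of-envelope check makes the OOM unsurprising (the dimensions below are my assumptions, not measurements from this thread; rcv1.binary is commonly cited as roughly 47k features and ~680k rows in its larger split): dense storage costs n*d*8 bytes regardless of sparsity.

```scala
// Assumed sizes for rcv1.binary's larger split.
val d = 47236L           // feature dimension (assumption)
val n = 677399L          // number of rows (assumption)
val density = 0.0015     // 0.15% non-zero elements, as stated above

// Dense: 8 bytes per double for every coordinate of every row.
val denseGB = n * d * 8 / math.pow(1024, 3)

// Sparse: roughly 12 bytes per stored entry (8-byte value + 4-byte index).
val sparseMB = n * d * density * 12 / math.pow(1024, 2)
```

Under these assumptions the dense representation needs on the order of hundreds of gigabytes, while the sparse one fits in well under a gigabyte, which matches dense failing and sparse running comfortably.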


Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai



Re: Problem creating objects through reflection

2014-04-24 Thread Michael Armbrust
The Spark REPL is slightly modified from the normal Scala REPL to prevent
work from being done twice when closures are deserialized on the workers.
I'm not sure exactly why this causes your problem, but it's probably worth
filing a JIRA about it.

Here is another issue with classes defined in the REPL. Not sure if it is
related, but I'd be curious whether the workaround helps you:
https://issues.apache.org/jira/browse/SPARK-1199

Michael

