Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result
I don't think it is easy to make sparse faster than dense with this sparsity and feature dimension. You can try rcv1.binary, which should show the difference easily.

David, the breeze operators used here are:
1. DenseVector dot SparseVector
2. axpy(DenseVector, SparseVector)

However, the SparseVector is passed in as Vector[Double] instead of SparseVector[Double]. It might use the axpy implementation for [DenseVector, Vector] and call activeIterator. I didn't check whether you used multimethods on axpy.

Best,
Xiangrui

On Wed, Apr 23, 2014 at 10:35 PM, DB Tsai dbt...@stanford.edu wrote:

The figure showing the log-likelihood vs. time can be found here:
https://github.com/dbtsai/spark-lbfgs-benchmark/raw/fd703303fb1c16ef5714901739154728550becf4/result/a9a11M.pdf
Let me know if you cannot open it. Thanks.

Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

On Wed, Apr 23, 2014 at 9:34 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:

I don't think the attachment came through on the list. Could you upload the results somewhere and link to them?

On Wed, Apr 23, 2014 at 9:32 PM, DB Tsai dbt...@dbtsai.com wrote:

123 features per row, and on average, 89% are zeros.

On Apr 23, 2014 9:31 PM, Evan Sparks evan.spa...@gmail.com wrote:

What is the number of non-zeros per row (and the number of features) in the sparse case? We've hit some issues with breeze sparse support in the past, but for sufficiently sparse data it's still pretty good.

On Apr 23, 2014, at 9:21 PM, DB Tsai dbt...@stanford.edu wrote:

Hi all,

I'm benchmarking logistic regression in MLlib using the newly added optimizers, LBFGS and GD. I'm using the same dataset and the same methodology as in this paper: http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf

I want to know how Spark scales while adding workers, and how the optimizers and the input format (sparse or dense) impact performance.

The benchmark code can be found here: https://github.com/dbtsai/spark-lbfgs-benchmark

The first dataset I benchmarked is a9a, which is only 2.2MB. I duplicated the dataset to make it 762MB with 11M rows. This dataset has 123 features, and 11% of the entries are non-zero. In this benchmark, the whole dataset is cached in memory.

As we expected, LBFGS converges faster than GD, and at some point, no matter how hard we push GD, it converges more and more slowly.

However, it's surprising that the sparse format runs slower than the dense format. I did see that the sparse format takes a significantly smaller amount of memory when caching the RDD, but sparse is 40% slower than dense. I think sparse should be fast: when we compute x · wᵀ, since x is sparse, we can do it faster. I wonder if there is anything I'm doing wrong.

The attachment is the benchmark result. Thanks.

Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
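To illustrate Xiangrui's point about why the static type of the argument matters, here is a hypothetical sketch in plain Scala (not Breeze's actual code): with a known sparse type, the dot product only touches the non-zero entries, while a generic fallback visits every coordinate.

```scala
// Hypothetical sketch of a dense-dot-sparse product; not Breeze's implementation.
case class Sparse(indices: Array[Int], values: Array[Double], size: Int)

// Specialized path: only the non-zero entries are visited, O(nnz).
def dotDenseSparse(dense: Array[Double], sv: Sparse): Double = {
  var sum = 0.0
  var k = 0
  while (k < sv.indices.length) {
    sum += dense(sv.indices(k)) * sv.values(k)
    k += 1
  }
  sum
}

// Generic fallback: visits every coordinate, O(n) regardless of sparsity.
def dotDenseGeneric(dense: Array[Double], other: Int => Double): Double = {
  var sum = 0.0
  var i = 0
  while (i < dense.length) { sum += dense(i) * other(i); i += 1 }
  sum
}

val w = Array(1.0, 2.0, 3.0, 4.0)
val x = Sparse(Array(1, 3), Array(10.0, 5.0), 4)
println(dotDenseSparse(w, x)) // 40.0
```

If the dispatch resolves the second operand as a generic Vector[Double], execution may take something like the generic path (or an activeIterator-based one) instead of the tight sparse loop, which could explain a slowdown even on sparse data.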
Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result
Hi DB,

I saw you are using yarn-cluster mode for the benchmark. I tested yarn-cluster mode and found that YARN does not always give you the exact number of executors requested, so I just want to confirm that you've checked the number of executors.

The second thing to check is that in the benchmark code, after you call cache, you should also call count() to materialize the RDD.

I saw in the result that the real difference is actually at the first step. Adding an intercept is not a cheap operation for sparse vectors.

Best,
Xiangrui
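To make the cache/count point concrete: cache() is lazy and only marks the RDD for caching, so without an explicit action the first timed iteration also pays the cost of reading and caching the data. A minimal sketch of the pattern Xiangrui describes (assuming an existing SparkContext sc and a parse function; this is not the actual benchmark code):

```scala
// Sketch only: requires a running Spark application with SparkContext `sc`
// and some `parse: String => LabeledPoint` defined elsewhere.
val data = sc.textFile("hdfs:///path/to/a9a").map(parse).cache()
data.count() // action: forces the RDD to be computed and cached

val start = System.nanoTime() // start timing only after materialization
// ... run the optimizer iterations here ...
```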
Re: Fw: Is there any way to make a quick test on some pre-commit code?
Not sure, but I use sbt/sbt ~compile instead of package. Any reason we use package instead of compile (which is slightly faster, of course)?

Prashant Sharma

On Thu, Apr 24, 2014 at 1:32 PM, Patrick Wendell pwend...@gmail.com wrote:

This is already on the wiki: https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools

On Wed, Apr 23, 2014 at 6:52 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

I was just asked the same question by others. I think Reynold gave a pretty helpful tip on this; shall we put it on the Contribute-to-Spark wiki?

--
Nan Zhu

Forwarded message:
From: Reynold Xin r...@databricks.com
Date: Thursday, February 6, 2014 at 7:50:57 PM
Subject: Re: Is there any way to make a quick test on some pre-commit code?

You can do sbt/sbt assemble-deps and then just run sbt/sbt package each time. You can even do sbt/sbt ~package for automatic incremental compilation.

On Thu, Feb 6, 2014 at 4:46 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

Hi all,

Is it always necessary to run sbt assembly when you want to test some code? Sometimes you just repeatedly change one or two lines for a failing test case, and it is really time-consuming to run sbt assembly every time. Any faster way?

Best,
--
Nan Zhu
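Collecting the tips from this thread in one place, as a sketch of the fast development loop (task names as used in Spark around this era; they may differ in later builds):

```shell
# One-time: build the bulky dependency assembly so it isn't rebuilt per change
sbt/sbt assemble-deps

# After that, rebuild only Spark's own classes on each change
sbt/sbt package

# Or let sbt watch the sources and rebuild incrementally on every save
sbt/sbt ~package

# Compile-only is slightly faster still if you don't need the jars
sbt/sbt ~compile
```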
Problem creating objects through reflection
Hi,

I'm working on Cassandra-Spark integration and I hit a pretty severe problem. One piece of the provided functionality is mapping Cassandra rows onto objects of user-defined classes, e.g. like this:

class MyRow(val key: String, val data: Int)
sc.cassandraTable(keyspace, table).select("key", "data").as[MyRow] // returns CassandraRDD[MyRow]

In this example, CassandraRDD creates MyRow instances by reflection, i.e. it matches the selected fields from the Cassandra table and passes them to the constructor.

Unfortunately this does not work in the Spark REPL. It turns out any class declared in the REPL is an inner class, and to be successfully created it needs a reference to the outer object, even though it doesn't really use anything from the outer context:

scala> class SomeClass
defined class SomeClass
scala> classOf[SomeClass].getConstructors()(0)
res11: java.lang.reflect.Constructor[_] = public $iwC$$iwC$SomeClass($iwC$$iwC)

I tried passing null as a temporary workaround, and that doesn't work either: I get an NPE. How can I get a reference to the current outer object representing the context of the current line?

Also, the plain (non-Spark) Scala REPL doesn't exhibit this behaviour, and classes declared in the REPL are proper top-level classes, not inner ones. Why?

Thanks,
Piotr

--
Piotr Kolaczkowski, Lead Software Engineer
pkola...@datastax.com
777 Mariners Island Blvd., Suite 510
San Mateo, CA 94404
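Piotr's observation can be reproduced with an ordinary nested class: scalac adds a hidden first constructor parameter holding the outer instance, which is what the $iwC$$iwC wrapper classes demand. A minimal standalone sketch (not the REPL's actual wrapper code):

```scala
// Minimal reproduction of the hidden outer-instance constructor parameter.
class Outer {
  class Inner(val x: Int)
}

val ctor = classOf[Outer#Inner].getConstructors()(0)
// The JVM-level constructor takes (Outer, int): the outer reference comes first.
println(ctor.getParameterTypes.map(_.getSimpleName).mkString(", ")) // Outer, int

// Reflection works only when a real outer instance is supplied first:
val outer = new Outer
val inner = ctor.newInstance(outer, Int.box(42)).asInstanceOf[Outer#Inner]
println(inner.x) // 42
```

Passing null for the outer reference fails because the generated constructor null-checks the outer pointer, which would match the NPE Piotr sees.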
Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result
Hi Xiangrui,

Yes, I'm using yarn-cluster mode, and I did check that the number of executors I specified is the same as the number actually running.

For caching and materialization, I have the timer in the optimizer start after calling count(); as a result, the time to materialize the cache isn't included in the benchmark. The difference you saw is actually between the dense and the sparse feature vectors. For LBFGS and GD with dense features, you can see the first iteration takes the same time; it's true for GD as well.

I'm going to run rcv1.binary, which only has 0.15% non-zero elements, to verify the hypothesis.

Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result
I don't understand why sparse falls behind dense so much at the very first iteration. I didn't see count() being called in
https://github.com/dbtsai/spark-lbfgs-benchmark/blob/master/src/main/scala/org/apache/spark/mllib/benchmark/BinaryLogisticRegression.scala
Maybe you have local uncommitted changes.

Best,
Xiangrui
Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result
I'm starting the timer in runMiniBatchSGD after val numExamples = data.count(); see the following. Running the rcv1 dataset now, and will update soon.

    val startTime = System.nanoTime()
    for (i <- 1 to numIterations) {
      // Sample a subset (fraction miniBatchFraction) of the total data;
      // compute and sum up the subgradients on this subset (this is one map-reduce).
      val (gradientSum, lossSum) = data.sample(false, miniBatchFraction, 42 + i)
        .aggregate((BDV.zeros[Double](weights.size), 0.0))(
          seqOp = (c, v) => (c, v) match {
            case ((grad, loss), (label, features)) =>
              val l = gradient.compute(features, label, weights, Vectors.fromBreeze(grad))
              (grad, loss + l)
          },
          combOp = (c1, c2) => (c1, c2) match {
            case ((grad1, loss1), (grad2, loss2)) =>
              (grad1 += grad2, loss1 + loss2)
          })

      /**
       * NOTE(Xinghao): lossSum is computed using the weights from the previous iteration
       * and regVal is the regularization value computed in the previous iteration as well.
       */
      stochasticLossHistory.append(lossSum / miniBatchSize + regVal)
      val update = updater.compute(
        weights, Vectors.fromBreeze(gradientSum / miniBatchSize), stepSize, i, regParam)
      weights = update._1
      regVal = update._2
      timeStamp.append(System.nanoTime() - startTime)
    }

Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
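For readers less familiar with aggregate in the snippet above: seqOp folds one record into a per-partition accumulator, and combOp merges the accumulators from different partitions. The same shape on a plain Scala collection, as a simplified sketch computing a (sum, count) pair rather than a (gradient, loss) pair:

```scala
val data = Seq(1.0, 2.0, 3.0, 4.0)

// aggregate(zero)(seqOp, combOp): same contract as RDD.aggregate in the code above.
val (sum, count) = data.aggregate((0.0, 0))(
  { case ((s, c), v) => (s + v, c + 1) },             // seqOp: fold one element in
  { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) } // combOp: merge partial results
)
println((sum, count)) // (10.0,4)
```

On an RDD, seqOp runs inside each partition and combOp merges the per-partition results on the driver, which is why the gradient accumulator in the benchmark is summed with += in combOp.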
Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result
rcv1.binary is too sparse (0.15% non-zero elements), so the dense format will not run due to out-of-memory errors, but the sparse format runs really well.

Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
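A back-of-the-envelope sketch of why dense rcv1.binary blows up while sparse fits comfortably (assuming the LIBSVM rcv1.binary dimensions of roughly 47,236 features and the ~0.15% density quoted in the thread; the exact totals depend on which split is used):

```scala
val numFeatures = 47236
val nnzPerRow = (numFeatures * 0.0015).toInt // ~70 non-zero entries per row

// Dense: one 8-byte double per feature, for every row.
val denseBytesPerRow = numFeatures * 8L      // ~378 KB per row

// Sparse: an 8-byte value plus a 4-byte int index per non-zero entry.
val sparseBytesPerRow = nnzPerRow * (8L + 4L) // well under 1 KB per row

println(denseBytesPerRow / sparseBytesPerRow) // dense is roughly 450x larger
```

Multiplied across hundreds of thousands of rows, the dense representation runs to hundreds of gigabytes, which would explain the OOM, while the sparse representation stays small.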
Re: Problem creating objects through reflection
The Spark REPL is slightly modified from the normal Scala REPL to prevent work from being done twice when closures are deserialized on the workers. I'm not sure exactly why this causes your problem, but it's probably worth filing a JIRA about it.

Here is another issue with classes defined in the REPL. Not sure if it is related, but I'd be curious whether the workaround helps you: https://issues.apache.org/jira/browse/SPARK-1199

Michael