zhengruifeng opened a new pull request #25178: [SPARK-28421][ML] SparseVector.apply performance optimization
URL: https://github.com/apache/spark/pull/25178

## What changes were proposed in this pull request?

Optimize `SparseVector.apply` by avoiding the internal conversion.

Since the speedup is significant (2.5X ~ 5X) and this method is widely used in ML, I suggest backporting.

| size | nnz | apply (old, ms) | apply2 (new impl, ms) |
|----------|---------|-------|-------|
| 10000000 | 100     | 76243 | 15354 |
| 10000000 | 10000   | 72327 | 19664 |
| 10000000 | 1000000 | 85104 | 33086 |

## How was this patch tested?

Existing tests.

Performance was measured with the following code (here the newly implemented `apply` is named `apply2`):

```scala
import scala.util.Random
import org.apache.spark.ml.linalg._

val size = 10000000
for (nnz <- Seq(100, 10000, 1000000)) {
  val rng = new Random(123)
  val indices = Array.fill(nnz + nnz)(rng.nextInt.abs % size).distinct.sorted.take(nnz)
  val values = Array.fill(nnz)(rng.nextDouble)
  val vec = Vectors.sparse(size, indices, values).toSparse

  val tic1 = System.currentTimeMillis
  (0 until 100).foreach { round =>
    var i = 0
    var sum = 0.0
    while (i < size) { sum += vec(i); i += 1 }
  }
  val toc1 = System.currentTimeMillis

  val tic2 = System.currentTimeMillis
  (0 until 100).foreach { round =>
    var i = 0
    var sum = 0.0
    while (i < size) { sum += vec.apply2(i); i += 1 }
  }
  val toc2 = System.currentTimeMillis

  println((size, nnz, toc1 - tic1, toc2 - tic2))
}
```
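For context, a conversion-free lookup of the kind this PR describes can be sketched as a binary search directly over the sparse vector's sorted `indices` array, returning the matching value or the implicit zero. This is a minimal standalone sketch, not the actual Spark implementation: the object name `SparseApplySketch` and its `apply` signature are hypothetical, and the assumption that the optimized path boils down to a binary search on `indices` is an inference from "avoiding internal conversion", not stated in the PR text.

```scala
import java.util.Arrays

// Hypothetical sketch: a sparse vector is (size, sorted indices, values);
// apply(i) binary-searches the indices array directly instead of first
// converting the vector to another representation.
object SparseApplySketch {
  def apply(indices: Array[Int], values: Array[Double], i: Int): Double = {
    // Arrays.binarySearch returns the position if i is stored,
    // or (-(insertionPoint) - 1) if absent, meaning an implicit zero.
    val j = Arrays.binarySearch(indices, i)
    if (j >= 0) values(j) else 0.0
  }

  def main(args: Array[String]): Unit = {
    val indices = Array(1, 4, 7)
    val values = Array(10.0, 20.0, 30.0)
    println(apply(indices, values, 4)) // stored index -> its value
    println(apply(indices, values, 5)) // absent index -> implicit zero
  }
}
```

Each lookup is O(log nnz), which is consistent with the benchmark above becoming relatively cheaper as nnz shrinks.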
