zhengruifeng commented on issue #27374: [SPARK-30659][ML][PYSPARK] LogisticRegression blockify input vectors
URL: https://github.com/apache/spark/pull/27374#issuecomment-582898607
 
 
@WeichenXu123 @mengxr @srowen
I just ran a quick test on webspam:
I drew the first 10,000 samples from `webspam_wc_normalized_trigram.svm`; the sampled dataset has numFeatures = 8,289,919.
Its sparsity (percentage of non-zero values) is about 0.4489%.
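For reference, the sparsity figure can be reproduced with a sketch like the one below; the `.svm.10k` path is an assumption, matching the sampled file used in the benchmark further down.

```scala
import org.apache.spark.ml.linalg.Vector

// Assumed path: the 10,000-row sample described above.
val sampled = spark.read.format("libsvm")
  .load("/data1/Datasets/webspam/webspam_wc_normalized_trigram.svm.10k")

// Total number of non-zero values across all rows.
val nnz = sampled.select("features").rdd
  .map(_.getAs[Vector](0).numNonzeros.toLong)
  .sum

// Fraction of non-zero values; should come out to ~0.004489 (0.4489%).
val density = nnz / (10000.0 * 8289919)
```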
   
This PR fails due to OOM during standardization, so I used a patch:
```scala
val vec = features match {
  case dv: DenseVector =>
    // Scale every value in place; features with zero std are zeroed out.
    var i = 0
    while (i < dv.size) {
      val std = featuresStd(i)
      if (std != 0) {
        dv.values(i) /= std
      } else {
        dv.values(i) = 0.0
      }
      i += 1
    }
    dv
  case sv: SparseVector =>
    // Scale only the active (non-zero) values in place, so the vector
    // stays sparse and is never densified.
    var j = 0
    while (j < sv.numActives) {
      val i = sv.indices(j)
      val std = featuresStd(i)
      if (std != 0) {
        sv.values(j) /= std
      } else {
        sv.values(j) = 0.0
      }
      j += 1
    }
    sv
}
```
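The point of the patch is that the sparse branch only touches the active values, so a row is never expanded to all 8,289,919 features during standardization; that densification is presumably what triggered the OOM.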
   
After that, I used the following code to test performance, launched with `spark-shell --driver-memory=32G --conf spark.driver.maxResultSize=4g`:
```scala
import org.apache.spark.ml.classification._
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

// Map the +1/-1 libsvm labels to 1/0 for binomial logistic regression.
val df = spark.read.format("libsvm")
  .load("/data1/Datasets/webspam/webspam_wc_normalized_trigram.svm.10k")
  .withColumn("label", (col("label") + 1) / 2)

val lr1 = new LogisticRegression().setMaxIter(100).setFamily("binomial").setBlockSize(128) // this PR
val start = System.currentTimeMillis; val model1 = lr1.fit(df); val end = System.currentTimeMillis; end - start

val lr2 = new LogisticRegression().setMaxIter(100).setFamily("binomial").setBlockSize(1024) // this PR
val start = System.currentTimeMillis; val model2 = lr2.fit(df); val end = System.currentTimeMillis; end - start

val lr = new LogisticRegression().setMaxIter(100).setFamily("binomial") // 2.4.4
val start = System.currentTimeMillis; val model = lr.fit(df); val end = System.currentTimeMillis; end - start
```
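As an aside, in spark-shell the same wall-clock measurement can also be taken with `spark.time(lr1.fit(df))`, which prints the elapsed time in milliseconds.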
   
Results:

| Impl | This PR (blockSize=128) | This PR (blockSize=1024) | 2.4.4 |
|------|----------|------------|----------|
| summary.totalIterations | 31 | 31 | 31 |
| duration (ms) | 298514 | 133982 | 108375 |
| RAM | 425 | 425 | 396 |
   
   
For this sparse dataset, this PR (blockSize=1024, with the updated standardization) is about 24% slower than 2.4.4 (133982 ms vs. 108375 ms) and uses about 7% more RAM (425 vs. 396).
   
So I agree with you that we should revert this PR and the related PRs [LinearSVC](https://github.com/apache/spark/pull/27360) and [LinearRegression](https://github.com/apache/spark/pull/27396).
Since [ALS/MLP extend HasBlockSize](https://github.com/apache/spark/pull/27389) depends on [LinearSVC](https://github.com/apache/spark/pull/27360), it may also need to be reverted for now @huaxingao
   
