zhengruifeng commented on issue #27374: [SPARK-30659][ML][PYSPARK] LogisticRegression blockify input vectors
URL: https://github.com/apache/spark/pull/27374#issuecomment-582898607

@WeichenXu123 @mengxr @srowen I just ran a quick test on webspam: I drew the first 10,000 samples from `webspam_wc_normalized_trigram.svm`; the sampled dataset has numFeatures=8,289,919, and its sparsity (percentage of non-zero values) is about 0.4489%.

This PR fails due to OOM in standardization, so I applied a patch:

```scala
val vec = features match {
  case dv: DenseVector =>
    var i = 0
    while (i < dv.size) {
      val std = featuresStd(i)
      if (std != 0) {
        dv.values(i) /= std
      } else {
        dv.values(i) = 0.0
      }
      i += 1
    }
    dv

  case sv: SparseVector =>
    var j = 0
    while (j < sv.numActives) {
      val i = sv.indices(j)
      val std = featuresStd(i)
      if (std != 0) {
        sv.values(j) /= std
      } else {
        sv.values(j) = 0.0
      }
      j += 1
    }
    sv
}
```

After that, I used the following code to test performance:

```
spark-shell --driver-memory=32G --conf spark.driver.maxResultSize=4g
```

```scala
import org.apache.spark.ml.classification._
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

val df = spark.read.format("libsvm")
  .load("/data1/Datasets/webspam/webspam_wc_normalized_trigram.svm.10k")
  .withColumn("label", (col("label") + 1) / 2)

// this PR, blockSize=128
val lr1 = new LogisticRegression().setMaxIter(100).setFamily("binomial").setBlockSize(128)
val start = System.currentTimeMillis; val model1 = lr1.fit(df); val end = System.currentTimeMillis; end - start

// this PR, blockSize=1024
val lr2 = new LogisticRegression().setMaxIter(100).setFamily("binomial").setBlockSize(1024)
val start = System.currentTimeMillis; val model2 = lr2.fit(df); val end = System.currentTimeMillis; end - start

// 2.4.4
val lr = new LogisticRegression().setMaxIter(100).setFamily("binomial")
val start = System.currentTimeMillis; val model = lr.fit(df); val end = System.currentTimeMillis; end - start
```

Result:

| Impl | This PR (blockSize=128) | This PR (blockSize=1024) | 2.4.4 |
|------|----------|------------|----------|
| summary.totalIterations | 31 | 31 | 31 |
| duration (ms) | 298514 | 133982 | 108375 |
| RAM | 425 | 425 | 396 |

For this sparse dataset, this PR (with the updated standardization) is about 23% slower and uses about 7% more RAM. So I agree with you that we should revert this PR and the related PRs [LinearSVC](https://github.com/apache/spark/pull/27360) and [LinearRegression](https://github.com/apache/spark/pull/27396). Since [ALS/MLP extend HasBlockSize](https://github.com/apache/spark/pull/27389) depends on [LinearSVC](https://github.com/apache/spark/pull/27360), it may also need to be reverted for now. @huaxingao
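As a sanity check on the overhead percentages above, the table numbers can be plugged into a small sketch (the object and method names here are made up for illustration; durations come from `System.currentTimeMillis` deltas, so the best-case block setting is compared against the 2.4.4 baseline):

```scala
// Verify the "~23% slower, ~7% more RAM" figures from the result table.
object OverheadCheck {
  // Relative increase of `candidate` over `baseline`, as a percentage.
  def pctIncrease(candidate: Double, baseline: Double): Double = {
    require(baseline > 0.0)
    (candidate / baseline - 1.0) * 100.0
  }

  def main(args: Array[String]): Unit = {
    val durationPr1024 = 133982.0 // this PR, blockSize=1024 (ms)
    val durationBase   = 108375.0 // Spark 2.4.4 (ms)
    val ramPr          = 425.0
    val ramBase        = 396.0
    println(f"duration overhead: ${pctIncrease(durationPr1024, durationBase)}%.1f%%") // ~23.6%
    println(f"RAM overhead: ${pctIncrease(ramPr, ramBase)}%.1f%%")                    // ~7.3%
  }
}
```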
