zhengruifeng commented on issue #27374: [SPARK-30659][ML][PYSPARK] LogisticRegression blockify input vectors
URL: https://github.com/apache/spark/pull/27374#issuecomment-579654325

@srowen I found that on small datasets, the speedup is even more significant.

data: a9a, numFeatures=123, numInstances=32,561

testCode:
```scala
import org.apache.spark.ml.classification._
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

val df = spark.read.format("libsvm").load("/data1/Datasets/a9a/a9a").withColumn("label", (col("label") + 1) / 2)
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count

// initial fit (multinomial family, 100 iterations)
val lr4 = new LogisticRegression().setMaxIter(100).setFitIntercept(false).setFamily("multinomial")
val start = System.currentTimeMillis; val model4 = lr4.fit(df); val end = System.currentTimeMillis; end - start

// this PR
Seq(64, 256, 1024, 4096, 8192).map { b =>
  val start = System.currentTimeMillis
  val model1 = new LogisticRegression().setBlockSize(b).fit(df)
  val end = System.currentTimeMillis
  end - start
}

// Master
Seq(64, 256, 1024, 4096, 8192).map { b =>
  val start = System.currentTimeMillis
  val model1 = new LogisticRegression().fit(df)
  val end = System.currentTimeMillis
  end - start
}
```

result: about **44%~48%** faster. I think that is because on big datasets the communication overhead has a bigger impact on the whole procedure, while on small datasets like a9a, high-level BLAS dominates the performance.

This PR: `List(1630, 1623, 1539, 1559, 1666)`
Master: `List(2985, 3037, 2957, 2994, 2959)`

By the way, I set the default value to 1024 based on the above results. However, the best block size will depend on many factors like numFeatures, sparsity, numInstances, etc.
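For context on where the gain comes from, here is a hedged sketch of the blockified computation (using Breeze, which Spark already depends on; `blockMargins` is a hypothetical helper for illustration, not the PR's actual code). Instead of issuing one level-1 `dot` or level-2 `gemv` call per instance, stacking `blockSize` feature vectors into a matrix lets all margins for the block be computed with a single level-3 BLAS `gemm`, which is far more cache- and SIMD-friendly:

```scala
import breeze.linalg.{DenseMatrix => BDM}

// Hypothetical sketch of a blockified margin computation, not the PR's
// actual implementation.
//   block:        blockSize x numFeatures  (stacked instance vectors)
//   coefficients: numClasses x numFeatures (multinomial coefficient matrix)
//   returns:      blockSize x numClasses   (margins for every row at once)
def blockMargins(block: BDM[Double], coefficients: BDM[Double]): BDM[Double] =
  block * coefficients.t  // one level-3 gemm instead of blockSize gemv calls
```

On a dataset as small as a9a, the per-call overhead of many tiny BLAS operations is large relative to the arithmetic itself, so collapsing them into one gemm per block would plausibly explain why the relative speedup is bigger here than on large datasets, where communication dominates.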
