zhengruifeng edited a comment on issue #27758: [SPARK-31007][ML] KMeans optimization based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593339550

env: `bin/spark-shell --driver-memory=64G`

test code:

```scala
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.clustering._

// sparse dataset: webspam
val df = spark.read.format("libsvm").load("/data1/Datasets/webspam/webspam_wc_normalized_trigram.svm.10k")
df.persist()

// warm-up fit before timing
val km = new KMeans().setK(16).setMaxIter(5)
val kmm = km.fit(df)

val results = Seq(2, 4, 8, 16).map { k =>
  val km = new KMeans().setK(k).setMaxIter(20)
  val start = System.currentTimeMillis
  val kmm = km.fit(df)
  val end = System.currentTimeMillis
  val train = end - start
  kmm.transform(df).count // trigger the lazy computation of radii
  val start2 = System.currentTimeMillis
  kmm.transform(df).count
  val end2 = System.currentTimeMillis
  val predict = end2 - start2
  val result = (k, train, kmm.summary.numIter, kmm.summary.trainingCost, predict)
  println(result)
  result
}

// dense dataset: a9a
val df2 = spark.read.format("libsvm").load("/data1/Datasets/a9a/a9a")
df2.persist()

// warm-up fit before timing
val km2 = new KMeans().setK(16).setMaxIter(10)
val kmm2 = km2.fit(df2)

val results2 = Seq(2, 4, 8, 16, 32, 64, 128, 256).map { k =>
  val km = new KMeans().setK(k).setMaxIter(20)
  val start = System.currentTimeMillis
  val kmm = km.fit(df2)
  val end = System.currentTimeMillis
  val train = end - start
  kmm.transform(df2).count // trigger the lazy computation of radii
  val start2 = System.currentTimeMillis
  kmm.transform(df2).count // predict on df2, the dataset under test
  val end2 = System.currentTimeMillis
  val predict = end2 - start2
  val result = (k, train, kmm.summary.numIter, kmm.summary.trainingCost, predict)
  println(result)
  result
}
```

1, sparse dataset: webspam, numInstances: 10,000, numFeatures: 8,289,919

| Test on webspam | This PR (k=2) | This PR (k=4) | This PR (k=8) | This PR (k=16) | Master (k=2) | Master (k=4) | Master (k=8) | Master (k=16) |
|------|----------|------------|----------|------------|----------|------------|----------|------------|
| Train Duration (sec) | 27.602 | 47.932 | 145.430 | 371.239 | 27.042 | 46.630 | 136.205 | 350.788 |
| Radii Computation (sec) | 0.097 | 0.651 | 5.558 | 26.059 | - | - | - | - |
| NumIters | 9 | 10 | 18 | 20 | 9 | 10 | 18 | 20 |
| Cost | 5701.39583824928 | 4518.01202129673 | 4013.6152754360096 | 3520.6545815340055 | 5701.39583824928 | 4518.01202129673 | 4013.6152754360096 | 3520.6545815340064 |
| Prediction Duration (ms) | 31 | 31 | 30 | 30 | 36 | 33 | 41 | 43 |

2, dense dataset: a9a, numInstances: 32,561, numFeatures: 123

| Test on a9a | This PR (k=2) | This PR (k=4) | This PR (k=8) | This PR (k=16) | This PR (k=32) | This PR (k=64) | This PR (k=128) | This PR (k=256) | Master (k=2) | Master (k=4) | Master (k=8) | Master (k=16) | Master (k=32) | Master (k=64) | Master (k=128) | Master (k=256) |
|------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|
| Train Duration (sec) | 0.465 | 1.411 | 0.957 | 1.109 | 1.387 | 2.137 | 3.891 | 8.373 | 0.484 | 0.758 | 1.065 | 1.3 | 1.577 | 2.413 | 4.483 | 9.616 |
| Radii Computation (sec) | 0.000 | 0.000 | 0.000 | 0.001 | 0.002 | 0.007 | 0.024 | 0.093 | - | - | - | - | - | - | - | - |
| NumIters | 5 | 14 | 20 | 20 | 20 | 20 | 20 | 20 | 5 | 14 | 20 | 20 | 20 | 20 | 20 | 20 |
| Cost | 223377.07442855154 | 208261.6049327395 | 183833.73210801493 | 166964.5700618612 | 151753.31986776151 | 137289.24733092127 | 122693.39508665689 | 110459.53320745505 | 223377.07442855154 | 208261.6049327395 | 183833.73210801493 | 166964.5700618612 | 151753.31986776151 | 137289.24733092127 | 122693.39508665689 | 110459.53320745505 |
| Prediction Duration (ms) | 32 | 33 | 31 | 30 | 30 | 30 | 30 | 31 | 39 | 36 | 38 | 33 | 32 | 29 | 35 | 31 |

conclusion:

1. Both implementations converge in essentially the same way: the iteration counts match and the costs are nearly identical.
2. Prediction with the new implementation tends to be faster than with the existing one (once the radii have been lazily computed).
3. Computing the radii can take a few seconds, so training may be slower than the existing implementation in some cases (few instances, large k, high dimensionality, where the radii computation is a noticeable fraction of total training time). For example, on webspam with k=16 the new implementation took 371.239 sec versus 350.788 sec for the existing one; the radii computation across all 20 iterations accounted for 26.059 sec of that gap.
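For readers unfamiliar with the trick being benchmarked, the triangle-inequality shortcut can be sketched in a few lines of plain Scala. This is a hypothetical illustration, not the actual Spark implementation: `TrianglePruning`, `radii`, and `predict` are made-up names, and the real code works on Spark vectors with cached norms. The idea: for each center c, precompute r(c) = half the distance from c to its nearest other center; if a point x satisfies dist(x, c) <= r(c), the triangle inequality guarantees c is x's nearest center and the remaining distance computations can be skipped.

```scala
// Hypothetical sketch of triangle-inequality pruning for nearest-center
// assignment (illustration only; not Spark's actual implementation).
object TrianglePruning {
  // Euclidean distance between two dense points.
  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // r(c_i) = 0.5 * min_{j != i} dist(c_i, c_j).
  // If dist(x, c_i) <= r(c_i), then for any other center c_j:
  //   dist(x, c_j) >= dist(c_i, c_j) - dist(x, c_i) >= 2*r(c_i) - r(c_i) = r(c_i)
  // so c_i is guaranteed to be the nearest center.
  def radii(centers: Array[Array[Double]]): Array[Double] =
    centers.indices.map { i =>
      0.5 * centers.indices.filter(_ != i)
        .map(j => dist(centers(i), centers(j))).min
    }.toArray

  // Assign x to its nearest center, skipping distances once pruning applies.
  def predict(x: Array[Double], centers: Array[Array[Double]], r: Array[Double]): Int = {
    var best = 0
    var bestDist = dist(x, centers(0))
    if (bestDist <= r(0)) return 0 // pruned: no other center can be closer
    var i = 1
    while (i < centers.length) {
      val d = dist(x, centers(i))
      if (d < bestDist) {
        best = i
        bestDist = d
        if (d <= r(i)) return i // pruned: remaining centers cannot win
      }
      i += 1
    }
    best
  }
}
```

This also makes the trade-off in conclusion 3 concrete: `radii` costs O(k^2) pairwise center distances per iteration, which is why it shows up in training time for large k and high dimensions, while `predict` only gets cheaper.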
