zhengruifeng commented on issue #27758: [SPARK-31007][ML] KMeans optimization 
based on triangle-inequality
URL: https://github.com/apache/spark/pull/27758#issuecomment-593339550
 
 
   env: bin/spark-shell --driver-memory=64G
   
   testCode:
   ```scala
   import org.apache.spark.ml.linalg._
   import org.apache.spark.ml.clustering._
   
   val df = 
spark.read.format("libsvm").load("/data1/Datasets/webspam/webspam_wc_normalized_trigram.svm.10k")
   df.persist()
   
   val km = new KMeans().setK(16).setMaxIter(5)
   val kmm = km.fit(df)
   
   val results = Seq(2,4,8,16).map { k => val km = new 
KMeans().setK(k).setMaxIter(20); val start = System.currentTimeMillis; val kmm 
= km.fit(df); val end = System.currentTimeMillis; val train = end - start;  
kmm.transform(df).count; val start2 = System.currentTimeMillis; 
kmm.transform(df).count; val end2 = System.currentTimeMillis; val predict = 
end2 - start2; val result = (k, train, kmm.summary.numIter, 
kmm.summary.trainingCost, predict); println(result); result }
   
   
   val df2 = spark.read.format("libsvm").load("/data1/Datasets/a9a/a9a")
   df2.persist()
   
   val km = new KMeans().setK(16).setMaxIter(10)
   val kmm = km.fit(df2)
   
   val results2 = Seq(2,4,8,16,32,64,128,256).map { k => val km = new 
KMeans().setK(k).setMaxIter(20); val start = System.currentTimeMillis; val kmm 
= km.fit(df2); val end = System.currentTimeMillis; val train = end - start;  
kmm.transform(df2).count; val start2 = System.currentTimeMillis; 
kmm.transform(df).count; val end2 = System.currentTimeMillis; val predict = 
end2 - start2; val result = (k, train, kmm.summary.numIter, 
kmm.summary.trainingCost, predict); println(result); result }
   ```
   
   1, sparse dataset: webspam, numInstances: 10,000, numFeatures: 8,289,919
   
   |Test on webspam| This PR(k=2) | This PR(k=4) | This PR(k=8) | This PR(k=16) 
| Master(k=2) | Master(k=4) | Master(k=8) | Master(k=16)  |
   
|------|----------|------------|----------|------------|----------|------------|----------|------------|
   |Train Duration 
(sec)|27.602|47.932|145.430|371.239|27.042|46.630|136.205|350.788|
   |Radii Computation (sec)|0.097|0.651|5.558|26.059|-|-|-|-|
   |NumIters|9|10|18|20|9|10|18|20|
   
|Cost|5701.39583824928|4518.01202129673|4013.6152754360096|3520.6545815340055|5701.39583824928|4518.01202129673|4013.6152754360096|3520.6545815340064|
   |Prediction Duration (millsec)|31|31|30|30|36|33|41|43|
   
   
   2, dense dataset: a9a, numInstances: 32,561, numFeatures: 123
   
   |Test on a9a| This PR(k=2) | This PR(k=4) | This PR(k=8) | This PR(k=16) | 
This PR(k=32) | This PR(k=64) | This PR(k=128) | This PR(k=256) | Master(k=2) | 
Master(k=4) | Master(k=8) | Master(k=16)  | Master(k=32) | Master(k=64) | 
Master(k=238) | Master(k=3566)  |
   
|------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|----------|------------|
   |Train Duration 
(sec)|0.465|1.411|0.957|1.109|1.387|2.137|3.891|8.373|0.484|0.758|1.065|1.3|1.577|2.413|4.483|9.616|
   |Radii Computation 
(sec)|0.000|0.000|0.000|0.001|0.002|0.007|0.024|0.093|-|-|-|-|-|-|-|-|
   |NumIters|5|14|20|20|20|20|20|20|5|14|20|20|20|20|20|20|
   
|Cost|223377.07442855154|208261.6049327395|183833.73210801493|166964.5700618612|151753.31986776151|137289.24733092127|122693.39508665689|110459.53320745505|223377.07442855154|208261.6049327395|183833.73210801493|166964.5700618612|151753.31986776151|137289.24733092127|122693.39508665689|110459.53320745505|
   |Prediction Duration 
(millsec)|32|33|31|30|30|30|30|31|39|36|38|33|32|29|35|31|
   
   conclusion:
   1, the convergence of both impl are almost the same.
   2, new impl of prediction is likely to be faster than existing impl (after 
lazy computation of radii);
   3, computation of radii may take a few seconds, so training maybe slower 
than existing impl in some case (few instances, large k, large dim, then the 
computation of radii matters in the whole training), for example, webspam with 
k=16, new impl took 371.239 sec, while existing impl took 350.788 sec. That is 
because the computation of radii (in all 20 iterations) totally took 26.059 sec.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to