Github user zhengruifeng commented on the issue:

    https://github.com/apache/spark/pull/11974
  
    @sethah Agree that if I/O is the bottleneck, the speedup should be small.
    The cost in the results doc is computed on the whole dataset, not on the 
sampled ones, since I think comparing costs on different instances is not fair. 
    I did not run MiniBatchKMeans with a large number of iterations, just because I 
didn't want to make too many copy/paste operations to fill the tables...
    Following are another two tests with `k=10`, `maxIterations=20` and 
`frac=1.0, 0.1, 0.01, 0.001` (on my laptop; the tests in JIRA were performed on a 
server)
    
    Low dense data:
    ```
    scala> val n = 1000000
    n: Int = 1000000
    
    scala> val dim = 10
    dim: Int = 10
    
    scala> val rdd = sc.parallelize(1 to n).map(i => Vectors.dense(Array.fill(dim)(random.nextDouble()))).persist()
    rdd: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MapPartitionsRDD[1] at map at <console>:41
    
    scala> rdd.count
    res0: Long = 1000000         
    
    scala> Seq(1.0, 0.1, 0.01, 0.001).foreach{f => println(MiniBatchKMeans.train(rdd, k=10, maxIterations=20, "random", f, 123).computeCost(rdd))}
    17/05/26 09:14:09 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
    17/05/26 09:14:09 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
    Iter 0: Duration 2004, cost 788582.9738726234
    Iter 1: Duration 833, cost 630535.2796758788
    Iter 2: Duration 540, cost 614845.1513723353
    Iter 3: Duration 605, cost 610066.9062292713
    Iter 4: Duration 561, cost 608031.849076637
    Iter 5: Duration 556, cost 606823.855125578
    Iter 6: Duration 573, cost 606045.5575892355
    Iter 7: Duration 476, cost 605559.3562866914
    Iter 8: Duration 477, cost 605256.6012606357
    Iter 9: Duration 485, cost 605057.8933848375
    Iter 10: Duration 444, cost 604913.8680054378
    Iter 11: Duration 496, cost 604799.3876557952
    Iter 12: Duration 380, cost 604700.0366582843
    Iter 13: Duration 484, cost 604611.0252890466
    Iter 14: Duration 412, cost 604523.1312451813
    Iter 15: Duration 469, cost 604437.1852339945
    Iter 16: Duration 414, cost 604350.2820076608
    Iter 17: Duration 425, cost 604269.8453690736
    Iter 18: Duration 422, cost 604191.0100384927
    Iter 19: Duration 552, cost 604115.8521715086
    604045.2371941482
    Iter 0: Duration 107, cost 78968.67203555592
    Iter 1: Duration 161, cost 62982.950117238244
    Iter 2: Duration 145, cost 61679.984194278164
    Iter 3: Duration 182, cost 61180.225915234245
    Iter 4: Duration 153, cost 60774.909568730814
    Iter 5: Duration 140, cost 60499.18450499466
    Iter 6: Duration 111, cost 60631.84303512169
    Iter 7: Duration 140, cost 60293.24798484023
    Iter 8: Duration 114, cost 60586.99701309156
    Iter 9: Duration 151, cost 60748.94599207808
    Iter 10: Duration 148, cost 60325.21264819525
    Iter 11: Duration 169, cost 60661.49913084767
    Iter 12: Duration 149, cost 60589.54155054191
    Iter 13: Duration 103, cost 60388.33938789272
    Iter 14: Duration 115, cost 60682.58364074669
    Iter 15: Duration 108, cost 60477.18379844011
    Iter 16: Duration 125, cost 60692.99362376495
    Iter 17: Duration 125, cost 60457.083009454785
    Iter 18: Duration 151, cost 60155.58402551571
    Iter 19: Duration 149, cost 60675.606410578795
    604111.2208211605
    Iter 0: Duration 76, cost 7806.9827458185455
    Iter 1: Duration 98, cost 6324.535452971191
    Iter 2: Duration 104, cost 6202.206415357503
    Iter 3: Duration 110, cost 6072.991331049303
    Iter 4: Duration 102, cost 6152.855401385752
    Iter 5: Duration 153, cost 6141.858875200935
    Iter 6: Duration 129, cost 5987.641548753511
    Iter 7: Duration 100, cost 6074.370232167772
    Iter 8: Duration 93, cost 6087.813342965538
    Iter 9: Duration 95, cost 6048.723797127357
    Iter 10: Duration 81, cost 6068.360272434616
    Iter 11: Duration 77, cost 6026.502400302613
    Iter 12: Duration 113, cost 6075.433877179167
    Iter 13: Duration 110, cost 6116.414343219921
    Iter 14: Duration 94, cost 6115.643402410937
    Iter 15: Duration 120, cost 6018.897745377334
    Iter 16: Duration 107, cost 6073.868377665494
    Iter 17: Duration 121, cost 6049.590697175236
    Iter 18: Duration 101, cost 6002.554790134795
    Iter 19: Duration 107, cost 6182.314365817103
    604478.1009599782
    Iter 0: Duration 84, cost 720.0194354140257
    Iter 1: Duration 96, cost 628.575082152244
    Iter 2: Duration 93, cost 647.4085507383511
    Iter 3: Duration 94, cost 586.6110099943827
    Iter 4: Duration 73, cost 606.2971088014147
    Iter 5: Duration 77, cost 561.5283061240732
    Iter 6: Duration 92, cost 591.4893772291216
    Iter 7: Duration 100, cost 589.4491867389609
    Iter 8: Duration 94, cost 606.4013710450262
    Iter 9: Duration 98, cost 599.275037352022
    Iter 10: Duration 100, cost 633.685773790703
    Iter 11: Duration 90, cost 600.3590843929948
    Iter 12: Duration 100, cost 587.7426871534223
    Iter 13: Duration 110, cost 601.867530070673
    Iter 14: Duration 81, cost 582.4792460355222
    Iter 15: Duration 70, cost 566.808139887458
    Iter 16: Duration 103, cost 593.1931284344439
    Iter 17: Duration 75, cost 593.0484823003007
    Iter 18: Duration 87, cost 590.4515560012428
    Iter 19: Duration 110, cost 636.6432124538064
    606774.8878757863
    ```
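    
    As context for the log above: the per-iteration `cost` values shrink roughly in proportion to `frac` (about 604k, 60k, 6k and 600), which suggests they are measured on the sampled batch, while the last line of each run is `computeCost` on the whole `rdd`. The following is only a minimal plain-Scala sketch of that effect, not the code from this PR; the data, the fixed `centers`, and the helper names `sqDist` and `cost` are made up for illustration.
    
    ```scala
    import scala.util.Random
    
    // Hypothetical stand-in data: 10000 points in 2 dimensions, uniform in [0, 1).
    val rng = new Random(123)
    val points: Vector[Array[Double]] = Vector.fill(10000)(Array.fill(2)(rng.nextDouble()))
    
    // Hypothetical fixed centers; in the real test they would come from the trained model.
    val centers: Vector[Array[Double]] = Vector(Array(0.25, 0.25), Array(0.75, 0.75))
    
    // Squared Euclidean distance between two points of equal dimension.
    def sqDist(a: Array[Double], b: Array[Double]): Double =
      a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
    
    // K-means cost: sum over points of the squared distance to the nearest center.
    def cost(data: Seq[Array[Double]]): Double =
      data.map(p => centers.map(c => sqDist(p, c)).min).sum
    
    // Keep each point independently with probability frac.
    val frac = 0.1
    val sample = points.filter(_ => rng.nextDouble() < frac)
    
    val fullCost = cost(points)
    val sampleCost = cost(sample)
    
    println(s"full=$fullCost sample=$sampleCost ratio=${sampleCost / fullCost}")
    ```
    
    With the same centers, the cost on a sample comes out close to `frac` times the cost on the full data, so costs measured on batches of different sizes are not directly comparable; evaluating every model on the whole dataset removes that scale factor.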
    
    Low sparse data:
    ```
    scala> val dim = 1000
    dim: Int = 1000
    
    scala> val nnz = 10
    nnz: Int = 10
    
    scala> val rdd = sc.parallelize(1 to n).map(i => Vectors.sparse(dim, random.shuffle(0 to dim - 1).slice(0, nnz).sorted.toArray, Array.fill(nnz)(random.nextDouble()))).persist()
    rdd: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MapPartitionsRDD[243] at map at <console>:43
    
    scala> rdd.count
    res2: Long = 1000000                                                        
    
    
    scala> Seq(1.0, 0.1, 0.01, 0.001).foreach{f => println(MiniBatchKMeans.train(rdd, k=10, maxIterations=20, "random", f, 123).computeCost(rdd))}
    Iter 0: Duration 855, cost 5136755.005959613
    Iter 1: Duration 463, cost 3290593.234145827
    Iter 2: Duration 361, cost 3285988.921198346
    Iter 3: Duration 414, cost 3284159.7181967623
    Iter 4: Duration 412, cost 3283753.486046558
    Iter 5: Duration 359, cost 3283736.9857063256
    Iter 6: Duration 467, cost 3283734.350937253
    Iter 7: Duration 395, cost 3283733.8835221143
    Iter 8: Duration 485, cost 3283733.764217921
    Iter 9: Duration 388, cost 3283733.718858701
    Iter 10: Duration 388, cost 3283733.7100546346
    Iter 11: Duration 472, cost 3283733.706424741
    Iter 12: Duration 343, cost 3283733.7043986535
    Iter 13: Duration 454, cost 3283733.7030791626
    Iter 14: Duration 411, cost 3283733.7022940316
    3283733.7022940316
    Iter 0: Duration 186, cost 515407.20129455277
    Iter 1: Duration 115, cost 328799.0024563563
    Iter 2: Duration 109, cost 329612.0112105651
    Iter 3: Duration 117, cost 329472.7363659147
    Iter 4: Duration 110, cost 328308.1740654686
    Iter 5: Duration 128, cost 327555.0913579955
    Iter 6: Duration 133, cost 328570.3274713976
    Iter 7: Duration 131, cost 326772.2661423224
    Iter 8: Duration 125, cost 329411.9244852358
    Iter 9: Duration 139, cost 329443.76867151144
    Iter 10: Duration 124, cost 328177.790480869
    Iter 11: Duration 129, cost 328689.4495351914
    Iter 12: Duration 117, cost 329209.31353470264
    Iter 13: Duration 178, cost 328004.5349875497
    Iter 14: Duration 176, cost 329325.6991534679
    Iter 15: Duration 150, cost 328495.6369945316
    Iter 16: Duration 105, cost 330078.4435948311
    Iter 17: Duration 142, cost 327892.86837545113
    Iter 18: Duration 102, cost 327830.7592207911
    Iter 19: Duration 101, cost 329399.3734780577
    3284006.927754979
    Iter 0: Duration 74, cost 50827.240443426854
    Iter 1: Duration 89, cost 33098.14517086026
    Iter 2: Duration 77, cost 33310.16139043249
    Iter 3: Duration 88, cost 32502.802014248817
    Iter 4: Duration 98, cost 33314.737208046135
    Iter 5: Duration 102, cost 33413.513205555886
    Iter 6: Duration 101, cost 32513.72389025821
    Iter 7: Duration 112, cost 32844.3953815156
    Iter 8: Duration 105, cost 32983.07571425794
    Iter 9: Duration 101, cost 33060.23970665815
    Iter 10: Duration 102, cost 33210.19304032954
    Iter 11: Duration 74, cost 32761.495919892204
    Iter 12: Duration 81, cost 32968.94719485651
    Iter 13: Duration 101, cost 33236.530002761705
    Iter 14: Duration 97, cost 33148.95273308664
    Iter 15: Duration 101, cost 32854.05889019754
    Iter 16: Duration 118, cost 33125.816622864055
    Iter 17: Duration 101, cost 32967.017394099494
    Iter 18: Duration 90, cost 32625.42645823063
    Iter 19: Duration 118, cost 33449.67241159137
    3291810.8288164968
    Iter 0: Duration 81, cost 4746.085400496897
    Iter 1: Duration 161, cost 3244.616681775365
    Iter 2: Duration 123, cost 3322.3529742161836
    Iter 3: Duration 121, cost 3158.7377760566465
    Iter 4: Duration 113, cost 3269.522708407527
    Iter 5: Duration 85, cost 3102.8306222841898
    Iter 6: Duration 93, cost 3218.01263895013
    Iter 7: Duration 102, cost 3170.8045673871875
    Iter 8: Duration 88, cost 3305.5274012487007
    Iter 9: Duration 82, cost 3268.0629858657344
    Iter 10: Duration 71, cost 3491.1456725640255
    Iter 11: Duration 78, cost 3297.524532173714
    Iter 12: Duration 84, cost 3289.097873512691
    Iter 13: Duration 103, cost 3318.1276798335107
    Iter 14: Duration 118, cost 3225.800723626809
    Iter 15: Duration 108, cost 3140.657430809164
    Iter 16: Duration 113, cost 3221.9599112200685
    Iter 17: Duration 109, cost 3305.624135805366
    Iter 18: Duration 86, cost 3195.7532498689884
    Iter 19: Duration 85, cost 3451.1008608814627
    3312203.2361961133
    ```
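    
    To make the final full-dataset costs easier to compare across fractions, here is a small sketch that computes the relative increase of each run over the `frac=1.0` baseline. The cost values are copied from the last line of each run above; the names `denseCosts`, `sparseCosts`, and `relativeIncrease` are only illustrative. It can be pasted into a Scala REPL or `spark-shell`.
    
    ```scala
    // Sampling fractions, in the order the runs were executed.
    val fracs = Seq(1.0, 0.1, 0.01, 0.001)
    
    // Final costs on the whole dataset, copied from the two transcripts above.
    val denseCosts = Seq(604045.2371941482, 604111.2208211605, 604478.1009599782, 606774.8878757863)
    val sparseCosts = Seq(3283733.7022940316, 3284006.927754979, 3291810.8288164968, 3312203.2361961133)
    
    // Relative cost increase of each run over the first entry (the frac=1.0 baseline).
    def relativeIncrease(costs: Seq[Double]): Seq[Double] =
      costs.map(c => (c - costs.head) / costs.head)
    
    val denseRel = relativeIncrease(denseCosts)
    val sparseRel = relativeIncrease(sparseCosts)
    
    // Print one line per fraction with both increases as percentages.
    fracs.zip(denseRel).zip(sparseRel).foreach { case ((f, d), s) =>
      println(f"frac=$f%-6s dense=+${d * 100}%.4f%% sparse=+${s * 100}%.4f%%")
    }
    ```
    
    On these numbers the increase stays below 1% for every fraction in both tests, and it grows as `frac` shrinks.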

