[ 
https://issues.apache.org/jira/browse/SPARK-32107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-32107.
--------------------------------------
    Resolution: Invalid

> Dask faster than Spark with a lot less iterations and better accuracy
> ---------------------------------------------------------------------
>
>                 Key: SPARK-32107
>                 URL: https://issues.apache.org/jira/browse/SPARK-32107
>             Project: Spark
>          Issue Type: Question
>          Components: MLlib
>    Affects Versions: 2.4.5
>         Environment: Anaconda for Windows with PySpark 2.4.5
>            Reporter: Julian
>            Priority: Minor
>
> Hello,
> I'm benchmarking k-means clustering Dask versus Spark.
> Right now these are only benchmarks on my laptop, but I have some 
> interesting results and I'm looking for an explanation before I 
> benchmark this algorithm further on a cluster.
> I've logged the execution time, model cluster predictions, and 
> iterations. Both benchmarks used the same data with 1.6 million rows.
> The questions are:
>  * Why does Spark need a lot more iterations than Dask?
>  * Why is clustering less accurate in Spark than in Dask?
> It's unclear to me why the results differ, because both use the same 
> underlying algorithm and have more or less the same default parameters.
> *Dask*
> KMeans( n_clusters=8, init='k-means||', oversampling_factor=2, max_iter=300, 
> tol=0.0001, precompute_distances='auto', random_state=None, copy_x=True, 
> n_jobs=1, algorithm='full', init_max_iter=None, )
> *Spark*
>  I've set maxIter to 300 and reset the seed for every benchmark.
> KMeans( featuresCol='features', predictionCol='prediction', k=2, 
> initMode='k-means||', initSteps=2, tol=0.0001, maxIter=20, seed=None, 
> distanceMeasure='euclidean', )
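As quoted, the two configurations don't quite match: the Spark signature shows k=2 and maxIter=20 (presumably the defaults), while the Dask call uses n_clusters=8 and max_iter=300. A minimal sketch of a Spark setup aligned with the Dask parameters might look like the following; it assumes a DataFrame `df` with a `features` vector column, which is not shown in the report.

```python
from pyspark.ml.clustering import KMeans

# Sketch only: explicitly match the Dask configuration above instead of
# relying on Spark's defaults (k=2, maxIter=20).
kmeans = KMeans(
    featuresCol='features',
    predictionCol='prediction',
    k=8,                        # Dask: n_clusters=8
    initMode='k-means||',       # same initialization in both libraries
    initSteps=2,                # Dask: oversampling_factor=2
    tol=0.0001,                 # Dask: tol=0.0001
    maxIter=300,                # Dask: max_iter=300
    distanceMeasure='euclidean',
)
# model = kmeans.fit(df)  # `df` is assumed, not part of the report
```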
> Here you can see the duration of each k-means clustering run together 
> with the number of iterations needed to reach a result. Spark is a lot 
> slower than Dask overall and also needs a lot more iterations. 
> Interestingly, Spark is faster per iteration (the slope of a 
> regression line) and faster at initialization (the y-intercept of the 
> regression line). In the Spark benchmarks one can also make out a 
> second line, which I can't yet explain.
> [!https://user-images.githubusercontent.com/31596773/85844596-4564af00-b7a3-11ea-90fb-9c525d9afaad.png!|https://user-images.githubusercontent.com/31596773/85844596-4564af00-b7a3-11ea-90fb-9c525d9afaad.png]
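The slope/intercept decomposition described above can be reproduced with a simple least-squares fit over the logged (iterations, duration) pairs. The numbers below are illustrative placeholders, not the benchmark's actual measurements.

```python
import numpy as np

# Hypothetical logged data: iteration counts and total run durations.
iterations = np.array([5, 10, 15, 20, 25])
duration_s = np.array([2.1, 3.0, 4.2, 5.1, 5.9])

# Degree-1 least-squares fit: slope = per-iteration cost,
# intercept = fixed initialization cost.
slope, intercept = np.polyfit(iterations, duration_s, 1)
print(f"per-iteration cost ~ {slope:.3f}s, init cost ~ {intercept:.3f}s")
```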
> The training data is an equally spaced grid. The circles around the 
> cluster centers show the standard deviation. The clusters overlap, so 
> one hundred percent accuracy is impossible. The red markers are the 
> predicted cluster centers and the arrows point to their corresponding 
> true cluster centers. In this example the clustering is not correct: 
> one cluster is in the wrong spot and two predicted cluster centers 
> share the same true cluster center. I can make these plots for all 
> models.
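The visual matching of predicted centers to true centers described above can be scored programmatically with an optimal one-to-one (Hungarian) assignment. This is a sketch with illustrative coordinates, not the benchmark's actual centers.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

# Illustrative true and predicted cluster centers (2-D).
true_centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
pred_centers = np.array([[0.1, -0.1], [4.9, 0.2], [2.5, 2.5]])

# Pairwise distances between every true and predicted center.
cost = cdist(true_centers, pred_centers)

# Optimal one-to-one assignment minimizing the total distance;
# a large residual error flags a center placed on the wrong spot.
row, col = linear_sum_assignment(cost)
total_error = cost[row, col].sum()
print(total_error)
```

Summing the matched distances gives a single error figure per model, which would make the per-run accuracy comparison between the two libraries reproducible.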
> [!https://user-images.githubusercontent.com/31596773/85845362-6974c000-b7a4-11ea-9709-4b32833fe238.png!|https://user-images.githubusercontent.com/31596773/85845362-6974c000-b7a4-11ea-9709-4b32833fe238.png]
> The graph on the right makes everything much stranger. Apparently the 
> Spark implementation is less accurate than the Dask implementation. 
> You can also see the distributions of duration and iterations much 
> better (these are seaborn boxenplots).
> [!https://user-images.githubusercontent.com/31596773/85865158-c2088500-b7c5-11ea-83c2-dbd6808338a5.png!|https://user-images.githubusercontent.com/31596773/85865158-c2088500-b7c5-11ea-83c2-dbd6808338a5.png]
> I'm using Anaconda for Windows and PySpark 2.4.5 and Dask 2.5.2.
> I filed this issue for [Dask|https://github.com/dask/dask-ml/issues/686] and 
> [Spark|https://issues.apache.org/jira/browse/SPARK-32107].
> Best regards
>  Julian



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
