[
https://issues.apache.org/jira/browse/SPARK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674161#comment-15674161
]
zakaria hili commented on SPARK-18356:
--------------------------------------
Hi [~yuhaoyan],
I tried to improve KMeans using the same caching concept as in Logistic
Regression.
https://github.com/ZakariaHili/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala#L310
and here are my performance results:
I used only one VM (local mode) with Python.
-> Spark without the improvement: training takes ~0.605s (mean value)
-> Spark with the improved KMeans: ~0.518s (and the warning disappeared)
So the gain is modest, but the difference may become more visible if the
train method is run many times.
What do you think?
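
For reference, the change follows the persist/unpersist pattern already used
in LogisticRegression. A minimal sketch of the idea (the helper name
trainWithMLlib is illustrative, not the actual private API; the real change
is in ml/clustering/KMeans.scala):

```scala
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.storage.StorageLevel

// Inside ml.clustering.KMeans.fit: convert the DataFrame into the RDD
// that the mllib implementation consumes.
val instances: RDD[org.apache.spark.mllib.linalg.Vector] =
  dataset.select("features").rdd.map {
    case Row(point: org.apache.spark.ml.linalg.Vector) => OldVectors.fromML(point)
  }

// Cache the intermediate RDD only when the caller has not already cached
// the input, mirroring LogisticRegression's handlePersistence flag.
val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE
if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)

val model = trainWithMLlib(instances) // runs the mllib KMeans algorithm

// Release the cache once training is done.
if (handlePersistence) instances.unpersist(blocking = false)
```

Guarding on the input's storage level avoids double-caching when the user
has already called df.cache(), which is what silences the warning without
wasting memory.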
> Issue + Resolution: Kmeans Spark Performances (ML package)
> ----------------------------------------------------------
>
> Key: SPARK-18356
> URL: https://issues.apache.org/jira/browse/SPARK-18356
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.0.0, 2.0.1
> Reporter: zakaria hili
> Priority: Minor
> Labels: easyfix
>
> Hello,
> I'm a newbie in Spark, but I think I found a small problem that can affect
> Spark KMeans performance.
> Before explaining the problem, I want to explain the warning that I faced.
> I tried to use Spark KMeans with DataFrames to cluster my data:
> df_Part = assembler.transform(df_Part)
> df_Part.cache()
> while (k <= max_cluster) and (wssse > seuilStop):
>     kmeans = KMeans().setK(k)
>     model = kmeans.fit(df_Part)
>     wssse = model.computeCost(df_Part)
>     k = k + 1
> but when I run the code I receive this warning:
> WARN KMeans: The input data is not directly cached, which may hurt
> performance if its parent RDDs are also uncached.
> I searched the Spark source code to find the source of this problem, and
> realized there are two classes responsible for this warning:
> (mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala )
> (mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala )
>
> When my DataFrame is cached, the fit method transforms my DataFrame into an
> internal RDD which is not cached.
> DataFrame -> RDD -> run KMeans training algorithm(RDD)
> -> The first class (ml package) is responsible for converting the DataFrame
> into an RDD, then calls the KMeans algorithm.
> -> The second class (mllib package) implements the KMeans algorithm, and here
> Spark verifies whether the RDD is cached; if not, a warning is generated.
> So, the solution to this problem is to cache the RDD before running the
> KMeans algorithm.
> https://github.com/ZakariaHili/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala
> All we need is to add two lines:
> cache the RDD just after the DataFrame transformation, then unpersist it
> after the training algorithm.
> I hope that was clear.
> If you think I'm wrong, please let me know.
> Sincerely,
> Zakaria HILI
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]