[
https://issues.apache.org/jira/browse/SPARK-14174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
zhengruifeng updated SPARK-14174:
---------------------------------
Description:
The MiniBatchKMeans is a variant of the KMeans algorithm which uses
mini-batches to reduce the computation time, while still attempting to optimise
the same objective function. Mini-batches are subsets of the input data,
randomly sampled in each training iteration. These mini-batches drastically
reduce the amount of computation required to converge to a local solution. In
contrast to other algorithms that reduce the convergence time of k-means,
mini-batch k-means produces results that are generally only slightly worse than
the standard algorithm.
Comparison of the K-Means and MiniBatchKMeans on sklearn :
http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py
Since MiniBatch-KMeans with fraction=1.0 is not equal to KMeans, so I make it a
new estimator
was:
The MiniBatchKMeans is a variant of the KMeans algorithm which uses
mini-batches to reduce the computation time, while still attempting to optimise
the same objective function. Mini-batches are subsets of the input data,
randomly sampled in each training iteration. These mini-batches drastically
reduce the amount of computation required to converge to a local solution. In
contrast to other algorithms that reduce the convergence time of k-means,
mini-batch k-means produces results that are generally only slightly worse than
the standard algorithm.
I have implemented mini-batch kmeans in Mllib, and the acceleration is realy
significant.
The MiniBatch KMeans is named XMeans in following lines.
{code}
val path = "/tmp/mnist8m.scale"
val data = MLUtils.loadLibSVMFile(sc, path)
val vecs = data.map(_.features).persist()
val km = KMeans.train(data=vecs, k=10, maxIterations=10, runs=1,
initializationMode="k-means||", seed=123l)
km.computeCost(vecs)
res0: Double = 3.317029898599564E8
val xm = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1,
initializationMode="k-means||", miniBatchFraction=0.1, seed=123l)
xm.computeCost(vecs)
res1: Double = 3.3169865959604424E8
val xm2 = XMeans.train(data=vecs, k=10, maxIterations=10, runs=1,
initializationMode="k-means||", miniBatchFraction=0.01, seed=123l)
xm2.computeCost(vecs)
res2: Double = 3.317195831216454E8
{code}
The above three training all reached the max number of iterations 10.
We can see that the WSSSEs are almost the same. While their speed perfermence
have significant difference:
{code}
KMeans 2876sec
MiniBatch KMeans (fraction=0.1) 263sec
MiniBatch KMeans (fraction=0.01) 90sec
{code}
With appropriate fraction, the bigger the dataset is, the higher speedup is.
The data used above have 8,100,000 samples, 784 features. It can be downloaded
here
(https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.scale.bz2)
Comparison of the K-Means and MiniBatchKMeans on sklearn :
http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py
> Accelerate KMeans via Mini-Batch EM
> -----------------------------------
>
> Key: SPARK-14174
> URL: https://issues.apache.org/jira/browse/SPARK-14174
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: zhengruifeng
> Attachments: MBKM.xlsx
>
>
> The MiniBatchKMeans is a variant of the KMeans algorithm which uses
> mini-batches to reduce the computation time, while still attempting to
> optimise the same objective function. Mini-batches are subsets of the input
> data, randomly sampled in each training iteration. These mini-batches
> drastically reduce the amount of computation required to converge to a local
> solution. In contrast to other algorithms that reduce the convergence time of
> k-means, mini-batch k-means produces results that are generally only slightly
> worse than the standard algorithm.
> Comparison of the K-Means and MiniBatchKMeans on sklearn :
> http://scikit-learn.org/stable/auto_examples/cluster/plot_mini_batch_kmeans.html#example-cluster-plot-mini-batch-kmeans-py
> Since MiniBatch-KMeans with fraction=1.0 is not equal to KMeans, so I make it
> a new estimator
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]