Hi,

I am trying to implement the Gap statistic on Spark, which aims to
determine the actual number of clusters for KMeans. Given a range of
possible Ks (e.g. K = 1 to 10), Gap runs KMeans for each K in the range.
Since computing KMeans for K=10 takes more time than for K=1,2,3,4
together, I would like to partition the computation so that K=10 is on one
node and K=1,2,3,4 are together on another node.

As an example, if I want 3 partitions and have the Ks
[1,2,3,4,5,6,7,8,9,10], I would like to get [1,2,3,4,5,6] [7,8] [9,10]
after partitionBy(3, partitionFunc).

Any suggestions on how I could implement the partition function for
partitionBy to achieve something like this? I've implemented a function
that divides the Ks into 2 buckets using a greedy algorithm, but I don't
see how to control which element goes into which bucket from the partition
function.
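
To make the question concrete, here is a minimal sketch of what I have in
mind. It assumes the cost of KMeans grows roughly in proportion to K (my
assumption, not something I measured), and that PySpark's partitionBy sends
a key to partition partitionFunc(key) % numPartitions, so returning the
bucket index directly should pin a K to that bucket. The names
greedy_assignment and make_partition_func are made up for illustration.

```python
import heapq


def greedy_assignment(ks, num_partitions, cost=lambda k: k):
    """Map each K to a bucket index, heaviest K first, always filling
    the currently lightest bucket. cost(k) is an assumed cost model."""
    # Heap of (current load, bucket index); the lightest bucket is on top.
    heap = [(0, p) for p in range(num_partitions)]
    heapq.heapify(heap)
    assignment = {}
    for k in sorted(ks, key=cost, reverse=True):
        load, p = heapq.heappop(heap)
        assignment[k] = p
        heapq.heappush(heap, (load + cost(k), p))
    return assignment


def make_partition_func(assignment):
    """Wrap the precomputed table so it can be passed as partitionFunc."""
    def partition_func(k):
        return assignment[k]
    return partition_func


ks = list(range(1, 11))
assignment = greedy_assignment(ks, 3)
partition_func = make_partition_func(assignment)

# Intended use with PySpark (not executed here; sc is a SparkContext):
#   rdd = sc.parallelize([(k, None) for k in ks])
#   rdd = rdd.partitionBy(3, partition_func)
#   rdd.glom().collect()  # lists the (K, value) pairs held by each partition
```

The idea is to compute the whole assignment on the driver first and let
the partition function be a plain lookup. Note that this greedy scheme
balances the estimated load but does not produce the contiguous grouping
from my example above.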

Any suggestions would be very helpful.

Thank you.
