Hi, I am trying to implement the Gap statistic on Spark, which aims to determine the actual number of clusters for KMeans. Given a range of possible Ks (e.g. K = 1 to 10), Gap runs KMeans for each K in the range. Since computing KMeans for K=10 takes more time than for K=1,2,3,4 together, I would like to partition the computation so that K=10 runs on one node and K=1,2,3,4 run together on another node.
As an example, if I want 3 partitions and have the Ks [1,2,3,4,5,6,7,8,9,10], I would like to get [1,2,3,4,5,6] [7,8] [9,10] after partitionBy(3, partitionFunc). How could I implement the partition function for partitionBy to achieve something like this? I've implemented a function that divides the Ks into 2 buckets based on a greedy algorithm, but I don't see how to control which element goes into which bucket from within the partition function. Any suggestions would be very helpful. Thank you.
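
To make the question concrete, here is a minimal sketch of the kind of thing I have in mind, assuming PySpark's RDD API, where partitionBy calls partitionFunc on each key and takes the result modulo the number of partitions. The idea is to compute a K-to-bucket map on the driver with a greedy step (largest cost first, always into the least-loaded bucket) and let the partition function simply look the key up. The names greedy_assignment and make_partition_func are hypothetical, and the cost model (cost proportional to K) is only an assumption.

```python
import heapq


def greedy_assignment(ks, num_partitions, cost=lambda k: k):
    """Map each K to a bucket index, balancing the total estimated cost.

    Assumption: cost(k) is a rough estimate of the KMeans runtime for k;
    by default it is taken to be proportional to k.
    """
    # Min-heap of (current load, bucket index): the least-loaded bucket is on top.
    heap = [(0, i) for i in range(num_partitions)]
    heapq.heapify(heap)
    assignment = {}
    # Place the most expensive Ks first, each into the least-loaded bucket.
    for k in sorted(ks, key=cost, reverse=True):
        load, bucket = heapq.heappop(heap)
        assignment[k] = bucket
        heapq.heappush(heap, (load + cost(k), bucket))
    return assignment


def make_partition_func(assignment):
    """Return a function suitable as partitionFunc: key -> bucket index."""
    def partition_func(k):
        return assignment[k]
    return partition_func


assignment = greedy_assignment(range(1, 11), 3)
partition_func = make_partition_func(assignment)
```

With an RDD of (K, value) pairs this would be passed as rdd.partitionBy(3, partition_func). Note that this balances the estimated load per bucket rather than producing the contiguous ranges from my example above.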
