Hi Saul, There is already a PR for the same - https://github.com/scikit-learn/scikit-learn/pull/4301.
It is under review... feel free to take part in the discussions... Thanks R On Thu, Mar 12, 2015 at 3:40 AM, Saul Spatz <saul.sp...@gmail.com> wrote: > The gap statistic is a popular method of choosing the "optimum" k in > k-means. I've written an implementation, but it doesn't appear to be useful > for my client's data, and I'd like to contribute it. I'm attaching my > script so you can see what I'm talking about. I know I have to document it, > and I need to parameterize the function. > > The gap statistic algorithm works by generating a number B of random > datasets as references and clustering them each of them with 1, 2, ..., K > clusters. The actual data is also clustered with the same number of > clusters. The statistics for the reference datasets and the actual data are > compared, and the optimum value of k is chosen based on that comparison. > > I want to make K (the maximum number of clusters) and B (the number of > reference datasets to generate) parameters of the function. Also, the user > ought to be able to choose between using KMeans or MiniBatchKMeans for the > clustering, but then he also has to be able to choose the parameters for the > clustering method. Unfortunately, the parameters (or their default values) > aren't quite the same for the same for KMeans and MiniBatchKMeans. To avoid > all these complications, I would have the user pass in a clustering object, > so the signature would be > > gap(km, K=9, B=10) > > or something like that, where km is either a KMeans or a MiniBatchKMeans > object. > > The function returns (k, gap, s) where k is the optimal k, gap is the gap > statistic, and s is the standard deviation of the dataset dispersion. > Sometimes the method fails, and in that case, the k returned is None, but > gap and s are still returned. > > The k returned is the smallest k such that gap[k] >= gap[k+1] - s[k+1]. > Right now, the function works by computing all the statistics, and then > checking for an acceptable candidate. I should probably reorganize the code > to use test and generate instead, quitting when an acceptable k is found. > > Do you have any interest in this? > > ------------------------------------------------------------------------------ > Dive into the World of Parallel Programming The Go Parallel Website, > sponsored > by Intel and developed in partnership with Slashdot Media, is your hub for > all > things parallel software development, from weekly thought leadership blogs > to > news, videos, case studies, tutorials and more. Take a look and join the > conversation now. http://goparallel.sourceforge.net/ > _______________________________________________ > Scikit-learn-general mailing list > Scikit-learn-general@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general > ------------------------------------------------------------------------------ Dive into the World of Parallel Programming The Go Parallel Website, sponsored by Intel and developed in partnership with Slashdot Media, is your hub for all things parallel software development, from weekly thought leadership blogs to news, videos, case studies, tutorials and more. Take a look and join the conversation now. http://goparallel.sourceforge.net/ _______________________________________________ Scikit-learn-general mailing list Scikit-learn-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/scikit-learn-general