To address the efficiency issue for large datasets (to some extent), we could maybe add a `clustering` argument that accepts `clustering='pam'` or `clustering='clara'`; 'pam' should probably be the default.
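
To make the proposal a bit more concrete, here is a rough sketch of what the two options could do. This is not an existing scikit-learn API: the names `k_medoids`, `pam`, `clara`, `n_draws`, and `sample_size` are made up for illustration. The PAM part uses random initialization and a naive first-improvement swap loop instead of the full BUILD/SWAP procedure. The CLARA part draws a few random subsamples, runs that PAM on each, and keeps the medoids with the lowest cost on the full dataset.

```python
import numpy as np


def pairwise_distances(A, B):
    """Euclidean distances between the rows of A and the rows of B."""
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1))


def total_cost(D, medoids):
    """Sum of distances from every point to its closest medoid."""
    return D[:, medoids].min(axis=1).sum()


def pam(D, n_clusters, rng):
    """Naive PAM-style swap search on a precomputed distance matrix D.

    Assumption: medoids start at random positions (no BUILD step) and
    the first improving swap is accepted, to keep the sketch short.
    """
    n = D.shape[0]
    medoids = rng.choice(n, size=n_clusters, replace=False)
    best = total_cost(D, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(n_clusters):
            for h in range(n):
                if h in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = h  # try replacing medoid i with point h
                cost = total_cost(D, candidate)
                if cost < best:
                    medoids, best, improved = candidate, cost, True
    return medoids


def clara(X, n_clusters, n_draws=5, sample_size=None, random_state=0):
    """CLARA sketch: PAM on random subsamples, scored on the full data."""
    rng = np.random.default_rng(random_state)
    n = X.shape[0]
    if sample_size is None:
        # Assumed default (hypothetical here): 40 + 2 * n_clusters.
        sample_size = min(n, 40 + 2 * n_clusters)
    best_medoids, best_cost = None, np.inf
    for _ in range(n_draws):
        idx = rng.choice(n, size=sample_size, replace=False)
        sample = X[idx]
        local = pam(pairwise_distances(sample, sample), n_clusters, rng)
        medoids = idx[local]  # map sample indices back to X
        cost = pairwise_distances(X, X[medoids]).min(axis=1).sum()
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    labels = pairwise_distances(X, X[best_medoids]).argmin(axis=1)
    return best_medoids, labels


def k_medoids(X, n_clusters, clustering="pam", random_state=0):
    """Hypothetical entry point dispatching on the `clustering` argument."""
    if clustering == "pam":
        rng = np.random.default_rng(random_state)
        medoids = pam(pairwise_distances(X, X), n_clusters, rng)
        labels = pairwise_distances(X, X[medoids]).argmin(axis=1)
        return medoids, labels
    if clustering == "clara":
        return clara(X, n_clusters, random_state=random_state)
    raise ValueError("clustering must be 'pam' or 'clara'")
```

With 'pam' the full n_samples x n_samples distance matrix is built, which is what hurts on large datasets; with 'clara' only the subsample matrices and the distances to the candidate medoids are computed.
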
In a nutshell, CLARA repeatedly draws random subsamples (each smaller than n_samples), applies PAM to each of them, and keeps the clustering that scores best on the full dataset.

Here are some good resources for PAM and CLARA:

- PAM (Partitioning Around Medoids): https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/Partitioning_Around_Medoids_(PAM)
- CLARA (Clustering for Large Applications): https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/CLARA

Best,
Sebastian

> On Jul 31, 2015, at 12:46 PM, Andreas Mueller <t3k...@gmail.com> wrote:
>
> Cool.
> Including the code in scikit-learn is often a bit of a process but it might
> indeed be interesting.
> You could just start with a pull request - or publish a gist if you don't
> think you'll have time to work on the inclusion and leave that part to
> someone else.
>
> Cheers,
> Andy
>
> On 07/31/2015 05:38 AM, Timo Erkkilä wrote:
>> That makes sense. The basic implementation is definitely short, just ~20
>> lines of code if you don't count comments etc. I can make the source code
>> available so that you can judge whether it's worth taking further. I am
>> familiar with the documentation tools you are using (Sphinx with NumPy
>> style docstrings) in scikit-learn, but that's further down the line.
>>
>> Cheers,
>> Timo
>>
>> On Fri, Jul 31, 2015 at 10:53 AM, Gael Varoquaux
>> <gael.varoqu...@normalesup.org> wrote:
>> > Is it required that an algorithm, which is implemented in Scikit-Learn,
>> > scales well wrt n_samples?
>>
>> The requirement is 'be actually useful', which is something that is a bit
>> hard to judge :).
>>
>> I think that K-medoids is borderline on this requirement, probably on the
>> right side of the border. I would tend to say that if the code is clean and
>> reasonably short (that last requirement is important), and it comes with
>> good tests, examples and documentation, it should be possible to merge it
>> in.
>>
>> Sorry, we are indeed being picky. It's a struggle to find the right
>> feature set to keep the package maintainable while providing great value
>> to our users.
>>
>> Gaël
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general