Is there any interest in adding divisive hierarchical clustering algorithms
to scikit-learn? They are useful for document clustering [1] and
biostatistics [2], and can have much better time complexity than
agglomerative approaches (the method in [1] can run in roughly O(n*log(k)),
where n is the number of samples and k is the number of clusters). Divisive
algorithms also allow learning that uses only the information within a
given sub-cluster, such as rebuilding a tf-idf corpus at each hierarchy
level, and they allow for more sophisticated stopping criteria than a fixed
number of clusters.
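
To make the idea concrete, here is a minimal pure-Python sketch of a
divisive scheme that repeatedly bisects a cluster with 2-means and stops
based on a user-supplied predicate rather than a cluster count. It is not
the implementation linked in this message, and it is not a faithful
reproduction of the algorithms in [1] or [2]; the names Node,
divisive_cluster, spread_above and leaves are all hypothetical and exist
only for illustration.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

Point = Sequence[float]


@dataclass
class Node:
    """One node of the hierarchy (hypothetical structure, for illustration)."""
    indices: List[int]
    children: List["Node"] = field(default_factory=list)


def _sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(points: List[Point]) -> List[float]:
    return [sum(col) / len(points) for col in zip(*points)]


def _two_means(points: List[Point], rng: random.Random, n_iter: int = 20) -> List[int]:
    """Label each point 0 or 1 using plain Lloyd iterations with k=2."""
    first = points[rng.randrange(len(points))]
    # Second seed: the point farthest from the first one.
    second = max(points, key=lambda p: _sq_dist(p, first))
    centers = [list(first), list(second)]
    labels: List[int] = []
    for _ in range(n_iter):
        new_labels = [
            0 if _sq_dist(p, centers[0]) <= _sq_dist(p, centers[1]) else 1
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for side in (0, 1):
            members = [p for p, l in zip(points, labels) if l == side]
            if members:
                centers[side] = _mean(members)
    return labels


def divisive_cluster(X: List[Point],
                     should_split: Callable[[List[int]], bool],
                     seed: int = 0) -> Node:
    """Build the tree top-down; should_split decides when to stop."""
    rng = random.Random(seed)
    root = Node(list(range(len(X))))
    stack = [root]
    while stack:
        node = stack.pop()
        if len(node.indices) < 2 or not should_split(node.indices):
            continue
        labels = _two_means([X[i] for i in node.indices], rng)
        left = [i for i, l in zip(node.indices, labels) if l == 0]
        right = [i for i, l in zip(node.indices, labels) if l == 1]
        if not left or not right:
            continue  # degenerate split, e.g. all points identical
        node.children = [Node(left), Node(right)]
        stack.extend(node.children)
    return root


def spread_above(X: List[Point], threshold: float) -> Callable[[List[int]], bool]:
    """Example criterion: split while some point is far from the centroid."""
    def predicate(indices: List[int]) -> bool:
        pts = [X[i] for i in indices]
        center = _mean(pts)
        return max(_sq_dist(p, center) for p in pts) > threshold
    return predicate


def leaves(node: Node) -> List[Node]:
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


X = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
tree = divisive_cluster(X, spread_above(X, threshold=2.0))
clusters = sorted(sorted(leaf.indices) for leaf in leaves(tree))
```

Because each split only sees the points of the node being divided, the
predicate (or any per-node preprocessing, such as recomputing tf-idf
weights) can depend on that sub-cluster alone. The returned tree keeps
every intermediate level, so it can be cut at any depth afterwards.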

There's a rough (and incomplete) implementation at
https://github.com/schets/scikit-learn/blob/splitting-hierarchical-clustering/sklearn/cluster/splitting_hierarchical.py
that allows a variety of divisive algorithms to be readily used,
including the algorithms from both [1] and [2].

I've also had great success using the generated tree hierarchies as a sort
of template for aggregating data / dynamic clustering, although I'm not
sure how well that would fit with the standard scikit-learn API or how
common my use case is.

1. http://www.researchgate.net/profile/Vipin_Kumar26/publication/2628533_A_Comparison_of_Document_Clustering_Techniques/links/00b4951675a8a82fcc000000.pdf
2. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0002247

Best,
Sam