[Scikit-learn-general] Speeding up K-means clustering model with fast approximate neighbor search methods

Maheshakya Wijewardena Wed, 09 Apr 2014 09:59:47 -0700

Hi,

Currently, the K-means clustering model in scikit-learn uses an
expectation-maximization (EM) style algorithm to determine the cluster
centers and labels. In my opinion, the best place to apply LSH-based ANN
methods (proposed as a GSoC project) is the E step of that algorithm, where
the assignment of each data point is determined for the current setting of
the cluster centers.
ANN search can be applied to find the nearest cluster center of each data
point. In `sklearn/cluster/k_means_.py`, the `_labels_inertia` function
computes the assignments using the `_assign_labels_array` and
`_assign_labels_csr` functions. These functions choose the center with the
minimum Euclidean distance. Instead, the nearest center could be
approximated with an ANN search.
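
To make the idea concrete, here is a rough sketch of the two kinds of E
step. It is not the scikit-learn code: `assign_exact` and `assign_lsh` are
hypothetical names, and the hashing scheme (random hyperplanes through the
origin) is a simplifying assumption. That scheme is really tuned to angular
similarity, so a real implementation for Euclidean distance would probably
need a different hash family and several hash tables.

```python
import numpy as np


def assign_exact(X, centers):
    """Exact E step: label each point with its closest center."""
    # Squared Euclidean distance from every point to every center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def assign_lsh(X, centers, n_bits=4, seed=0):
    """Approximate E step (hypothetical sketch, single hash table).

    Points and centers are hashed with the same random hyperplanes.
    A point is only compared against the centers in its own bucket,
    falling back to all centers when that bucket holds no center.
    """
    rng = np.random.RandomState(seed)
    planes = rng.randn(X.shape[1], n_bits)
    weights = 1 << np.arange(n_bits)  # turn sign bits into an integer key
    center_keys = (centers.dot(planes) > 0).astype(np.int64).dot(weights)
    point_keys = (X.dot(planes) > 0).astype(np.int64).dot(weights)

    labels = np.empty(X.shape[0], dtype=np.intp)
    all_centers = np.arange(centers.shape[0])
    for key in np.unique(point_keys):
        rows = np.flatnonzero(point_keys == key)
        candidates = np.flatnonzero(center_keys == key)
        if candidates.size == 0:
            candidates = all_centers  # no center in this bucket
        local = assign_exact(X[rows], centers[candidates])
        labels[rows] = candidates[local]
    return labels
```

Calling `assign_lsh(X, centers)` in place of `assign_exact(X, centers)`
returns labels of the same shape, but a point may end up with a center that
is not its true nearest one. Whether the saved distance computations are
worth that loss is exactly what would need to be benchmarked.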


This is my current plan. Your feedback is welcome.

Best regards,
Maheshakya
-- 
Undergraduate,
Department of Computer Science and Engineering,
Faculty of Engineering.
University of Moratuwa,
Sri Lanka


_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
