K doesn't matter much; I've tried anything from 2^10 to 10^3 and the
performance doesn't change much as measured by precision @ K (see Table 1 in
http://machinelearning.wustl.edu/mlpapers/papers/weston13). Although 10^3
k-means did slightly outperform 2^10 hierarchical SVD on those metrics, the
2^10 hierarchical SVD was much faster in terms of inference time.
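(Rough arithmetic on why the hierarchy wins at inference, assuming for
illustration a binary tree of depth 10 for the 2^10 case: a flat scan over
10^3 centroids costs about 1000 distance computations per query, whereas
descending the tree costs about 2 comparisons per level times 10 levels = 20,
ignoring any backtracking. The actual branching factor isn't stated above, so
treat these numbers purely as an order-of-magnitude illustration.)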
I found that the thing that affected performance most was adding back
tracking to fix mistakes made at higher levels, rather than how K is picked
per level (see the rough sketch at the end of this message).

On Tue, Jul 8, 2014 at 1:50 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> Sure. The more interesting problem here is choosing k at each level. Kernel
> methods seem to be the most promising.
>
> On Tue, Jul 8, 2014 at 1:31 PM, Hector Yee <hector....@gmail.com> wrote:
>
> > No idea, never looked it up. I've always just implemented it as running
> > k-means again on each cluster.
> >
> > FWIW, standard k-means with Euclidean distance has problems too with some
> > dimensionality reduction methods. Swapping out the distance metric for
> > negative dot product or cosine may help.
> >
> > A more useful clustering method would be hierarchical SVD. The reason I
> > like hierarchical clustering is that it makes for faster inference,
> > especially over billions of users.
> >
> > On Tue, Jul 8, 2014 at 1:24 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> > wrote:
> >
> > > Hector, could you share the references for hierarchical k-means? Thanks.
> > >
> > > On Tue, Jul 8, 2014 at 1:01 PM, Hector Yee <hector....@gmail.com> wrote:
> > >
> > > > I would say that for big-data applications the most useful would be
> > > > hierarchical k-means with back tracking and the ability to support
> > > > k nearest centroids.
> > > >
> > > > On Tue, Jul 8, 2014 at 10:54 AM, RJ Nowling <rnowl...@gmail.com> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > MLlib currently has one clustering algorithm implementation, KMeans.
> > > > > It would benefit from having implementations of other clustering
> > > > > algorithms such as MiniBatch KMeans, Fuzzy C-Means, Hierarchical
> > > > > Clustering, and Affinity Propagation.
> > > > >
> > > > > I recently submitted a PR [1] for a MiniBatch KMeans implementation,
> > > > > and I saw an email on this list about interest in implementing Fuzzy
> > > > > C-Means.
> > > > >
> > > > > Based on Sean Owen's review of my MiniBatch KMeans code, it became
> > > > > apparent that before I implement more clustering algorithms, it would
> > > > > be useful to hammer out a framework to reduce code duplication and
> > > > > implement a consistent API.
> > > > >
> > > > > I'd like to gauge the interest and goals of the MLlib community:
> > > > >
> > > > > 1. Are you interested in having more clustering algorithms available?
> > > > >
> > > > > 2. Is the community interested in specifying a common framework?
> > > > >
> > > > > Thanks!
> > > > > RJ
> > > > >
> > > > > [1] - https://github.com/apache/spark/pull/1248
> > > > >
> > > > > --
> > > > > em rnowl...@gmail.com
> > > > > c 954.496.2314
> > > >
> > > > --
> > > > Yee Yang Li Hector <http://google.com/+HectorYee>
> >
> > --
> > Yee Yang Li Hector <http://google.com/+HectorYee>

--
Yee Yang Li Hector <http://google.com/+HectorYee>
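For anyone who wants something concrete to poke at, here is a minimal,
self-contained Scala sketch of the lookup side of what's described above: a
hierarchical k-means style tree (hand-built here rather than trained), cosine
similarity in place of Euclidean distance, and retrieval of the k nearest
leaf centroids. The "back tracking" is approximated by keeping the best few
branches per level (a small search beam) so a wrong turn near the root can be
corrected further down. That reading, and all of the names below
(HierarchicalLookup, Node, beam), are illustrative assumptions rather than
code from MLlib or from the paper linked above.

object HierarchicalLookup {

  // Each node holds a centroid; internal nodes also hold children.
  // In a real system the tree would come from running k-means recursively
  // within each cluster; here it is hand-built so the example runs as-is.
  final case class Node(id: Int, centroid: Array[Double], children: Seq[Node] = Nil) {
    def isLeaf: Boolean = children.isEmpty
  }

  // Cosine similarity, standing in for the "negative dot or cosine" swap
  // suggested above as an alternative to Euclidean distance.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val na  = math.sqrt(a.map(x => x * x).sum)
    val nb  = math.sqrt(b.map(x => x * x).sum)
    if (na == 0.0 || nb == 0.0) 0.0 else dot / (na * nb)
  }

  // Return the k nearest leaf centroids to the query, exploring at most
  // `beam` branches per level instead of committing greedily to one, so a
  // mistake near the root can be corrected at a lower level.
  def kNearestLeaves(root: Node, query: Array[Double], k: Int, beam: Int): Seq[(Node, Double)] = {
    var frontier: Seq[Node] = Seq(root)
    var leaves: Seq[Node]   = Seq.empty
    while (frontier.nonEmpty) {
      val (leafPart, innerPart) = frontier.partition(_.isLeaf)
      leaves = leaves ++ leafPart
      frontier = innerPart
        .flatMap(_.children)
        .sortBy(n => -cosine(n.centroid, query))
        .take(beam)
    }
    leaves
      .map(n => n -> cosine(n.centroid, query))
      .sortBy(pair => -pair._2)
      .take(k)
  }

  def main(args: Array[String]): Unit = {
    // Tiny two-level tree: two top-level clusters, two leaves each.
    val tree = Node(0, Array(0.0, 0.0), Seq(
      Node(1, Array(1.0, 0.1), Seq(Node(3, Array(1.0, 0.0)), Node(4, Array(0.9, 0.4)))),
      Node(2, Array(0.1, 1.0), Seq(Node(5, Array(0.0, 1.0)), Node(6, Array(0.3, 0.9))))))

    val query = Array(0.8, 0.6)
    // With beam = 2 the search also reaches leaf 6 in the other top-level
    // cluster, which a purely greedy descent into cluster 1 would have missed.
    kNearestLeaves(tree, query, k = 2, beam = 2).foreach {
      case (leaf, sim) => println(f"leaf ${leaf.id} similarity $sim%.3f")
    }
  }
}

Swapping cosine for negative Euclidean distance or a plain dot product only
changes the scoring function; the beam descent and the k-nearest-centroid
return stay the same.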