It is a different algorithm altogether. See http://www.math.uwaterloo.ca/~cswamy/papers/kmeansfnl.pdf for reference.
The basic idea underlying the algorithm is two-fold:

- pick good starting points that have a high likelihood of being in the core of each cluster
- use points in the near neighborhood of the starting points to update the centroids

With well-clusterable data, this gives a clustering algorithm that provably achieves high quality in a single step with high probability. With less well-clusterable data, the situation is less clear, but a few iterations typically produce very good results.

The initialization algorithm is similar to k-means++. See Wikipedia for more about that. (A rough sketch of both steps is appended below the quoted thread.)

On Thu, Aug 9, 2012 at 9:06 AM, Paritosh Ranjan <[email protected]> wrote:
> Ted,
>
> Is the ball k-means related to the ball tree based implementation of knn,
> or is it a different algorithm altogether?
>
> On 09-08-2012 19:54, Ted Dunning wrote:
>>
>> Also, the knn package has a single-pass k-means implementation that
>> can easily handle 20,000 clusters or more. This is done by using an
>> approximate nearest neighbor algorithm inside the k-means
>> implementation to decrease the time dependency on k to roughly log k.
>>
>> See http://github.com/tdunning/knn
>>
>> Any help in testing these new capabilities or plumbing them into the
>> standard Mahout capabilities would be very much appreciated.
>>
>> On Thu, Aug 9, 2012 at 7:05 AM, Ted Dunning <[email protected]> wrote:
>>>
>>> The upcoming knn package has a file-based matrix implementation that uses
>>> memory mapping to allow sharing a copy of a large matrix between processes
>>> and threads.
>>>
>>> Sent from my iPhone
>>>
>>> On Aug 9, 2012, at 1:48 AM, Abramov Pavel <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> I think Zipf's law is relevant for my data. Thanks for the idea about
>>>> memory mapping.
>>>>
>>>> 1) How can I "drop" extremely small/large clusters? 50% of the clusters
>>>> are small with only 1 member, while 1 large cluster has 50% of the
>>>> members. Is there a way to "split" large clusters with KMeans?
>>>>
>>>> 2) Can I force Mahout not to use the exact centroid but the closest
>>>> point from my set? Any point has ~10 non-zero components while the exact
>>>> centroid is very dense (~200k).
>>>>
>>>> Thanks!
>>>>
>>>> Pavel
>>>>
>>>> On 09.08.12 at 5:43, "Lance Norskog" <[email protected]> wrote:
>>>>
>>>>> If Zipf's Law is relevant, locality will be much better than random.
>>>>> Maybe you need a Vector implementation that is backed by memory-mapped
>>>>> files?
>>>>>
>>>>> On Wed, Aug 8, 2012 at 12:26 PM, Abramov Pavel <[email protected]>
>>>>> wrote:
>>>>>>
>>>>>> Thank you Jeff, Paritosh,
>>>>>>
>>>>>> Reducing the cluster count from 1000 to 100 made my day. I will also
>>>>>> try Canopy for the initial cluster count.
>>>>>> Unfortunately I don't know how to reduce my 200k dictionary. There are
>>>>>> no infrequent terms.
>>>>>>
>>>>>> I will provide you with the Hadoop config shortly. But I am pretty sure
>>>>>> I can't decrease the number of Mappers/Reducers per node or increase
>>>>>> memory limits. It would affect the whole cluster.
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> Pavel
>>>>>>
>>>>>> On 08.08.12 at 16:15, "Jeff Eastman" <[email protected]> wrote:
>>>>>>
>>>>>>> Consider that each cluster retains 4 vectors in memory in each mapper
>>>>>>> and reducer, and that these vectors tend to become more dense (through
>>>>>>> addition of multiple sparse components) as iterations proceed. With
>>>>>>> 1000 clusters and 200k terms in your dictionary this will cause the
>>>>>>> heap space to be consumed rapidly, as you have noted. Sometimes you
>>>>>>> can work around this problem by increasing your heap size on a per-job
>>>>>>> basis or by reducing the number of mappers and reducers allowed on
>>>>>>> each node. Also be sure you are not launching reducers until all of
>>>>>>> your mapper tasks have completed.
>>>>>>>
>>>>>>> In order to provide more help to you, it would be useful to understand
>>>>>>> more about how your cluster is "well tuned": how many mappers &
>>>>>>> reducers you are launching in parallel, the heap-space limits set for
>>>>>>> tasks on each node, etc.
>>>>>>>
>>>>>>> For a quick test, try reducing k to 500 or 100 to see how many
>>>>>>> clusters you can reasonably compute with your dataset on your cluster.
>>>>>>> Canopy is also a good way to get a feel for a good initial k, though
>>>>>>> it can be hard to arrive at good T-values in some text clustering
>>>>>>> cases. You can also try hierarchical clustering with reduced k to stay
>>>>>>> under your memory limits.
>>>>>>>
>>>>>>> On 8/8/12 5:40 AM, Paritosh Ranjan wrote:
>>>>>>>>
>>>>>>>> A stack trace of the error would have helped in finding the exact
>>>>>>>> error.
>>>>>>>>
>>>>>>>> However, the number of clusters can create heap space problems (if
>>>>>>>> the vector dimension is also high).
>>>>>>>> Either try to reduce the number of initial clusters (in my opinion,
>>>>>>>> the best way to know about initial clusters is Canopy Clustering:
>>>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering)
>>>>>>>> or try to reduce the dimension of the vectors.
>>>>>>>>
>>>>>>>> PS: you are also providing numClusters twice:
>>>>>>>> --numClusters 1000 \ --numClusters 5 \
>>>>>>>>
>>>>>>>> On 08-08-2012 10:42, Abramov Pavel wrote:
>>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I am trying to run the KMeans example on 15,000,000 documents
>>>>>>>>> (seq2sparse output).
>>>>>>>>> There are 1,000 clusters, a 200,000-term dictionary and 3-10 terms
>>>>>>>>> per document (titles). seq2sparse produces 200 files of 80 MB each.
>>>>>>>>>
>>>>>>>>> My job failed with a Java heap space error. The 1st iteration passes
>>>>>>>>> while the 2nd iteration fails. In the Map phase of buildClusters I
>>>>>>>>> see a lot of warnings, but it passes. The Reduce phase of
>>>>>>>>> buildClusters fails with "Java heap space".
>>>>>>>>>
>>>>>>>>> I cannot increase reducer/mapper memory in Hadoop. My cluster is
>>>>>>>>> tuned well.
>>>>>>>>>
>>>>>>>>> How can I avoid this situation? My cluster has 300 Mappers and 220
>>>>>>>>> Reducers running on 40 servers (8 cores, 12 GB RAM each).
>>>>>>>>>
>>>>>>>>> Thanks in advance!
>>>>>>>>>
>>>>>>>>> Here are the KMeans parameters:
>>>>>>>>>
>>>>>>>>> ------------------------------------------------
>>>>>>>>> mahout kmeans -Dmapred.reduce.tasks=200 \
>>>>>>>>> -i ...tfidf-vectors/ \
>>>>>>>>> -o /tmp/clustering_results_kmeans/ \
>>>>>>>>> --clusters /tmp/clusters/ \
>>>>>>>>> --numClusters 1000 \
>>>>>>>>> --numClusters 5 \
>>>>>>>>> --overwrite \
>>>>>>>>> --clustering
>>>>>>>>> ------------------------------------------------
>>>>>>>>>
>>>>>>>>> Pavel
>>>>>
>>>>> --
>>>>> Lance Norskog
>>>>> [email protected]
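
P.S. For anyone who wants the two steps above spelled out, here is a rough single-machine sketch of the idea: k-means++-style seeding followed by one centroid update that only uses points inside a small ball around each seed. It is only an illustration, assuming plain dense double[] vectors and Euclidean distance; the class and method names are invented for the sketch and are not the Mahout or knn-package API. See the paper linked above for the real algorithm and its constants.

------------------------------------------------
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/**
 * Illustrative sketch only: k-means++-style seeding plus one "ball" centroid
 * update, over plain dense double[] vectors with Euclidean distance.
 * This is NOT the Mahout or knn-package code; the names here are invented.
 */
public class BallKMeansSketch {

  /** Squared Euclidean distance between two dense vectors. */
  private static double dist2(double[] a, double[] b) {
    double s = 0.0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      s += d * d;
    }
    return s;
  }

  /**
   * k-means++-style seeding: each new seed is drawn with probability
   * proportional to its squared distance from the nearest seed chosen so far,
   * so seeds tend to land in the cores of distinct clusters.
   */
  static List<double[]> seed(List<double[]> data, int k, Random rnd) {
    List<double[]> seeds = new ArrayList<>();
    seeds.add(data.get(rnd.nextInt(data.size())).clone());
    while (seeds.size() < k) {
      double[] d2 = new double[data.size()];
      double total = 0.0;
      for (int i = 0; i < data.size(); i++) {
        double best = Double.MAX_VALUE;
        for (double[] s : seeds) {
          best = Math.min(best, dist2(data.get(i), s));
        }
        d2[i] = best;
        total += best;
      }
      // Sample an index with probability d2[i] / total.
      double r = rnd.nextDouble() * total;
      int pick = 0;
      double cum = d2[0];
      while (cum < r && pick < data.size() - 1) {
        pick++;
        cum += d2[pick];
      }
      seeds.add(data.get(pick).clone());
    }
    return seeds;
  }

  /**
   * One ball update: each centroid is recomputed as the mean of only those
   * points that lie within ratio * (distance to the nearest other centroid),
   * so boundary points and outliers do not pull it off-center. The exact
   * constant used in the paper's analysis is given there; a fraction on the
   * order of 1/3 is the flavor of it.
   */
  static void ballUpdate(List<double[]> data, List<double[]> centroids, double ratio) {
    int k = centroids.size();
    int dim = centroids.get(0).length;
    double[] radius2 = new double[k];
    for (int i = 0; i < k; i++) {
      double nearestOther = Double.MAX_VALUE;
      for (int j = 0; j < k; j++) {
        if (j != i) {
          nearestOther = Math.min(nearestOther, dist2(centroids.get(i), centroids.get(j)));
        }
      }
      radius2[i] = ratio * ratio * nearestOther;  // squared ball radius for centroid i
    }
    double[][] sums = new double[k][dim];
    int[] counts = new int[k];
    for (double[] x : data) {
      int best = 0;
      double bestD = Double.MAX_VALUE;
      for (int i = 0; i < k; i++) {
        double d = dist2(x, centroids.get(i));
        if (d < bestD) { bestD = d; best = i; }
      }
      if (bestD <= radius2[best]) {               // only points inside the ball vote
        counts[best]++;
        for (int j = 0; j < dim; j++) {
          sums[best][j] += x[j];
        }
      }
    }
    for (int i = 0; i < k; i++) {
      if (counts[i] > 0) {
        for (int j = 0; j < dim; j++) {
          centroids.get(i)[j] = sums[i][j] / counts[i];
        }
      }
    }
  }
}
------------------------------------------------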
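
And since the original command came up again in the quoted thread, here is one hedged way Jeff's per-job suggestions could be expressed on the command line, assuming a Hadoop 1.x-era cluster where -D properties are passed through to the job configuration (as the original -Dmapred.reduce.tasks=200 already is). The heap value, the reduced k, and the iteration cap are illustrative guesses rather than recommendations, the input path is left elided as in the original, and the duplicate --numClusters flag is dropped (if the second one was meant to be an iteration cap, --maxIter is the flag for that, but that is only a guess).

------------------------------------------------
mahout kmeans -Dmapred.reduce.tasks=200 \
  -Dmapred.child.java.opts=-Xmx2048m \
  -Dmapred.reduce.slowstart.completed.maps=1.0 \
  -i ...tfidf-vectors/ \
  -o /tmp/clustering_results_kmeans/ \
  --clusters /tmp/clusters/ \
  --numClusters 100 \
  --maxIter 5 \
  --overwrite \
  --clustering
------------------------------------------------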
