It is a different algorithm altogether.

See http://www.math.uwaterloo.ca/~cswamy/papers/kmeansfnl.pdf for reference.

The basic idea underlying the algorithm is two-fold:

- pick good starting points that have a high likelihood of being in the
core of each cluster

- use points in the near neighborhood of the starting points to update
the centroids

With well-clusterable data, this gives a clustering algorithm that
provably achieves high quality in a single step with high probability.
With less well-clusterable data, the situation is less clear, but a
few iterations typically produce very good results.
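
To make that concrete, here is a rough, self-contained Java sketch of
the ball update step.  The class name, the plain double[] vectors and
the 1/3 trim radius are illustrative choices for exposition, not the
actual code in the knn package.

public class BallKMeansSketch {
  // One ball update step: assign each point to its nearest seed, then
  // recompute each centroid from only the points that fall inside a
  // ball around that seed (here: within 1/3 of the distance to the
  // nearest other seed, an illustrative trim factor).
  static double[][] ballUpdate(double[][] points, double[][] seeds) {
    int k = seeds.length, d = seeds[0].length;
    double[][] sums = new double[k][d];
    int[] counts = new int[k];

    // radius[i] = 1/3 of the distance from seed i to its nearest other seed
    double[] radius = new double[k];
    for (int i = 0; i < k; i++) {
      double nearest = Double.MAX_VALUE;
      for (int j = 0; j < k; j++) {
        if (j != i) nearest = Math.min(nearest, distance(seeds[i], seeds[j]));
      }
      radius[i] = nearest / 3.0;
    }

    for (double[] p : points) {
      // find the nearest seed for this point
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int i = 0; i < k; i++) {
        double dist = distance(p, seeds[i]);
        if (dist < bestDist) { bestDist = dist; best = i; }
      }
      // only points well inside the cluster core contribute to the update
      if (bestDist <= radius[best]) {
        for (int c = 0; c < d; c++) sums[best][c] += p[c];
        counts[best]++;
      }
    }

    double[][] centroids = new double[k][d];
    for (int i = 0; i < k; i++) {
      if (counts[i] == 0) centroids[i] = seeds[i].clone();  // empty ball: keep the seed
      else for (int c = 0; c < d; c++) centroids[i][c] = sums[i][c] / counts[i];
    }
    return centroids;
  }

  static double distance(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) { double diff = a[i] - b[i]; s += diff * diff; }
    return Math.sqrt(s);
  }
}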

The initialization algorithm is similar to k-means++.  See Wikipedia
for more about that.
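
For reference, a minimal sketch of that style of seeding (plain D^2
sampling as in k-means++; the initializer in the knn package differs
in detail):

import java.util.Arrays;
import java.util.Random;

public class DSquaredSeeding {
  // Pick the first seed uniformly at random, then pick each subsequent
  // seed with probability proportional to its squared distance from the
  // nearest seed chosen so far.
  static double[][] chooseSeeds(double[][] points, int k, Random rng) {
    int n = points.length;
    double[][] seeds = new double[k][];
    seeds[0] = points[rng.nextInt(n)].clone();

    double[] distSq = new double[n];
    Arrays.fill(distSq, Double.MAX_VALUE);

    for (int s = 1; s < k; s++) {
      // update each point's squared distance to its nearest seed so far
      double total = 0;
      for (int i = 0; i < n; i++) {
        distSq[i] = Math.min(distSq[i], squaredDistance(points[i], seeds[s - 1]));
        total += distSq[i];
      }
      // sample the next seed with probability proportional to distSq
      double target = rng.nextDouble() * total;
      int chosen = n - 1;
      for (int i = 0; i < n; i++) {
        target -= distSq[i];
        if (target <= 0) { chosen = i; break; }
      }
      seeds[s] = points[chosen].clone();
    }
    return seeds;
  }

  static double squaredDistance(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) { double diff = a[i] - b[i]; s += diff * diff; }
    return s;
  }
}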

On Thu, Aug 9, 2012 at 9:06 AM, Paritosh Ranjan <[email protected]> wrote:
> Ted,
>
> Is the ball k-means related to the ball-tree-based implementation of knn, or
> is it a different algorithm altogether?
>
> On 09-08-2012 19:54, Ted Dunning wrote:
>>
>> Also, the knn package has a single pass k-means implementation that
>> can easily handle 20,000 clusters or more.  This is done by using an
>> approximate nearest neighbor algorithm inside the k-means
>> implementation to decrease the time dependency on k to roughly log k.
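
A back-of-the-envelope view of why that helps: brute-force assignment
compares each point against all k centroids, so one pass over n points
of dimension d costs on the order of n * k * d operations.  If the
centroids instead sit in an approximate nearest neighbor structure
whose lookups touch on the order of log k candidates, one pass drops
to roughly n * d * log k.  For k = 20,000 that is about
20,000 / log2(20,000), i.e. on the order of a thousand times fewer
centroid comparisons per point.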
>>
>> See http://github.com/tdunning/knn
>>
>> Any help in testing these new capabilities or plumbing them into the
>> standard Mahout capabilities would be very much appreciated.
>>
>> On Thu, Aug 9, 2012 at 7:05 AM, Ted Dunning <[email protected]> wrote:
>>>
>>> The upcoming knn package has a file based matrix implementation that uses
>>> memory mapping to allow sharing a copy of a large matrix between processes
>>> and threads.
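
The general idea, in rough Java NIO terms (the class and the dense
row-major file layout below are illustrative only, not the knn
package's actual format or API):

import java.io.IOException;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedMatrixSketch {
  // A dense row-major matrix of doubles mapped read-only; every process
  // or thread that maps the same file shares one copy of the pages
  // through the OS page cache.
  private final DoubleBuffer data;
  private final int cols;

  MappedMatrixSketch(Path file, int rows, int cols) throws IOException {
    this.cols = cols;
    try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
      // a single mapping is limited to 2 GB; larger matrices need several mappings
      MappedByteBuffer mapped =
          channel.map(FileChannel.MapMode.READ_ONLY, 0, (long) rows * cols * 8);  // 8 bytes per double
      mapped.order(ByteOrder.nativeOrder());  // assumes the file was written in native byte order
      this.data = mapped.asDoubleBuffer();
      // the mapping remains valid after the channel is closed
    }
  }

  double get(int row, int col) {
    return data.get(row * cols + col);
  }
}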
>>>
>>> Sent from my iPhone
>>>
>>> On Aug 9, 2012, at 1:48 AM, Abramov Pavel <[email protected]>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> I think Zipf's law is relevant for my data. Thanks for the idea about
>>>> memory mapping.
>>>>
>>>> 1) How can I "drop" extremely small/large clusters? Half of the clusters
>>>> are small, with only 1 member each, while 1 large cluster holds 50% of
>>>> the members. Is there a way to "split" large clusters with KMeans?
>>>>
>>>> 2) Can I force Mahout to use the closest point from my set instead of the
>>>> exact centroid? Any point has ~10 non-zero components, while the exact
>>>> centroid is very dense (~200k).
>>>>
>>>>
>>>> Thanks!
>>>>
>>>> Pavel
>>>>
>>>>
>>>> On 09.08.12 at 5:43, "Lance Norskog" <[email protected]> wrote:
>>>>
>>>>> If Zipf's Law is relevant, locality will be much better than random.
>>>>> Maybe you need a Vector implementation that is backed by memory-mapped
>>>>> files?
>>>>>
>>>>> On Wed, Aug 8, 2012 at 12:26 PM, Abramov Pavel
>>>>> <[email protected]>
>>>>> wrote:
>>>>>>
>>>>>> Thank you Jeff, Paritosh,
>>>>>>
>>>>>> Reducing cluster count from 1000 to 100 made my day. I will also try
>>>>>> Canopy for initial cluster count.
>>>>>> Unfortunately I don't know how to reduce my 200k dictionary. There are
>>>>>> no infrequent terms.
>>>>>>
>>>>>> I will provide you with the Hadoop config shortly. But I am pretty sure I
>>>>>> can't decrease the number of mappers/reducers per node or increase memory
>>>>>> limits. It would affect the whole cluster.
>>>>>>
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> Pavel
>>>>>>
>>>>>>
>>>>>> On 08.08.12 at 16:15, "Jeff Eastman" <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Consider that each cluster retains 4 vectors in memory in each mapper
>>>>>>> and reducer, and that these vectors tend to become more dense (through
>>>>>>> addition of multiple sparse components) as iterations proceed. With 1000
>>>>>>> clusters and 200k terms in your dictionary this will cause the heap
>>>>>>> space to be consumed rapidly, as you have noted. Sometimes you can work
>>>>>>> around this problem by increasing your heap size on a per-job basis or
>>>>>>> reducing the number of mappers and reducers allowed on each node. Also
>>>>>>> be sure you are not launching reducers until all of your mapper tasks
>>>>>>> have completed.
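
For a rough sense of scale, if those per-cluster vectors densify
completely and are stored as doubles: 1,000 clusters x 4 vectors x
200,000 dimensions x 8 bytes is about 6.4 GB of cluster state per
task, which is far beyond a typical mapper or reducer heap.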
>>>>>>>
>>>>>>> In order to provide more help, it would be useful to understand more
>>>>>>> about how your cluster is "well tuned": how many mappers & reducers you
>>>>>>> are launching in parallel, the heap-space limits set for tasks on each
>>>>>>> node, etc.
>>>>>>>
>>>>>>> For a quick test, try reducing k to 500 or 100 to see how many
>>>>>>> clusters you can reasonably compute with your dataset on your cluster.
>>>>>>> Canopy is also a good way to get a feel for a good initial k, though it
>>>>>>> can be hard to arrive at good T-values in some text clustering cases.
>>>>>>> You can also try hierarchical clustering with reduced k to stay under
>>>>>>> your memory limits.
>>>>>>>
>>>>>>>
>>>>>>> On 8/8/12 5:40 AM, Paritosh Ranjan wrote:
>>>>>>>>
>>>>>>>> A stack trace of the error would have helped in finding the exact problem.
>>>>>>>>
>>>>>>>> However, the number of clusters can create heap space problems (if the
>>>>>>>> vector dimension is also high).
>>>>>>>> Either try to reduce the number of initial clusters (in my opinion, the
>>>>>>>> best way to learn about good initial clusters is Canopy Clustering:
>>>>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Canopy+Clustering ),
>>>>>>>>
>>>>>>>> or try to reduce the dimension of the vectors.
>>>>>>>>
>>>>>>>> PS: you are also providing numClusters twice:
>>>>>>>>
>>>>>>>> --numClusters 1000 \ --numClusters 5 \
>>>>>>>>
>>>>>>>> On 08-08-2012 10:42, Abramov Pavel wrote:
>>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I am trying to run the KMeans example on 15,000,000 documents
>>>>>>>>> (seq2sparse output).
>>>>>>>>> There are 1,000 clusters, a 200,000-term dictionary, and documents of
>>>>>>>>> 3-10 terms (titles). seq2sparse produces 200 files of 80 MB each.
>>>>>>>>>
>>>>>>>>> My job failed with a Java heap space error. The 1st iteration passes
>>>>>>>>> while the 2nd iteration fails. In the map phase of buildClusters I see
>>>>>>>>> a lot of warnings, but it passes. The reduce phase of buildClusters
>>>>>>>>> fails with "Java heap space".
>>>>>>>>>
>>>>>>>>> I cannot increase reducer/mapper memory in Hadoop. My cluster is
>>>>>>>>> tuned well.
>>>>>>>>>
>>>>>>>>> How can I avoid this situation? My cluster has 300 mappers and 220
>>>>>>>>> reducers running on 40 servers, each with 8 cores and 12 GB RAM.
>>>>>>>>>
>>>>>>>>> Thanks in advance!
>>>>>>>>>
>>>>>>>>> Here are the KMeans parameters:
>>>>>>>>>
>>>>>>>>> ------------------------------------------------
>>>>>>>>> mahout kmeans -Dmapred.reduce.tasks=200 \
>>>>>>>>> -i ...tfidf-vectors/  \
>>>>>>>>> -o /tmp/clustering_results_kmeans/ \
>>>>>>>>> --clusters /tmp/clusters/ \
>>>>>>>>> --numClusters 1000 \
>>>>>>>>> --numClusters 5 \
>>>>>>>>> --overwrite \
>>>>>>>>> --clustering
>>>>>>>>> ------------------------------------------------
>>>>>>>>>
>>>>>>>>> Pavel
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Lance Norskog
>>>>> [email protected]
>
>
>
