On 25.12.2013 14:19, Suneel Marthi wrote:
>>> On Tuesday, December 24, 2013 4:23 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>>
>>> For reference, on a 16 core machine, I was able to run the sequential
>>> version of streaming k-means on 1,000,000 points, each with 10 dimensions,
>>> in about 20 seconds. The map-reduce versions are comparable, subject to
>>> scaling, except for startup time.
>
> @Ted, were you working off the Streaming KMeans impl as in Mahout 0.8? I am
> not sure how this would even have worked for you in sequential mode in light
> of the issues reported against M-1314, M-1358 and M-1380 (all of which
> affect sequential mode), unless you had fixed them locally.
> What were your estimatedDistanceCutoff, number of clusters 'k', and
> projection search, and how much memory did you have to allocate to the
> single Reducer?

If I read the source code correctly, the final reducer clusters the sketch, which should contain m * k * log n intermediate centroids, where k is the number of desired clusters, m is the number of mappers run, and n is the number of data points. Those centroids are expected to be dense, so we can estimate the memory required for the final reducer from this formula.
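To put concrete numbers on that estimate, here is a quick back-of-the-envelope calculation in Java. The mapper count, the vector dimensionality, the per-object overhead and the log base are illustrative assumptions, not measurements from the job in question:

public class SketchMemoryEstimate {
  public static void main(String[] args) {
    long n = 300000; // number of input points, as reported in this thread
    int k = 100;     // desired clusters
    int m = 10;      // number of mappers (assumption)
    int dim = 1000;  // vector dimensionality (assumption)

    // Sketch size: m * k * log n intermediate centroids (log base 2 assumed).
    long centroids = (long) (m * k * (Math.log(n) / Math.log(2)));

    // A dense centroid holds dim doubles at 8 bytes each, plus roughly
    // 64 bytes of object overhead for headers, key and weight (assumption).
    long bytesPerCentroid = 8L * dim + 64;

    long totalBytes = centroids * bytesPerCentroid;
    System.out.printf("~%d centroids, ~%.1f MB in the final reducer%n",
        centroids, totalBytes / (1024.0 * 1024.0));
  }
}

With these assumed numbers the sketch comes to roughly 18,000 dense centroids, on the order of 150 MB for 1,000-dimensional vectors, all of which the single reducer has to hold in memory while it re-clusters them.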
> On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter <s...@apache.org> wrote:
>
>> That the algorithm runs a single reducer is expected. The algorithm
>> creates a sketch of the data in parallel in the map phase, which is
>> collected by the reducer afterwards. The reducer then applies an
>> expensive in-memory clustering algorithm to the sketch.
>>
>> Which dataset are you using for testing? I can also do some tests on a
>> cluster here.
>>
>> I can imagine two possible causes for the problems: Maybe there's a
>> problem with the vectors, and some calculations take very long because
>> the wrong access pattern or implementation is chosen.
>>
>> Another problem could be that the mappers and reducers have too little
>> memory and spend a lot of time running garbage collections.
>>
>> --sebastian
>>
>>
>> On 23.12.2013 22:14, Suneel Marthi wrote:
>>> Has anyone been successful running Streaming KMeans clustering on a
>>> large dataset (> 100,000 points)?
>>>
>>> It just seems to take a very long time (> 4 hrs) for the mappers to
>>> finish on about 300K data points, and the reduce phase has only a single
>>> reducer running, which throws an OOM and fails the job several hours
>>> after it has been kicked off.
>>>
>>> It's the same story when trying to run in sequential mode.
>>>
>>> Looking at the code, the bottleneck seems to be in
>>> StreamingKMeans.clusterInternal(); without understanding the behaviour
>>> of the algorithm, I am not sure if the sequence of steps in there is
>>> correct.
>>>
>>> There are a few calls that invoke themselves repeatedly, over and over
>>> again, like StreamingKMeans.clusterInternal() and Searcher.searchFirst().
>>>
>>> We really need to have this working on datasets that are larger than
>>> the 20K Reuters dataset.
>>>
>>> I am trying to run this on 300K vectors with k = 100, km = 1261 and
>>> FastProjectSearch.
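For reference, here is roughly what a sequential run with the parameters from this thread might look like. This is an untested sketch against my recollection of the 0.8-era API: the constructor signatures (FastProjectionSearch(measure, numProjections, searchSize) and StreamingKMeans(searcher, numClusters, distanceCutoff)), the synthetic Gaussian data, the dimensionality and the distance cutoff are all assumptions, not a reproduction of the failing job:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.mahout.clustering.streaming.cluster.StreamingKMeans;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.Centroid;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.neighborhood.FastProjectionSearch;
import org.apache.mahout.math.neighborhood.UpdatableSearcher;

public class SequentialStreamingKMeansSketch {
  public static void main(String[] args) {
    int n = 300000;  // data points, as reported in this thread
    int dim = 10;    // dimensionality is an assumption (Ted's test used 10)
    int km = 1261;   // sketch size from this thread, roughly k * log n

    // Synthetic Gaussian data, purely for illustration.
    Random rand = new Random(42);
    List<Centroid> points = new ArrayList<Centroid>(n);
    for (int i = 0; i < n; i++) {
      double[] values = new double[dim];
      for (int d = 0; d < dim; d++) {
        values[d] = rand.nextGaussian();
      }
      points.add(new Centroid(i, new DenseVector(values), 1));
    }

    // Searcher over the intermediate centroids:
    // (distance measure, number of projections, search size).
    UpdatableSearcher searcher =
        new FastProjectionSearch(new EuclideanDistanceMeasure(), 3, 10);

    // The initial distanceCutoff is a placeholder guess; the algorithm
    // scales it up whenever the sketch overflows and gets re-clustered.
    StreamingKMeans skm = new StreamingKMeans(searcher, km, 1e-4);
    UpdatableSearcher sketch = skm.cluster(points);
    System.out.println("sketch size: " + sketch.size());
  }
}

One plausible explanation for the behaviour described above is a far-too-small estimatedDistanceCutoff: the sketch then overflows constantly, and every overflow triggers another collapse-and-re-cluster pass through clusterInternal() with its repeated searchFirst() calls.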