@Johannes, how many datapoints did you have in your test?  Since Streaming 
KMeans runs through a single reducer, how much memory did you have to allocate 
if you had around a million data points?  What estimatedDistanceCutoff did you use?

@All, my experience so far has been that once the Mapper phase is done (for 
over a million datapoints), the Reducer mostly fails with OOM errors, 
depending on the number of clusters (or datapoints?) specified.
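
For a rough sense of the memory involved, here is a back-of-envelope sketch in 
plain Java; the mapper count and vector cardinality below are assumptions, not 
measured numbers. The single reducer has to collect every mapper's sketch, 
i.e. roughly km weighted centroids of dimension d per mapper:

public class ReducerMemoryEstimate {
  public static void main(String[] args) {
    // Assumed job shape -- plug in your own numbers.
    int numMappers = 50;       // each mapper emits its own sketch
    int km = 1261;             // sketch centroids per mapper (the km parameter)
    int d = 100_000;           // vector cardinality, assuming dense vectors
    long bytesPerDouble = 8;

    long centroids = (long) numMappers * km;
    long bytes = centroids * (d * bytesPerDouble + bytesPerDouble); // values + weight
    System.out.printf("~%d centroids in the reducer, ~%.1f GB of dense centroid data%n",
        centroids, bytes / 1e9);
    // ~50 GB for d = 100k dense dimensions, but only a few MB for d = 10,
    // so the damage depends far more on d than on k.
  }
}

With dense 100k-dimensional vectors that is tens of GB in a single JVM; with 
low-dimensional or properly sparse vectors it is only a few MB, which is why 
the cardinality question further down the thread matters so much.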

It could be argued that the initial choice of 'k' wasn't right to begin with. 
Fair enough, but how does that make this any different from the old-fashioned 
Canopy -> KMeans way of clustering (at least we had a better estimate of 'k' 
that way)? Not to mention that users have always had OOM issues with the 
single Reducer in Canopy, as reported by several users on user@ and dev@.

On Wednesday, December 25, 2013 4:42 AM, Johannes Schulte 
<johannes.schu...@gmail.com> wrote:
 
Hi,

I also had problems getting up to speed, but I blamed the cardinality of the 
vectors for that. I didn't do the math exactly, but while streaming k-means 
improves over regular k-means by using log(k) and (number of datapoints n / k) 
passes, the dimension parameter d from the original k*d*n stays untouched, 
right?

What is your vector's cardinality?
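
To put rough numbers on why I'm asking (back-of-envelope only; every constant 
below is an assumption, including the ~log2(km) distance computations per 
point for the approximate searcher):

public class CostSketch {
  public static void main(String[] args) {
    long n = 1_000_000;   // data points
    long k = 100;         // final clusters
    long km = 1261;       // sketch size used in the streaming pass
    long d = 100_000;     // vector cardinality -- the suspect here
    long iterations = 10; // assumed number of Lloyd iterations

    // Classic k-means: every iteration compares every point against every centroid.
    long lloyd = n * k * d * iterations;

    // Streaming sketch phase: one pass, each point does an approximate
    // nearest-centroid search; assume ~log2(km) distance computations per point.
    long probesPerPoint = (long) Math.ceil(Math.log(km) / Math.log(2));
    long streaming = n * probesPerPoint * d;

    System.out.println("Lloyd's k-means:  ~" + lloyd + " multiply-adds");
    System.out.println("Streaming sketch: ~" + streaming + " multiply-adds");
    // Both counts carry the same factor of d, so a huge (or accidentally dense)
    // vector cardinality slows the streaming pass just as much per distance.
  }
}

If the vectors are sparse text vectors, the effective d is the average number 
of non-zeros per vector rather than the full cardinality, which changes these 
numbers dramatically.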



On Wed, Dec 25, 2013 at 5:19 AM, Suneel Marthi <suneel_mar...@yahoo.com> wrote:

Ted,
>
>What were the CLI parameters when you ran this test for 1M points - no. of 
>clusters k, km, distanceMeasure, projectionSearch, estimatedDistanceCutoff?
>
>On Tuesday, December 24, 2013 4:23 PM, Ted Dunning <ted.dunn...@gmail.com> 
>wrote:
>
>For reference, on a 16-core machine, I was able to run the sequential
>version of streaming k-means on 1,000,000 points, each with 10 dimensions,
>in about 20 seconds.  The map-reduce versions are comparable, subject to
>scaling, except for startup time.
>
>
>
>On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter <s...@apache.org> wrote:
>
>> That the algorithm runs a single reducer is expected. The algorithm
>> creates a sketch of the data in parallel in the map-phase, which is
>> collected by the reducer afterwards. The reducer then applies an
>> expensive in-memory clustering algorithm to the sketch.
>>
>> Which dataset are you using for testing? I can also do some tests on a
>> cluster here.
>>
>> I can imagine two possible causes for the problems: maybe there's a
>> problem with the vectors and some calculations take a very long time
>> because the wrong access pattern or implementation is chosen.
>>
>> Another problem could be that the mappers and reducers have too little
>> memory and spend a lot of time running garbage collection.
>>
>> --sebastian
>>
>>
>> On 23.12.2013 22:14, Suneel Marthi wrote:
>> > Has anyone been successful running Streaming KMeans clustering on a large
>> > dataset (> 100,000 points)?
>> >
>> >
>> > It just seems to take a very long time (> 4 hrs) for the mappers to
>> > finish on about 300K data points, and the reduce phase has only a single
>> > reducer running, which throws an OOM and fails the job several hours after
>> > it has been kicked off.
>> >
>> > It's the same story when trying to run in sequential mode.
>> >
>> > Looking at the code, the bottleneck seems to be in
>> > StreamingKMeans.clusterInternal(); without understanding the behaviour of
>> > the algorithm, I am not sure if the sequence of steps in there is correct.
>> >
>> >
>> > There are a few calls that are invoked repeatedly, like
>> > StreamingKMeans.clusterInternal() and Searcher.searchFirst().
>> >
>> > We really need to have this working on datasets that are larger than the
>> > 20K Reuters dataset.
>> >
>> > I am trying to run this on 300K vectors with k = 100, km = 1261, and
>> > FastProjectSearch.
>> >
>>
>>
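
For anyone skimming this thread later: the job structure Sebastian describes 
above boils down to something like the minimal, self-contained sketch below 
(plain Java; the approximate searcher, the cutoff growth, and the final ball 
k-means are simplified away, so this illustrates the shape of the algorithm, 
not the actual Mahout classes):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/**
 * Toy illustration of the two phases: each mapper compresses its split into a
 * small set of weighted centroids (a "sketch"); the single reducer must hold
 * every sketch in memory before re-clustering it down to k. Not the Mahout code.
 */
public class TwoPhaseSketchDemo {

  static final class Centroid {
    final double[] point;
    double weight;
    Centroid(double[] p, double w) { this.point = p.clone(); this.weight = w; }
  }

  /** "Map" phase: one pass over the split, folding most points into existing centroids. */
  static List<Centroid> buildSketch(Iterable<double[]> split, double cutoff, Random rnd) {
    List<Centroid> sketch = new ArrayList<>();
    for (double[] x : split) {
      Centroid nearest = null;
      double best = Double.POSITIVE_INFINITY;
      for (Centroid c : sketch) {                  // Mahout uses an approximate searcher here
        double d2 = squaredDistance(x, c.point);
        if (d2 < best) { best = d2; nearest = c; }
      }
      if (nearest == null || rnd.nextDouble() < best / cutoff) {
        sketch.add(new Centroid(x, 1));            // far from everything: new sketch centroid
      } else {
        fold(nearest, x);                          // close enough: merge into nearest centroid
      }
      // The real algorithm also grows the cutoff and collapses the sketch whenever
      // it exceeds km centroids, which keeps its size bounded; omitted for brevity.
    }
    return sketch;
  }

  /** "Reduce" phase: every mapper's sketch lands in one JVM -- the OOM risk lives here. */
  static List<Centroid> collect(List<List<Centroid>> sketchesFromAllMappers) {
    List<Centroid> all = new ArrayList<>();
    for (List<Centroid> s : sketchesFromAllMappers) {
      all.addAll(s);                               // memory ~ numMappers * km * d doubles
    }
    // ...followed by an expensive in-memory (ball) k-means over `all`, down to k clusters.
    return all;
  }

  static void fold(Centroid c, double[] x) {
    for (int i = 0; i < x.length; i++) {
      c.point[i] = (c.point[i] * c.weight + x[i]) / (c.weight + 1);
    }
    c.weight += 1;
  }

  static double squaredDistance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }

  public static void main(String[] args) {
    Random rnd = new Random(42);
    List<double[]> split = new ArrayList<>();
    for (int i = 0; i < 10_000; i++) {
      split.add(new double[] { rnd.nextGaussian() + (i % 2) * 10, rnd.nextGaussian() });
    }
    System.out.println("one mapper's sketch size: " + buildSketch(split, 1.0, rnd).size());
  }
}

Nothing in the collect/re-cluster step is parallel, which is consistent with 
the single-reducer OOMs reported above once numMappers * km * d gets large.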
