On 25.12.2013 14:19, Suneel Marthi wrote:
>>> On Tuesday, December 24, 2013 4:23 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>>
>>> For reference, on a 16 core machine, I was able to run the sequential
>>> version of streaming k-means on 1,000,000 points, each with 10 dimensions,
>>> in about 20 seconds. The map-reduce versions are comparable, subject to
>>> scaling, except for startup time.
>
> @Ted, were you working off the Streaming KMeans impl as in Mahout 0.8? I am
> not sure how this would even have worked for you in sequential mode in light
> of the issues reported against M-1314, M-1358 and M-1380 (all of which
> affect sequential mode), unless you had fixed them locally.
> What were your estimatedDistanceCutoff, number of clusters 'k', and
> projection search, and how much memory did you have to allocate to the
> single Reducer?

If I read the source code correctly, the final reducer clusters the sketch, which should contain m * k * log n intermediate centroids, where k is the number of desired clusters, m is the number of mappers run, and n is the number of data points. Those centroids are expected to be dense, so we can estimate the memory required for the final reducer from this formula.
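To put concrete numbers on that estimate, here is a quick back-of-the-envelope calculation in Java. The mapper count, the vector dimensionality, the per-object overhead and the log base are illustrative assumptions, not measurements from the job in question:

public class SketchMemoryEstimate {
  public static void main(String[] args) {
    long n = 300000; // number of input points, as reported in this thread
    int k = 100;     // desired clusters
    int m = 10;      // number of mappers (assumption)
    int dim = 1000;  // vector dimensionality (assumption)

    // Sketch size: m * k * log n intermediate centroids (log base 2 assumed).
    long centroids = (long) (m * k * (Math.log(n) / Math.log(2)));

    // A dense centroid holds dim doubles at 8 bytes each, plus roughly
    // 64 bytes of object overhead for headers, key and weight (assumption).
    long bytesPerCentroid = 8L * dim + 64;

    long totalBytes = centroids * bytesPerCentroid;
    System.out.printf("~%d centroids, ~%.1f MB in the final reducer%n",
        centroids, totalBytes / (1024.0 * 1024.0));
  }
}

With these assumed numbers the sketch comes to roughly 18,000 dense centroids, on the order of 150 MB for 1,000-dimensional vectors, all of which the single reducer has to hold in memory while it re-clusters them.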
> On Mon, Dec 23, 2013 at 1:41 PM, Sebastian Schelter <s...@apache.org> wrote:
>
>> That the algorithm runs a single reducer is expected. The algorithm
>> creates a sketch of the data in parallel in the map phase, which is
>> collected by the reducer afterwards. The reducer then applies an
>> expensive in-memory clustering algorithm to the sketch.
>>
>> Which dataset are you using for testing? I can also do some tests on a
>> cluster here.
>>
>> I can imagine two possible causes for the problems: Maybe there's a
>> problem with the vectors, and some calculations take very long because
>> the wrong access pattern or implementation is chosen.
>>
>> Another problem could be that the mappers and reducers have too little
>> memory and spend a lot of time running garbage collections.
>>
>> --sebastian
>>
>>
>> On 23.12.2013 22:14, Suneel Marthi wrote:
>>> Has anyone been successful running Streaming KMeans clustering on a
>>> large dataset (> 100,000 points)?
>>>
>>> It just seems to take a very long time (> 4 hrs) for the mappers to
>>> finish on about 300K data points, and the reduce phase has only a single
>>> reducer running, which throws an OOM and fails the job several hours
>>> after it has been kicked off.
>>>
>>> It's the same story when trying to run in sequential mode.
>>>
>>> Looking at the code, the bottleneck seems to be in
>>> StreamingKMeans.clusterInternal(); without understanding the behaviour
>>> of the algorithm, I am not sure if the sequence of steps in there is
>>> correct.
>>>
>>> There are a few calls that invoke themselves repeatedly, over and over
>>> again, like StreamingKMeans.clusterInternal() and Searcher.searchFirst().
>>>
>>> We really need to have this working on datasets that are larger than
>>> the 20K Reuters dataset.
>>>
>>> I am trying to run this on 300K vectors with k = 100, km = 1261 and
>>> FastProjectSearch.
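For reference, here is roughly what a sequential run with the parameters from this thread might look like. This is an untested sketch against my recollection of the 0.8-era API: the constructor signatures (FastProjectionSearch(measure, numProjections, searchSize) and StreamingKMeans(searcher, numClusters, distanceCutoff)), the synthetic Gaussian data, the dimensionality and the distance cutoff are all assumptions, not a reproduction of the failing job:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.mahout.clustering.streaming.cluster.StreamingKMeans;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.Centroid;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.neighborhood.FastProjectionSearch;
import org.apache.mahout.math.neighborhood.UpdatableSearcher;

public class SequentialStreamingKMeansSketch {
  public static void main(String[] args) {
    int n = 300000;  // data points, as reported in this thread
    int dim = 10;    // dimensionality is an assumption (Ted's test used 10)
    int km = 1261;   // sketch size from this thread, roughly k * log n

    // Synthetic Gaussian data, purely for illustration.
    Random rand = new Random(42);
    List<Centroid> points = new ArrayList<Centroid>(n);
    for (int i = 0; i < n; i++) {
      double[] values = new double[dim];
      for (int d = 0; d < dim; d++) {
        values[d] = rand.nextGaussian();
      }
      points.add(new Centroid(i, new DenseVector(values), 1));
    }

    // Searcher over the intermediate centroids:
    // (distance measure, number of projections, search size).
    UpdatableSearcher searcher =
        new FastProjectionSearch(new EuclideanDistanceMeasure(), 3, 10);

    // The initial distanceCutoff is a placeholder guess; the algorithm
    // scales it up whenever the sketch overflows and gets re-clustered.
    StreamingKMeans skm = new StreamingKMeans(searcher, km, 1e-4);
    UpdatableSearcher sketch = skm.cluster(points);
    System.out.println("sketch size: " + sketch.size());
  }
}

One plausible explanation for the behaviour described above is a far-too-small estimatedDistanceCutoff: the sketch then overflows constantly, and every overflow triggers another collapse-and-re-cluster pass through clusterInternal() with its repeated searchFirst() calls.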