N-squared isn't all that scalable. But k-means doesn't require all n^2 distances to be computed; it just requires that distances to the centroids be computed. That is generally vastly less effort, and since the number of clusters is typically bounded by a constant, it makes k-means a nearly linear algorithm.
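
To make that concrete, here is a minimal sketch of the k-means assignment step in plain Java. This is not Mahout's actual KMeansDriver code, and the class and method names are illustrative only; the point is that each of the n points is compared against just the k centroids, so one iteration costs roughly O(n * k * d) rather than the O(n^2) needed for a full self-similarity matrix.

    // Sketch only: each point is compared against the k centroids, never
    // against the other n-1 points, which is what keeps k-means near-linear.
    public class AssignmentStepSketch {

      /** Squared Euclidean distance between two dense feature vectors. */
      static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
          double diff = a[i] - b[i];
          sum += diff * diff;
        }
        return sum;
      }

      /** For each point, return the index of its nearest centroid. */
      static int[] assignToCentroids(double[][] points, double[][] centroids) {
        int[] assignments = new int[points.length];
        for (int p = 0; p < points.length; p++) {          // n points
          int best = 0;
          double bestDist = Double.MAX_VALUE;
          for (int c = 0; c < centroids.length; c++) {     // only k centroids
            double d = squaredDistance(points[p], centroids[c]);
            if (d < bestDist) {
              bestDist = d;
              best = c;
            }
          }
          assignments[p] = best;
        }
        return assignments;
      }
    }
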
On Sun, Nov 29, 2009 at 11:07 PM, prasenjit mukherjee <[email protected]> wrote:
> Thanks for sharing the article. The article focuses mainly on distance
> computation between sequences, which will help us in creating the
> self-similarity matrix. You can then probably apply any standard
> self-similarity-based clustering technique (spectral clustering, k-means, etc.).
>
> The approach sounds okay, except that k-means requires the n x n matrix to
> be computed, which itself could be pretty huge. But as long as you can
> distribute it (which you apparently can) over MapReduce/Mahout, it should
> be fine.
>
> -Prasen
>
> On Fri, Nov 27, 2009 at 9:47 PM, Patterson, Josh <[email protected]> wrote:
> > Prasen,
> > I've been reviewing techniques and literature on data mining in time
> > series, and I found another paper that you might be interested in from
> > the time series "search" domain that deals with similarity of time
> > series data:
> >
> > http://delab.csd.auth.gr/papers/PCI99am.pdf
> >
> > Sequences are transformed into feature vectors, and Euclidean distances
> > between the feature vectors are then calculated. I'm still getting this
> > concept (plus other variations) and Mahout in general "mapped out". I
> > read some on suffix trees, and they look very similar to the k-grams and
> > permutation indexes in the Information Retrieval material. I'm still
> > digesting this time series problem (and its several sub-problems), but I
> > thought I'd throw that paper out there and see what you thought.
> >
> > Josh Patterson
> > TVA
> >
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]] On Behalf Of
> > prasenjit mukherjee
> > Sent: Saturday, November 21, 2009 12:21 AM
> > To: [email protected]
> > Subject: Re: mahout examples
> >
> > Hi Josh,
> >
> > I too am working on clustering time-series data, and am basically trying
> > to come up with a sequence clustering model. I would like to know how
> > you intend to use k-means to achieve that. Are you treating each
> > sequence as a point? If so, what would be your vector representation of
> > a sequence, and, more importantly, which metric (distance computation
> > logic) will you be using?
> >
> > BTW, I am thinking along the lines of STC (suffix-tree-based clustering).
> >
> > -Prasen
> >
> > On Sat, Nov 21, 2009 at 1:26 AM, Patterson, Josh <[email protected]> wrote:
> >> I think in terms of clustering time series data, the first step looks to
> >> be vectorizing the input cases with possibly the DenseVector class and
> >> feeding that to a basic k-means implementation like KMeansDriver.java.
> >> Once we can get the basic k-means rolling with some known dataset, we'll
> >> be able to iterate on that and move towards using more complex
> >> techniques and other grid time series data. Any suggestions or
> >> discussion is greatly appreciated.
> >>
> >> Josh Patterson
> >> TVA

--
Ted Dunning, CTO DeepDyve
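
As a footnote on the vectorization step discussed in the quoted thread (transforming each time series into a fixed-length feature vector and comparing vectors with Euclidean distance), here is a minimal sketch in plain Java. The particular features chosen (mean, standard deviation, min, max) and the names are assumptions for illustration only, not the transform used in the cited paper or Mahout's DenseVector API.

    // Illustrative only: reduce a raw time series to a small fixed-length
    // feature vector, then compare two series by Euclidean distance over
    // the features rather than over the raw sequences.
    public class TimeSeriesFeatureSketch {

      /** Turn a raw series into a 4-element feature vector: mean, std dev, min, max. */
      static double[] toFeatureVector(double[] series) {
        double sum = 0.0, min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        for (double v : series) {
          sum += v;
          min = Math.min(min, v);
          max = Math.max(max, v);
        }
        double mean = sum / series.length;
        double sq = 0.0;
        for (double v : series) {
          sq += (v - mean) * (v - mean);
        }
        double std = Math.sqrt(sq / series.length);
        return new double[] { mean, std, min, max };
      }

      /** Euclidean distance between two feature vectors of equal length. */
      static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
          sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
      }
    }
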
