Prasen, I've been reviewing techniques and literature on data mining in time series and I found another paper that you might be interested in from the time series "search" domain that deals with similarity of time series data:
http://delab.csd.auth.gr/papers/PCI99am.pdf Sequences are transformed into a feature vector and Euclidean distances between the feature vectors are then calculated. I'm still getting this concept (plus other variations) and mahout in general "mapped out". I read some on suffix trees and they look very similar to k-grams and permutation indexes in Information Retrieval material. I'm still digesting this time series problem (and its several sub problems) but I thought I'd throw that paper out there and see what you thought. Josh Patterson TVA -----Original Message----- From: [email protected] [mailto:[email protected]] On Behalf Of prasenjit mukherjee Sent: Saturday, November 21, 2009 12:21 AM To: [email protected] Subject: Re: mahout examples Hi Josh, I too am working on clustering time-series-data, and basically trying to come up with a sequence clustering model. Would like to know how you intend to use K-means to achieve that. Are you treating each sequence as a point ? Then, what would be your vector representation of a sequence and also more importantly which metric ( distance computation logic ) will you be using ? BTW, I am thinking along the lines of STC ( suffix-tree based clustering ). -Prasen On Sat, Nov 21, 2009 at 1:26 AM, Patterson, Josh <[email protected]> wrote: > I think in terms of clustering time series data, the first step looks to > be vectorizing the input cases with possibly the DenseVector class and > feeding that to a basic KMeans implementation like KMeansDriver.java. > Once we can get the basic kmeans rolling with some known dataset we'll > be able to iterate on that and move towards using more complex > techniques and other grid timeseries data. Any suggestions or discussion > is greatly appreciated, > > Josh Patterson > TVA
