The Hadoop way of approaching a problem like the one suffix trees solve would
be to broadcast short symbol sequences into a large counting job.  This is
wasteful in terms of total data transferred, but it makes up for much of that
because the transfers are all very sequential.  You can discard sequences with
few occurrences immediately and then use the counts to filter the list
further, eliminating uninteresting sequences.
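
To make that concrete, the counting job is basically n-gram word count with a
minimum-count cutoff in the reducer.  Here is a rough, untested sketch against
the newer mapreduce API; the class names, the window width of 3 and the cutoff
of 5 are all invented for illustration:

  import java.io.IOException;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class NGramCount {
    // Emit every short symbol window (here length 3) with a count of 1,
    // assuming one whitespace-separated symbol sequence per input line.
    public static class PhraseMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      @Override
      protected void map(LongWritable key, Text line, Context ctx)
          throws IOException, InterruptedException {
        String[] symbols = line.toString().split("\\s+");
        for (int i = 0; i + 3 <= symbols.length; i++) {
          String phrase = symbols[i] + ' ' + symbols[i + 1] + ' ' + symbols[i + 2];
          ctx.write(new Text(phrase), ONE);
        }
      }
    }

    // Sum the counts and drop phrases that occur too rarely to be interesting.
    public static class PhraseReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      private static final int MIN_COUNT = 5;  // arbitrary illustrative cutoff
      @Override
      protected void reduce(Text phrase, Iterable<IntWritable> counts, Context ctx)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
          sum += c.get();
        }
        if (sum >= MIN_COUNT) {
          ctx.write(phrase, new IntWritable(sum));
        }
      }
    }
  }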

Expressing your symbolic sequences by tiling these phrases gives you much of
the temporality that you are interested in and lets you use algorithms like
k-means pretty much directly.
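
For example, building the phrase-count representation of one sequence could be
as simple as the following sketch (PhraseTiler is an invented name, the phrase
width is whatever you used in the counting job, and you would normally also
drop phrases that did not survive the count filter):

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  public class PhraseTiler {
    // Turn one symbol sequence into phrase -> occurrence-count, with one
    // entry per overlapping window of the given width.  The resulting sparse
    // count vectors can be fed to k-means directly.
    public static Map<String, Integer> tile(List<String> symbols, int width) {
      Map<String, Integer> counts = new HashMap<String, Integer>();
      for (int i = 0; i + width <= symbols.size(); i++) {
        StringBuilder phrase = new StringBuilder();
        for (int j = i; j < i + width; j++) {
          if (j > i) {
            phrase.append(' ');
          }
          phrase.append(symbols.get(j));
        }
        String key = phrase.toString();
        Integer old = counts.get(key);
        counts.put(key, old == null ? 1 : old + 1);
      }
      return counts;
    }
  }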

If you don't have symbolic sequences, you have another problem, but you
might get similar results by doing vector quantization on your continuous
time-series expressed in terms of multi-scale localized spectral detectors.
Some problems work well with those techniques, some definitely need more
interesting feature detectors.  The spectral processing and vector
quantization are fairly natural tasks for map-reduce, which is nice.  In
fact, vector quantization is commonly done with some variant of k-means.
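
The quantization step itself is just nearest-centroid assignment.  Something
like the sketch below is all it takes (the class name is invented; the
codebook would come from running k-means on a sample of feature vectors, and
each window of the series gets replaced by the index of its closest centroid,
which is what turns the continuous series into a symbolic sequence):

  public class VectorQuantizer {
    private final double[][] codebook;  // k centroids, each of dimension d

    public VectorQuantizer(double[][] codebook) {
      this.codebook = codebook;
    }

    // Return the index of the nearest centroid under squared Euclidean
    // distance.
    public int quantize(double[] features) {
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int k = 0; k < codebook.length; k++) {
        double dist = 0.0;
        for (int d = 0; d < features.length; d++) {
          double diff = features[d] - codebook[k][d];
          dist += diff * diff;
        }
        if (dist < bestDist) {
          bestDist = dist;
          best = k;
        }
      }
      return best;
    }
  }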

On Sat, Nov 21, 2009 at 6:21 PM, Patterson, Josh <[email protected]> wrote:

> Prasen,
> Well, I'm not entirely sure how I'm going to do it right now; it's going to
> come down to trial and error with multiple approaches. There are many
> obstacles to overcome, including the ones you are speaking of, such as:
>
> - timeseries vector representation
> - timeseries shifts, scaling
> - for clustering, need a heuristic for distance metric
>
> I have not heard of suffix-tree based clustering, but it definitely sounds
> interesting and I'll check that one out. I suggested k-means initially as
> it's a very basic and well-known clustering technique. I need to review my
> older notes for approaches to some of these problems and review more papers
> dealing with timeseries data in general. My initial thought is to decompose
> a block of timeseries data into a set of time-ordered "features" in order to
> reduce the effect of scaling issues and alignment effects. That would make
> it easier to calculate a delta between vectors, but it might also create
> unintended approximation errors.
>
> As I get more Mahout code running, I'll probably post relevant research
> papers dealing with time series data that I'm looking at. That way anyone
> can chip in their opinion of the approach and we can build more example code
> for dealing with timeseries in Mahout.
>
> I'll take a look at the suffix-tree technique, and I look forward to any
> other suggestions you might have in terms of an approach.
>
> Josh Patterson
> TVA
>
>
> -----Original Message-----
> From: [email protected] on behalf of prasenjit mukherjee
> Sent: Sat 11/21/2009 12:20 AM
> To: [email protected]
> Subject: Re: mahout examples
>
> Hi Josh,
>
> I too am working on clustering time-series data, and am basically trying
> to come up with a sequence clustering model. I would like to know how
> you intend to use k-means to achieve that.  Are you treating each
> sequence as a point?  If so, what would be your vector representation
> of a sequence and, more importantly, which metric (distance
> computation logic) will you be using?
>
> BTW, I am thinking along the lines of STC (suffix-tree-based clustering).
>
> -Prasen
>
> On Sat, Nov 21, 2009 at 1:26 AM, Patterson, Josh <[email protected]>
> wrote:
> > I think in terms of clustering time series data, the first step looks to
> > be vectorizing the input cases, possibly with the DenseVector class, and
> > feeding that to a basic k-means implementation like KMeansDriver.java.
> > Once we can get the basic k-means rolling with some known dataset, we'll
> > be able to iterate on that and move towards using more complex
> > techniques and other grid timeseries data. Any suggestions or discussion
> > would be greatly appreciated,
> >
> > Josh Patterson
> > TVA
>
>


-- 
Ted Dunning, CTO
DeepDyve
