If you don't care about ordering, then this is pretty easy to do.  Sequences
are the documents, items are terms.  From there you can do some sort of
latent variable method and, with the appropriate latent variables, cluster
directly in latent variable space.  With SVD and LDA, this is fairly trivial
to do.
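
As a rough illustration of the order-free approach, here is a minimal sketch in Python with NumPy. It is not Mahout code. The toy sequences, the helper names (`bag_of_items`, `latent_coordinates`, `kmeans`), the choice of rank 2, and the use of plain k-means in the latent space are all assumptions made for the example. It uses SVD only; LDA would slot into the same place.

```python
import numpy as np

def bag_of_items(sequences):
    """Build a sequence-by-item count matrix, ignoring item order."""
    vocab = sorted({item for seq in sequences for item in seq})
    index = {item: j for j, item in enumerate(vocab)}
    counts = np.zeros((len(sequences), len(vocab)))
    for i, seq in enumerate(sequences):
        for item in seq:
            counts[i, index[item]] += 1
    return counts, vocab

def latent_coordinates(counts, rank):
    """Project each sequence onto the top `rank` singular directions."""
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    return u[:, :rank] * s[:rank]

def kmeans(points, k, iterations=20):
    """Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(iterations):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels

# Hypothetical toy data: two groups of sequences over disjoint item sets.
sequences = [
    ["a", "b", "a", "c"],
    ["b", "a", "a"],
    ["x", "y", "z", "x"],
    ["y", "x", "z"],
]
counts, vocab = bag_of_items(sequences)
labels = kmeans(latent_coordinates(counts, rank=2), k=2)
```

On this toy data the first two sequences should land in one cluster and the last two in the other. On real data the rank and the number of clusters would need tuning.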

If you want to preserve some ordering information, then you have a bit more
of a problem.  The same basic idea can work where you model your data as a
mixture density over sequence models.  Once you do that, then the mixture
parameters make a reasonable space to cluster in.  If you have some kind of
sequence model then the Dirichlet process code currently in Mahout can be
used to do your clustering.
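
To make the mixture idea concrete, here is a minimal sketch in Python with NumPy. It assumes a first-order Markov chain as the sequence model, which is only a stand-in for whatever model fits your data, and it fits a fixed number of components with EM rather than using a Dirichlet process. It is not the Mahout code. The function names, the smoothing constant, and the toy sequences are hypothetical. The per-sequence responsibilities it returns are one way to get mixture coordinates to cluster in.

```python
import numpy as np

def transition_counts(seq, index, n):
    """Count first-order transitions i -> j within one sequence."""
    counts = np.zeros((n, n))
    for a, b in zip(seq, seq[1:]):
        counts[index[a], index[b]] += 1
    return counts

def seeded_responsibilities(counts, k):
    """Deterministic start: pin k far-apart sequences to distinct components."""
    flat = counts.reshape(len(counts), -1)
    seeds = [0]
    for _ in range(1, k):
        dist = np.min([((flat - flat[s]) ** 2).sum(axis=1) for s in seeds], axis=0)
        seeds.append(int(np.argmax(dist)))
    resp = np.full((len(counts), k), 1.0 / k)
    for c, s in enumerate(seeds):
        resp[s] = np.eye(k)[c]
    return resp

def fit_markov_mixture(sequences, k, iterations=50, smoothing=0.1):
    """EM for a mixture of k first-order Markov chains.

    Returns an (n_sequences, k) matrix of responsibilities. The initial-state
    distribution is ignored; only transitions are modelled.
    """
    vocab = sorted({item for seq in sequences for item in seq})
    index = {item: j for j, item in enumerate(vocab)}
    counts = np.array([transition_counts(s, index, len(vocab)) for s in sequences])
    resp = seeded_responsibilities(counts, k)
    for _ in range(iterations):
        # M-step: mixture weights and smoothed transition matrices.
        weights = resp.mean(axis=0)
        trans = np.einsum("sk,sij->kij", resp, counts) + smoothing
        trans /= trans.sum(axis=2, keepdims=True)
        # E-step: posterior probability of each component for each sequence.
        log_post = np.einsum("sij,kij->sk", counts, np.log(trans))
        log_post += np.log(weights + 1e-12)
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
    return resp

# Hypothetical toy data: every sequence has three a's and three b's, so a
# bag-of-items view cannot separate them; only the ordering differs.
sequences = [
    ["a", "b", "a", "b", "a", "b"],
    ["b", "a", "b", "a", "b", "a"],
    ["a", "a", "a", "b", "b", "b"],
    ["b", "b", "b", "a", "a", "a"],
]
resp = fit_markov_mixture(sequences, k=2)
labels = resp.argmax(axis=1)
```

Here the alternating sequences should end up in one component and the run-like sequences in the other, even though their item counts are identical. With real data you would cluster on `resp` (or feed a proper sequence model to the Dirichlet process code) rather than just taking the argmax.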

There is probably one "if" too many in the previous paragraph for you to be
happy with it.

Can you say something more about your sequences?  Can you say something
about your resources?  Do you have a good sequence model?

On Wed, Nov 18, 2009 at 4:03 AM, prasenjit mukherjee
<[email protected]> wrote:

> Can we model the sequence clustering problem into a traditional
> term-doc clustering ?
>
> One approach I can think of is creating a self-similarity matrix
> between the sequences and then running a traditional clustering algo (
> spectral or k-means ). That seems to be too expensive though.
>
> Any suggestions ?
>
> Thanks,
> -Prasen
>
> On Wed, Nov 11, 2009 at 3:53 PM, Isabel Drost <[email protected]> wrote:
> > On Sat prasenjit mukherjee <[email protected]> wrote:
> >
> >> I was thinking of using a semi-supervised ( unsupervised will be even
> >> better ) sequence clustering technique ( like CRF, HMM etc. ) Just
> >> curious, any work been done ( or discussed ) in this mailing list to
> >> perform sequence clustering using temporal data.
> >
> > So far none that I am aware of. There were a few discussions on HMMs
> > early on, but I am not sure what came out of that.
> >
> > Isabel
> >
>



-- 
Ted Dunning, CTO
DeepDyve
