If you don't care about ordering, then this is pretty easy to do. Treat each sequence as a document and each item as a term. From there you can fit a latent variable model and, with appropriate latent variables, cluster directly in the latent variable space. With SVD or LDA, this is fairly trivial to do.
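As a rough illustration of the order-free approach, here is a minimal Python sketch. It assumes numpy is available; the helper names, the toy sequences, and the choice of plain k-means on truncated-SVD coordinates are illustrative assumptions, not Mahout code.

```python
import numpy as np

def count_matrix(sequences):
    """Bag-of-items counts: one row per sequence, one column per distinct item."""
    vocab = sorted({item for seq in sequences for item in seq})
    index = {item: j for j, item in enumerate(vocab)}
    counts = np.zeros((len(sequences), len(vocab)))
    for i, seq in enumerate(sequences):
        for item in seq:
            counts[i, index[item]] += 1
    return counts, vocab

def latent_coordinates(counts, rank):
    """Project each row onto the top `rank` singular directions (truncated SVD)."""
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    return u[:, :rank] * s[:rank]

def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Distance from every point to its nearest already-chosen center.
        dist = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(iterations):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for c in range(k):
            if np.any(labels == c):  # leave a center alone if its cluster is empty
                centers[c] = points[labels == c].mean(axis=0)
    return labels

def cluster_sequences(sequences, rank=2, k=2):
    """Ignore ordering, embed sequences with SVD, cluster in the latent space."""
    counts, _ = count_matrix(sequences)
    return kmeans(latent_coordinates(counts, rank), k)

# Toy data (hypothetical): two groups of sequences over disjoint item sets.
sequences = [
    ["a", "b", "a", "c"],
    ["b", "a", "c", "a"],
    ["a", "c", "b"],
    ["x", "y", "z", "x"],
    ["y", "x", "z"],
    ["z", "z", "x", "y"],
]
labels = cluster_sequences(sequences)
print(labels)
```

On real data you would probably want tf-idf style weighting before the SVD and a sparse, truncated solver instead of a dense decomposition, but the shape of the computation is the same.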
If you want to preserve some ordering information, then you have a bit more of a problem. The same basic idea can work if you model your data as a mixture density over sequence models. Once you do that, the mixture parameters make a reasonable space to cluster in. If you have some kind of sequence model, then the Dirichlet process code currently in Mahout can be used to do your clustering.

There are probably a few too many ifs in the previous paragraph for you to be happy with it. Can you say something more about your sequences? Can you say something about your resources? Do you have a good sequence model?

On Wed, Nov 18, 2009 at 4:03 AM, prasenjit mukherjee <[email protected]> wrote:

> Can we model the sequence clustering problem into a traditional
> term-doc clustering ?
>
> One approach I can think of is creating a self-similarity matrix
> between the sequences and then running a traditional clustering algo (
> spectral or k-means ). That seems to be too expensive though.
>
> Any suggestions ?
>
> Thanks,
> -Prasen
>
> On Wed, Nov 11, 2009 at 3:53 PM, Isabel Drost <[email protected]> wrote:
> > On Sat prasenjit mukherjee <[email protected]> wrote:
> >
> >> I was thinking of using a semi-supervised ( unsupervised will be even
> >> better ) sequence clustering technique ( like CRF, HMM etc. ) Just
> >> curious, any work been done ( or discussed ) in this mailing list to
> >> perform sequence clustering using temporal data.
> >
> > So far none that I am aware of. There were a few discussions on HMMs
> > early on, but I am not sure what came out of that.
> >
> > Isabel
> >
>

--
Ted Dunning, CTO
DeepDyve
