Please keep in mind that the approach I am suggesting here is untried on your kind of data. I have used it on text and transactional streams, but who knows whether it will generalize.
If we take your example of "a b a a c" and assume that "a a" and "a b" have been observed to be interesting phrases, we would reduce your original sequence to "ab aa c". At that point, we assume that all interesting temporality is captured in the ab and aa. Almost by definition, we don't need anything else because, in the previous analysis, other phrases seemed to occur without predictive value. Thus, we can switch to a bag-of-terms format at this point without loss of (important) information. From there, we are only concerned with the sequence x phrase occurrence matrix and can proceed with SVD or random indexing or whatever else we want.

There is a serious question in this approach about what to do with overlapping phrases. My two answers are either to (a) keep both, or (b) pick the tiling that gives us the most interesting sequence of tokens. Approach (b) is basically the same algorithm that is used to segment Chinese; see http://technology.chtsai.org/mmseg/ for a great example.

Another class of techniques entirely, which deals a bit better with the problem of (slightly) longer-range dependencies, is some variant of random indexing. Here, you start with a random vector for each symbol and then move each symbol's vector a skosh in the direction of the symbols it co-occurs with. If you use separate forward- and backward-looking vectors, you get strong directionality and sequencing. The vector of a sequence is then just the sum (perhaps weighted by rarity) of the trained vectors for each term. The size of the co-occurrence window is an important parameter here.

Surprisingly, these techniques have some pretty deep connections at the mathematical level. The first approach plus a random basis projection or SVD is very comparable to the second approach, except that the second handles dependencies that skip over intervening symbols more directly. In either case, you lose some sequence information. This is, in fact, exactly what gives you a nice metric over similar sequences.
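To make the tiling idea a bit more concrete, here is a rough, untested sketch in Python. The greedy longest-match tiling and the hard-coded phrase list are just assumptions for illustration; option (a), keeping overlapping phrases, would need a different tile function:

    # Tile a symbol sequence with known "interesting" phrases, then count
    # phrase occurrences so each sequence becomes a bag-of-phrases row.
    from collections import Counter

    PHRASES = {("a", "b"), ("a", "a")}   # assumed output of a prior phrase-finding pass
    MAX_LEN = 2                          # longest phrase we know about

    def tile(symbols):
        """Greedy longest-match tiling; overlaps are resolved left to right."""
        tokens, i = [], 0
        while i < len(symbols):
            for n in range(min(MAX_LEN, len(symbols) - i), 1, -1):
                candidate = tuple(symbols[i:i + n])
                if candidate in PHRASES:
                    tokens.append("".join(candidate))   # ("a", "b") -> "ab"
                    i += n
                    break
            else:
                tokens.append(symbols[i])               # fall back to the bare symbol
                i += 1
        return tokens

    print(tile("a b a a c".split()))   # -> ['ab', 'aa', 'c']

    # Each sequence becomes a bag of phrase counts; stacking these as rows gives
    # the sequence x phrase matrix for SVD or a random projection.
    rows = [Counter(tile(seq.split())) for seq in ["a b a a c", "a a b c"]]

And a similarly rough sketch of the random indexing variant with separate forward- and backward-looking context vectors. The dimension, window size, and learning rate are made-up numbers, and the rarity weighting is left out to keep it short:

    import numpy as np

    DIM, WINDOW, RATE = 50, 2, 0.1
    rng = np.random.default_rng(42)

    def train_vectors(sequences):
        symbols = {s for seq in sequences for s in seq}
        index = {s: rng.standard_normal(DIM) for s in symbols}  # fixed random vectors
        fwd = {s: np.zeros(DIM) for s in symbols}                # trained context vectors
        bwd = {s: np.zeros(DIM) for s in symbols}
        for seq in sequences:
            for i, s in enumerate(seq):
                for j in range(max(0, i - WINDOW), i):                   # symbols before s
                    bwd[s] += RATE * index[seq[j]]
                for j in range(i + 1, min(len(seq), i + WINDOW + 1)):    # symbols after s
                    fwd[s] += RATE * index[seq[j]]
        return fwd, bwd

    def sequence_vector(seq, fwd, bwd):
        # the vector of a sequence is just the sum of its symbols' trained vectors
        return sum(np.concatenate([fwd[s], bwd[s]]) for s in seq)

    fwd, bwd = train_vectors(["a b a a c".split(), "a a b c".split()])
    v = sequence_vector("a b a a c".split(), fwd, bwd)

Cosine distance between such sequence vectors is one reasonable choice of metric, and either representation drops pretty much straight into k-means.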
On Sat, Nov 21, 2009 at 10:00 PM, prasenjit mukherjee <[email protected]> wrote:

> On Sun, Nov 22, 2009 at 9:54 AM, Ted Dunning <[email protected]> wrote:
> > Expressing your symbolic sequences by tiling these phrases gives you much of
> > the temporality that you are interested in and lets you use algorithms like
> > k-means pretty much directly.
>
> Approach sounds interesting. Can you explain a bit on how you intend
> to represent a sequence as a vector here? Assuming the sequence is "a b a a c",
> I was thinking of the following 2 approaches:
>
> If I use symbols as my basis and the coefficients as time-slices, then
> I would lose the information of recurring symbols (symbol a in my
> example). e.g. vector representation of "a b a a c": 1(a) + 2(b) + 5(c)
> (problem: how to incorporate 3a, 4a)
>
> On the other hand, if I use time-slices as my basis and some mapping of
> terms as the coefficients, then my simple euclidean measure won't make
> any sense. e.g. let a->1, b->2, c->3; then the vector representation of
> "a b a a c" is: 1(t1) + 2(t2) + 1(t3) + 1(t4) + 3(t5)
>
> -Prasen
>
> > If you don't have symbolic sequences, you have another problem, but you
> > might get similar results by doing vector quantization on your continuous
> > time-series expressed in terms of multi-scale localized spectral detectors.
> > Some problems work well with those techniques, some definitely need more
> > interesting feature detectors. The spectral processing and vector
> > quantization are fairly natural tasks for map-reduce, which is nice. In
> > fact, vector quantization is commonly done with some variant on k-means.

--
Ted Dunning, CTO
DeepDyve
