On Fri, Jan 8, 2010 at 12:06 AM, Robin Anil <[email protected]> wrote:
> I like the formulation that Drew made, using n-1 grams to generate n-grams.

I think Ted first mentioned n-1 grams, and I ran with it. It is very useful to think about the problem this way. One question about the concept of n-1 grams, however. When n is 3, for example, are we really interested in the collocation of bigrams, or are we interested in non-overlapping tokens? For example, given the trigram 'click and clack', should we be looking at 'click and' and 'and clack', or should we be analyzing 'click' and 'and clack', or 'click and' and 'clack'? I suspect it is the first form because that extends easily to values larger than 3, but it's worth confirming.
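
To make the two readings concrete, here is a minimal Python sketch. The function names are hypothetical and purely illustrative, not taken from any existing code; it only enumerates the candidate subgrams and does not score them.

```python
def overlapping_subgrams(tokens):
    """First reading: the two overlapping (n-1)-grams of an n-gram.

    One drops the last token, the other drops the first token.
    """
    if len(tokens) < 2:
        raise ValueError("need at least two tokens")
    return [tuple(tokens[:-1]), tuple(tokens[1:])]


def head_tail_splits(tokens):
    """Second reading: non-overlapping splits of an n-gram.

    Each split pairs a single token with the remaining (n-1)-gram.
    """
    if len(tokens) < 2:
        raise ValueError("need at least two tokens")
    return [
        (tuple(tokens[:1]), tuple(tokens[1:])),    # 'click' + 'and clack'
        (tuple(tokens[:-1]), tuple(tokens[-1:])),  # 'click and' + 'clack'
    ]


trigram = "click and clack".split()
print(overlapping_subgrams(trigram))
print(head_tail_splits(trigram))
```

For n = 4, say 'a b c d', the first function yields 'a b c' and 'b c d', which is the sense in which the first form carries over to larger n.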
