The idea comes from Markov language models, which model the probability of a sequence of words as the product of conditional probabilities, each predicting the next word from the previous n-1 words of context. The real virtue of this kind of model is that it provides a very simple and tractable way to combine information from overlapping n-grams.
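To make that concrete, the factorization is just the ordinary Markov chain rule (nothing here is specific to the significance test yet):

  P(w_1 ... w_N)  ~  prod_i P(w_i | w_{i-n+1} ... w_{i-1})

so a trigram model (n = 3), for instance, scores "the cat sat" as roughly P(the) * P(cat | the) * P(sat | the cat).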
Significant n-grams are then computed by comparing predictions from an order n-1 model (whose estimation involves n-gram and (n-1)-gram counts) with predictions from an order n-2 model (whose estimation involves (n-1)-gram and (n-2)-gram counts). The overall comparison can be broken down into smaller comparisons that look at whether particular words can be predicted significantly better with more leading context. These smaller comparisons can be done using the familiar 2x2 contingency table analysis via the LLR statistic already in wide use for finding interesting bigrams (a rough sketch of one such test is in the P.S. below). Even though the mathematical form of such a model looks particularly limited because it involves only left context, there is considerably more generality present than there appears to be.

In many applications, finding interesting n-grams is not necessary, since whatever learning algorithm you are using may be able to sort out which features (terms, bigrams or n-grams) are useful and which are not. That approach works well with words in text (10^5 - 10^6 features) and reasonably well with bigrams (10^10 - 10^12 possible features), but as you move into trigrams and beyond, the curse of dimensionality can become serious. Some algorithms, such as the specialist algorithms used by Vowpal Wabbit or confidence-weighted learning, depend more on the average number of non-zero features d than on the ultimate potential number of features D. That average number of non-trivial features is multiplied, however, by the order of the n-grams in use, which can lead to serious dimensionality problems.

Moreover, limiting the number of longer phrases used is a useful way of doing semi-supervised learning. A very large unmarked corpus can be used to generate interesting phrasal features, which are then used for supervised learning on a much smaller marked corpus. Because the number of interesting phrases is strictly limited (possibly to a smaller number than the number of primitive terms), the average number of non-zero features need not be much larger than the number of raw terms, and the power of the supervised learning is enhanced by the phrasal features with limited deleterious effect due to dimensionality.

Does that make the reasoning more clear?

On Fri, Jan 8, 2010 at 9:15 PM, Otis Gospodnetic <[email protected]> wrote:

> I think I missed this. Could you please explain the n-1 gram thinking and
> why that is better than thinking about n-grams as n-grams?

--
Ted Dunning, CTO DeepDyve
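P.S. Here is the rough sketch I mentioned above. It is untested Python, and the function and argument names are just for illustration; the counts would come from whatever n-gram tables you already have. The LLR computation is the standard G^2 statistic for a 2x2 table; the way the table is laid out is my reading of the word-level comparison, not code from any particular library.

  import math

  def xlogx(x):
      # x * log(x), with the conventional 0 * log(0) = 0
      return 0.0 if x == 0 else x * math.log(x)

  def llr_2x2(k11, k12, k21, k22):
      # G^2 for a 2x2 contingency table:
      # 2 * (sum x log x over cells - over row sums - over column sums + N log N)
      return 2.0 * (xlogx(k11) + xlogx(k12) + xlogx(k21) + xlogx(k22)
                    - xlogx(k11 + k12) - xlogx(k21 + k22)
                    - xlogx(k11 + k21) - xlogx(k12 + k22)
                    + xlogx(k11 + k12 + k21 + k22))

  def context_gain(c_context_w, c_context, c_suffix_w, c_suffix):
      # Does word w follow the full (n-1)-word context significantly more
      # often than it follows the (n-2)-word suffix of that context?
      #   c_context_w = count(context + w)   (an n-gram count)
      #   c_context   = count(context)       (an (n-1)-gram count)
      #   c_suffix_w  = count(suffix + w)    (an (n-1)-gram count)
      #   c_suffix    = count(suffix)        (an (n-2)-gram count)
      k11 = c_context_w                  # full context, followed by w
      k12 = c_context - c_context_w      # full context, followed by something else
      k21 = c_suffix_w - c_context_w     # suffix without this context, then w
      k22 = c_suffix - c_context - k21   # suffix without this context, then not w
      return llr_2x2(k11, k12, k21, k22)

A large score for a particular (context, w) pair would be the signal that the corresponding n-gram is worth keeping as a phrasal feature.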
