The idea comes from Markov language models, which model the probability of a
sequence of words as the product of conditional probabilities for each word
given the previous n-1 words of context.  The real virtue of this kind of
model is that it provides a very simple and tractable way to combine
information from overlapping n-grams.
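
To make that concrete, here is a minimal sketch (the names are mine, purely
illustrative) of how an order-2 model scores a sequence as a product of
conditional probabilities estimated from trigram and bigram counts:

import math
from collections import Counter

def train_counts(tokens, n=3):
    # Count n-grams and their (n-1)-gram contexts from a token list.
    ngrams = Counter(zip(*[tokens[i:] for i in range(n)]))
    contexts = Counter(zip(*[tokens[i:] for i in range(n - 1)]))
    return ngrams, contexts

def sequence_logprob(tokens, ngrams, contexts, n=3):
    # Sum of log P(w_i | previous n-1 words), i.e. the product in log space.
    # Unsmoothed maximum likelihood estimates; a real model needs smoothing
    # to handle n-grams that were never seen in training.
    logp = 0.0
    for i in range(n - 1, len(tokens)):
        gram = tuple(tokens[i - n + 1:i + 1])
        logp += math.log(ngrams[gram] / contexts[gram[:-1]])
    return logp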

Significant n-grams are then computed by comparing predictions from an order
n-1 model (which involves n-gram and (n-1)-gram counts for estimation) and
an order n-2 model (which involves (n-1)-gram and (n-2)-gram counts).  The
overall comparison can be broken down into smaller comparisons, each of
which asks whether a particular word can be predicted significantly better
with more leading context.  These smaller comparisons can be done with the
familiar 2x2 contingency table analysis via the LLR statistic that is
already in wide use for finding interesting bigrams.
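
As a sketch of one such comparison (the function names and count arguments
are mine, just for illustration): given counts for the trigram "a b w", the
bigrams "a b" and "b w", and the context "b", the 2x2 table asks whether w
follows the longer context "a b" more often than it follows "b" in general,
and the LLR statistic scores how surprising the difference is:

import math

def llr_2x2(k11, k12, k21, k22):
    # G^2 log-likelihood ratio for a 2x2 contingency table:
    # 2 * sum(observed * ln(observed / expected)); zero cells contribute 0.
    total = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    def term(obs, row, col):
        return obs * math.log(obs * total / (row * col)) if obs > 0 else 0.0
    return 2.0 * (term(k11, row1, col1) + term(k12, row1, col2) +
                  term(k21, row2, col1) + term(k22, row2, col2))

def context_gain_llr(c_abw, c_bw, c_ab, c_b):
    # Rows: the word before "b" is "a" or not; columns: the next word is "w"
    # or not.  All counts are over positions immediately preceded by "b".
    k11 = c_abw                      # w after the longer context "a b"
    k12 = c_bw - c_abw               # w after "b" with some other predecessor
    k21 = c_ab - c_abw               # "a b" followed by something other than w
    k22 = c_b - c_bw - c_ab + c_abw  # everything else following "b"
    return llr_2x2(k11, k12, k21, k22)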

Even though the mathematical form of such a model looks quite limited
because it involves only left context, there is considerably more generality
here than first appears.

In many applications, finding interesting n-grams is not strictly necessary,
since whatever learning algorithm you are using may be able to sort out on
its own which features (terms, bigrams or n-grams) are useful and which are
not.  This works well with words in text (10^5 - 10^6 features) and
reasonably well with bigrams (10^10 - 10^12 possible features), but as you
move into trigrams and beyond, the curse of dimensionality can become
serious.  Some algorithms, such as the specialist algorithms used by Vowpal
Wabbit or confidence weighted learning, depend more on the average number of
non-zero features per example d than on the ultimate potential number of
features D.  That average number of non-zero features is multiplied,
however, by the order of the n-grams in use, which can lead to serious
dimensionality problems.
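
A toy example of where that multiplication comes from (again, just an
illustrative sketch): a document of m tokens contributes roughly m non-zero
unigram features, and adding bigrams and trigrams roughly doubles and
triples that.

def ngram_features(tokens, max_n):
    # All n-grams up to max_n.  Each order adds about len(tokens) more
    # non-zero features per document, so the per-document feature count
    # grows roughly as max_n * len(tokens) even though the potential
    # vocabulary grows much faster with n.
    feats = []
    for n in range(1, max_n + 1):
        feats.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return feats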

Moreover, limiting the number of longer phrases used is a useful way of
doing semi-supervised learning.  A very large unmarked corpus can be used to
generate interesting phrasal features, which are then used for supervised
learning on a much smaller marked corpus.  Because the number of interesting
phrases is strictly limited (possibly to fewer than the number of primitive
terms), the average number of non-zero features per example is not much
larger than with raw terms alone, and the power of the supervised learning
is enhanced by the phrasal features with limited deleterious effect from
dimensionality.
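
In pipeline form that might look roughly like this (the threshold, cap, and
names are illustrative assumptions, not a prescription): score candidate
phrases on the large unmarked corpus with the LLR test above, keep only the
high-scoring ones, and then featurize the small marked corpus with unigrams
plus just those phrases:

def select_phrases(candidate_llr, threshold=50.0, max_phrases=100000):
    # candidate_llr: dict mapping a phrase (tuple of words) to its LLR score,
    # computed from the unmarked corpus.  Keep only high-scoring phrases and
    # cap the total so the phrase vocabulary stays comparable in size to the
    # unigram vocabulary.
    scored = sorted(candidate_llr.items(), key=lambda kv: -kv[1])
    return {phrase for phrase, score in scored[:max_phrases] if score >= threshold}

def featurize(tokens, selected_phrases, max_n=3):
    # Unigrams plus only the pre-selected longer phrases.
    feats = list(tokens)
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if gram in selected_phrases:
                feats.append(" ".join(gram))
    return feats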

Does that make the reasoning more clear?

On Fri, Jan 8, 2010 at 9:15 PM, Otis Gospodnetic <[email protected]> wrote:

> I think I missed this.  Could you please explain the n-1 gram thinking and
> why that is better than thinking about n-grams as n-grams?




-- 
Ted Dunning, CTO
DeepDyve
