Thanks for the detailed explanation, Ted. In light of the first case, I will
provide a parameter that controls the n-gram size and calculate the LLR
values based on the occurrences of the leading (n-1)-gram and the following
token.
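
A minimal sketch of that counting step, assuming the text is already
tokenized. The function name head_tail_table and the sample sentence are
hypothetical, and this is only one possible reading of the plan: it counts
every window of n tokens, split into its leading (n-1)-gram and its last
token.

```python
from collections import Counter

def head_tail_table(tokens, n, head, tail):
    """2x2 counts for an n-gram split into a leading (n-1)-gram and a last token.

    Returns (k11, k12, k21, k22):
      k11: head followed by tail         k12: head followed by another token
      k21: another head, then tail       k22: neither head nor tail
    """
    head = tuple(head)
    if n < 2 or len(head) != n - 1:
        raise ValueError("head must contain exactly n-1 tokens, with n >= 2")
    counts = Counter()
    # Slide a window of n tokens over the text and classify each window.
    for i in range(len(tokens) - n + 1):
        window = tuple(tokens[i:i + n])
        counts[(window[:-1] == head, window[-1] == tail)] += 1
    return (counts[(True, True)], counts[(True, False)],
            counts[(False, True)], counts[(False, False)])

# Hypothetical sample text, n = 3: head is ("it", "was"), tail is "the".
tokens = "the best of times it was the worst of times it was the best".split()
table = head_tail_table(tokens, 3, ("it", "was"), "the")
print(table)
```

The four counts can then be passed to whatever LLR routine is in use.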

The second case is pretty interesting too. It would be nice to have
something like this in Mahout as well. Perhaps it would be useful for
automatically evaluating clustering output, for example. It sounds like it
would be better achieved in a separate MapReduce implementation.

On Jan 9, 2010 3:46 AM, "Ted Dunning" <[email protected]> wrote:

There are a couple of ways to handle this.

One is to view the text as a limited horizon Markov process and look for
exceptions.  Thus, we might build a bigram language model and look for cases
where trigrams would do better.  That implies we would be looking for cases
where "clack" occurs after "click and" anomalously more than would be
expected from the number of times "clack" appears after "and".  This comes
down to comparing the counts of "clack" and all other words in the context
of "click and" versus "anything-but-click and".  Since "clack" is probably a
small fraction of the words that appear in the second context, but exhibits
an overwhelming overabundance in the context of "click and", we would
conclude that "click and clack" is an important trigram.  The contingency
table is

                            clack    -clack
            click, and       k11      k12
            -click, and      k21      k22

Theoretically speaking, this test is part of a likelihood ratio test that
compares a Markov model against a restricted form of the same Markov model
and is an extension of the simpler test for interesting binomials.
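
As a rough illustration of the table above, the sketch below counts k11
through k22 for one trigram and scores it with the usual G^2 log-likelihood
ratio, written in its entropy form. The function names and the toy token
stream are assumptions made for the example, not an existing API.

```python
import math

def x_log_x(x):
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    """Unnormalized entropy: N*log(N) minus the sum of k*log(k)."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio statistic for a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))  # clamp tiny negative rounding error

def trigram_table(tokens, w1, w2, w3):
    """Counts of w3 after 'w1 w2' versus after 'anything-but-w1 w2'."""
    k11 = k12 = k21 = k22 = 0
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        if b != w2:
            continue  # only trigrams whose middle word is w2 are in scope
        if a == w1 and c == w3:
            k11 += 1
        elif a == w1:
            k12 += 1
        elif c == w3:
            k21 += 1
        else:
            k22 += 1
    return k11, k12, k21, k22

# Toy data (hypothetical): three phrases, each repeated five times.
tokens = ("click and clack " * 5 + "salt and pepper " * 5
          + "this and that " * 5).split()
table = trigram_table(tokens, "click", "and", "clack")
print(table, llr(*table))
```

A large score means "clack" follows "click and" far more often than it
follows "and" in other contexts; a score near zero means the extra word of
context adds nothing.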

A second approach is to consider all overlapping n-grams that are in or out
of some context like a known category, or a cluster or a data source.  Then
we can do a normal LLR test to find items that are over-represented in some
category, cluster or whatever.   The size of these things doesn't actually
matter all that much.   This technique can be quick because you handle all
lengths of n-grams at the same time as opposed to building things up bit by
bit.   It is limited by the availability of categories that form reasonable
comparison sets.
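
A minimal sketch of this second approach, assuming documents arrive as
token lists already split into "inside" and "outside" the category. The
names ngrams and overrepresented, the max_n parameter, and the toy documents
are hypothetical, and pooling all n-gram lengths into a single total is a
simplification chosen for brevity.

```python
import math
from collections import Counter

def x_log_x(x):
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio statistic for a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    return max(0.0, 2.0 * (row + col - entropy(k11, k12, k21, k22)))

def ngrams(tokens, max_n):
    """Yield every overlapping n-gram of length 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def overrepresented(in_docs, out_docs, max_n=3):
    """Score n-grams that are relatively more frequent inside the category."""
    inside = Counter(g for d in in_docs for g in ngrams(d, max_n))
    outside = Counter(g for d in out_docs for g in ngrams(d, max_n))
    n_in, n_out = sum(inside.values()), sum(outside.values())
    scores = {}
    for gram, k11 in inside.items():
        k12 = outside[gram]          # same n-gram, outside the category
        k21 = n_in - k11             # other n-grams, inside
        k22 = n_out - k12            # other n-grams, outside
        if k11 * n_out > k12 * n_in:  # keep only over-represented items
            scores[gram] = llr(k11, k12, k21, k22)
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy data (hypothetical): two documents in the category, two outside it.
in_docs = ["click and clack talk cars".split(),
           "click and clack fix cars".split()]
out_docs = ["salt and pepper".split(), "this and that".split()]
for gram, score in overrepresented(in_docs, out_docs)[:5]:
    print(" ".join(gram), round(score, 3))
```

All lengths are scored in one pass, so nothing has to be built up from
bigrams to trigrams and beyond.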

On Fri, Jan 8, 2010 at 5:13 PM, Drew Farris <[email protected]> wrote: >
On Fri, Jan 8, 2010 a...
--
Ted Dunning, CTO
DeepDyve
