Thanks for the detailed explanation, Ted. In light of the first case, I will provide a parameter that controls the n-gram size, and calculate the LLR values from the occurrences of the leading (n-1)-gram and the token that follows it.
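A rough sketch of that counting step, in plain Python rather than as an M/R job. The function name `head_tail_tables` and the in-memory token list are assumptions made for illustration, not anything that exists in Mahout. Each n-gram is split into its leading (n-1)-gram (the head) and its final token (the tail), and the four cells follow the k11..k22 labelling used in the quoted message below.

```python
from collections import Counter


def head_tail_tables(tokens, n):
    """Map (head, tail) -> (k11, k12, k21, k22) for every n-gram in tokens.

    head is the leading (n-1)-gram, tail is the final token.
    Hypothetical helper for illustration only.
    """
    if n < 2:
        raise ValueError("n must be at least 2")

    # All overlapping n-grams in the token stream.
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    total = len(ngrams)

    ngram_counts = Counter(ngrams)
    head_counts = Counter(g[:-1] for g in ngrams)
    tail_counts = Counter(g[-1] for g in ngrams)

    tables = {}
    for gram, k11 in ngram_counts.items():
        head, tail = gram[:-1], gram[-1]
        k12 = head_counts[head] - k11   # head followed by some other token
        k21 = tail_counts[tail] - k11   # tail preceded by some other head
        k22 = total - k11 - k12 - k21   # neither head nor tail
        tables[(head, tail)] = (k11, k12, k21, k22)
    return tables


tokens = "click and clack and click and clack".split()
tables = head_tail_tables(tokens, 3)
```

Note that this compares the head against every other head in the corpus. It does not restrict the comparison to contexts ending in the same word, which is what the first approach in the quoted message does.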
The second case is pretty interesting too. It would be nice to have something like this in Mahout as well; it might be useful for automatically evaluating clustering output, for example. It sounds like it would be better implemented as a separate M/R job.

On Jan 9, 2010 3:46 AM, "Ted Dunning" <[email protected]> wrote:

There are a couple of ways to handle this.

One is to view the text as a limited-horizon Markov process and look for exceptions. Thus, we might build a bigram language model and look for cases where trigrams would do better. That implies we would be looking for cases where "clack" occurs after "click and" anomalously more often than would be expected from the number of times "clack" appears after "and". This comes down to comparing the counts of "clack" and all other words in the context of "click and" versus "anything-but-click and". Since "clack" is probably a small fraction of the words that appear in the second context, but exhibits an overwhelming overabundance in the context of "click and", we would conclude that "click and clack" is an important trigram.

The contingency table is

                   clack    -clack
    click, and     k11      k12
    -click, and    k21      k22

Theoretically speaking, this test is part of a likelihood ratio test that compares a Markov model against a restricted form of the same Markov model, and it is an extension of the simpler test for interesting binomials.

A second approach is to consider all overlapping n-grams that are in or out of some context, such as a known category, a cluster, or a data source. Then we can do a normal LLR test to find items that are over-represented in some category, cluster, or whatever. The size of these things doesn't actually matter all that much. This technique can be quick because you handle all lengths of n-grams at the same time, as opposed to building things up bit by bit. It is limited by the availability of categories that form reasonable comparison sets.

On Fri, Jan 8, 2010 at 5:13 PM, Drew Farris <[email protected]> wrote:

> On Fri, Jan 8, 2010 a...

--
Ted Dunning, CTO
DeepDyve
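
For reference, the "click and clack" test described in the first approach can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the thread or from Mahout: the names `trigram_table` and `llr` are hypothetical, the corpus is a toy token list, and the score is the standard log-likelihood ratio (G^2) for a 2x2 table written in its entropy form.

```python
import math


def x_log_x(k):
    """k * ln(k), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if k == 0 else k * math.log(k)


def entropy(*counts):
    """Unnormalized entropy of a set of counts: N ln N - sum(k ln k)."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) score for a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def trigram_table(tokens, w1, w2, w3):
    """Count k11..k22 over trigrams whose middle word is w2.

    Rows: first word is w1 / is not w1.
    Columns: third word is w3 / is not w3.
    Hypothetical helper for illustration only.
    """
    k11 = k12 = k21 = k22 = 0
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        if b != w2:
            continue  # only contexts ending in w2 are compared
        if a == w1:
            if c == w3:
                k11 += 1
            else:
                k12 += 1
        else:
            if c == w3:
                k21 += 1
            else:
                k22 += 1
    return k11, k12, k21, k22


tokens = "click and clack sat and ate and ran and click and clack".split()
table = trigram_table(tokens, "click", "and", "clack")
score = llr(*table)
```

A table whose rows are proportional (no association) scores close to zero, while a table like the toy one here, where "clack" appears only after "click and", scores well above zero.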
