This comparison is very interesting when made against a general corpus or against a specific sub-corpus already in your data.
You will often find that an n-gram is in one corpus and not in another, but the question becomes how much this happens (i.e. does LLR say that this happens often enough to be interesting). The max over the scores of many such comparisons then becomes the interesting number.

On Tue, Feb 16, 2010 at 11:01 AM, Drew Farris <[email protected]> wrote:

> I also was wondering if comparing the ngrams found in this corpus
> against a general corpus could be a worthwhile endeavor? Some quick
> and dirty work suggests that the overlap in n-grams between this
> domain-specific corpus and a general one is pretty low.
>

--
Ted Dunning, CTO
DeepDyve
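
For illustration, here is a minimal sketch of the comparison described in the reply above: score one n-gram with the log-likelihood ratio (LLR) on a 2x2 table of counts (this n-gram vs. all other n-grams, target corpus vs. reference corpus), then take the max over several reference corpora. The function names (`llr`, `compare`, `max_llr`) and the toy counts are hypothetical and chosen for this example only; this is not code from the thread or from any particular library.

```python
from collections import Counter
from math import log


def x_log_x(x):
    # x * ln(x), with the usual convention that 0 * ln(0) = 0.
    return x * log(x) if x > 0 else 0.0


def entropy(*counts):
    # Unnormalized entropy of a set of counts: N*ln(N) - sum(k*ln(k)).
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    # k11: n-gram count in target corpus, k12: n-gram count in reference,
    # k21: all other n-grams in target,  k22: all other n-grams in reference.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))


def compare(ngram, target, reference):
    # target and reference are Counters mapping n-gram -> count.
    k11 = target[ngram]
    k12 = reference[ngram]
    k21 = sum(target.values()) - k11
    k22 = sum(reference.values()) - k12
    return llr(k11, k12, k21, k22)


def max_llr(ngram, target, references):
    # The max over many comparisons is the number to look at.
    return max(compare(ngram, target, ref) for ref in references)


# Toy counts, made up for this example.
domain = Counter({"hidden markov": 50, "the cat": 5, "other": 945})
general = Counter({"the cat": 5, "other": 995})
print(compare("hidden markov", domain, general))
print(compare("the cat", domain, general))
```

An n-gram that appears at the same rate in both corpora scores near zero, while one that is frequent in the target and absent from the reference scores high, which is the "does this happen enough to be interesting" question in numeric form.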
