Indeed. I hadn't snapped to the fact that you were using trigrams. 30 million features is quite plausible for that. To use long n-grams effectively as features in document classification, you really need the following:

a) good statistical methods for resolving which features are useful and which are not (everybody here knows that my preference for a first hack is sparsification with log-likelihood ratios)

b) some kind of smoothing using smaller n-grams

c) some kind of smoothing over variants of n-grams

AFAIK, Mahout doesn't have many (or any) of these in place. You are likely to do better with unigrams as a result.

On Wed, Jul 22, 2009 at 11:39 AM, Grant Ingersoll <[email protected]> wrote:

> I suspect the explosion in the number of features, Ted, is due to the use
> of n-grams producing a lot of unique terms. I can try w/ gramSize = 1, that
> will likely reduce the feature set quite a bit.

--
Ted Dunning, CTO DeepDyve
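For anyone following along, the log-likelihood ratio sparsification mentioned in (a) boils down to scoring each n-gram with Dunning's G² statistic on a 2x2 table of counts (n-gram vs. everything else, in one class vs. the rest) and keeping only high-scoring features. A minimal sketch, not Mahout's implementation, with a made-up function name:

```python
import math

def llr_2x2(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 contingency table.

    k11: n-gram count in the target class
    k12: n-gram count outside the target class
    k21: count of all other n-grams in the target class
    k22: count of all other n-grams outside the target class
    """
    def h(*counts):
        # Sum of k * ln(k) over the given counts, skipping zero cells.
        return sum(k * math.log(k) for k in counts if k > 0)

    total = k11 + k12 + k21 + k22
    return 2 * (h(k11, k12, k21, k22)        # individual cells
                - h(k11 + k12, k21 + k22)    # row sums
                - h(k11 + k21, k12 + k22)    # column sums
                + h(total))                  # grand total

# Counts consistent with independence score near zero;
# counts strongly associated with one class score high.
print(llr_2x2(10, 10, 10, 10))    # independence: ~0
print(llr_2x2(100, 1, 1, 100))    # strong association: large
```

To sparsify, you would compute this score for every (n-gram, class) pair and drop features below a chosen threshold, which is what makes 30 million trigram features tractable.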
