Indeed.  I hadn't snapped to the fact you were using trigrams.

30 million features is quite plausible for that.  To use long n-grams
effectively as features in document classification, you really need the
following:

a) good statistical methods for resolving what is useful and what is not.
Everybody here knows that my preference for a first hack is sparsification
with log-likelihood ratios.

b) some kind of smoothing using smaller n-grams.

c) some kind of smoothing over variants of n-grams.
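For (a), a minimal sketch of the 2x2 log-likelihood ratio test (Dunning's G^2 statistic) applied to n-gram selection; the counts in the usage example are made up for illustration:

```python
import math

def x_log_x(x):
    # Convention: 0 * log(0) == 0
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    # Unnormalized Shannon entropy term used in the G^2 computation
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    """G^2 for a 2x2 contingency table:
    k11 = n-gram count in class A,   k12 = n-gram count in class B,
    k21 = other n-grams in class A,  k22 = other n-grams in class B.
    Large values mean the n-gram's distribution differs between classes."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)

# Hypothetical counts: a trigram seen 30 times out of 1000 tokens in one
# class and twice out of 1000 in the other scores high; keep n-grams whose
# score clears a threshold and drop the rest to sparsify the feature set.
print(llr(30, 2, 970, 998))
print(llr(5, 5, 5, 5))  # identical distributions score ~0
```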

AFAIK, Mahout doesn't have most (or any) of these in place.  You are likely
to do better with unigrams as a result.

On Wed, Jul 22, 2009 at 11:39 AM, Grant Ingersoll <[email protected]> wrote:

> I suspect the explosion in the number of features, Ted, is due to the use
> of n-grams producing a lot of unique terms.  I can try w/ gramSize = 1, that
> will likely reduce the feature set quite a bit.
>



-- 
Ted Dunning, CTO
DeepDyve
