I'm taking a pretty naive (pun intended) approach here, from the viewpoint of someone coming to Mahout, and to ML for that matter, for the first time. (I'll also admit I haven't done a lot of practical classification myself, even if I've read many of the papers, so that viewpoint isn't a stretch for me.) I just want to get some basic classification working reasonably well to demonstrate the idea.
The code is all publicly available in Mahout. The Wikipedia data set I'm using is at http://people.apache.org/~gsingers/wikipedia/ (ignore the small files; the big bz2 file is the one I used).
I'm happy to share the commands I used:
1. WikipediaDataSetCreatorDriver: --input PATH/wikipedia/chunks/ --output PATH/wikipedia/subjects/out --categories PATH TO MAHOUT CODE/examples/src/test/resources/subjects.txt
2. TrainClassifier: --input PATH/wikipedia/subjects/out --output PATH/wikipedia/subjects/model --gramSize 3 --classifierType bayes
3. TestClassifier: --model PATH/wikipedia/subjects/model --testDir PATH/wikipedia/subjects/test --gramSize 3 --classifierType bayes
The training data was produced by the Wikipedia splitter (the first 60 chunks), and the test data was other chunks not in the first 60. (I haven't successfully completed a test run yet, or at least not one that produced even decent results.)
I suspect the explosion in the number of features, Ted, is due to the use of n-grams producing a lot of unique terms. I can try with gramSize = 1; that will likely reduce the feature set quite a bit.
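To make that concrete, here is a rough, generic sketch (plain Java, not Mahout's actual feature extraction; I'm also assuming gramSize means "all n-grams up to n") of how quickly n-grams multiply the unique feature count:

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NGramFeatureCount {
  // Collect the unique word n-grams of size 1..maxGram, roughly what a
  // gramSize setting of maxGram implies for the feature space.
  static Set<String> features(List<String> tokens, int maxGram) {
    Set<String> features = new HashSet<>();
    for (int n = 1; n <= maxGram; n++) {
      for (int i = 0; i + n <= tokens.size(); i++) {
        features.add(String.join("_", tokens.subList(i, i + n)));
      }
    }
    return features;
  }

  public static void main(String[] args) {
    List<String> tokens = Arrays.asList(
        "naive", "bayes", "works", "well", "on", "text", "like", "wikipedia", "text");
    // Even on nine tokens the jump is visible; on a Wikipedia dump it is huge.
    System.out.println("gramSize=1 features: " + features(tokens, 1).size());
    System.out.println("gramSize=3 features: " + features(tokens, 3).size());
  }
}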
I am using the WikipediaTokenizer from Lucene which does a better job
of removing cruft from Wikipedia than StandardAnalyzer.
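For reference, using that tokenizer standalone looks roughly like the sketch below. The package and the tokenizer/attribute API have shifted between Lucene releases, so treat this as an approximation against a 3.x-style API rather than something to paste in verbatim:

import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.wikipedia.WikipediaTokenizer;

public class WikiTokenDemo {
  public static void main(String[] args) throws Exception {
    // The tokenizer understands wiki markup ([[links]], '''bold''', categories, etc.)
    // and emits cleaner terms than a general-purpose tokenizer would.
    String wikiText = "'''Naive Bayes''' is a [[statistical classification|classifier]].";
    // NOTE: the package and constructor vary by Lucene release; newer versions
    // take no Reader in the constructor and use setReader() instead.
    Tokenizer tok = new WikipediaTokenizer(new StringReader(wikiText));
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    tok.reset();
    while (tok.incrementToken()) {
      System.out.println(term.toString());
    }
    tok.end();
    tok.close();
  }
}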
This is all based on my piecing things together from the wiki and the code, not on any great insight on my end.
-Grant
On Jul 22, 2009, at 2:24 PM, Ted Dunning wrote:
It is common to have more features than there are plausible words.
If these features are common enough to provide some support for the statistical inferences, then they are fine to use as long as they aren't target leaks. If they are rare (page URL, for instance), then they have little utility and should be pruned.
Pruning will generally improve accuracy as well as speed and memory use.
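(A minimal, generic sketch of the kind of pruning Ted describes: keep only features whose document frequency meets a minimum support threshold. This is just an illustration, not Mahout's implementation.)

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MinSupportPruner {
  // Keep only features that occur in at least minSupport documents.
  static Set<String> frequentFeatures(List<Set<String>> docs, int minSupport) {
    Map<String, Integer> docFreq = new HashMap<>();
    for (Set<String> doc : docs) {
      for (String feature : doc) {
        docFreq.merge(feature, 1, Integer::sum);
      }
    }
    Set<String> kept = new HashSet<>();
    for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
      if (e.getValue() >= minSupport) {
        kept.add(e.getKey());
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<Set<String>> docs = Arrays.asList(
        new HashSet<>(Arrays.asList("naive", "bayes", "text")),
        new HashSet<>(Arrays.asList("naive", "bayes", "100-sanfrancisco-ugs")),
        new HashSet<>(Arrays.asList("bayes", "wikipedia", "text")));
    // With minSupport = 2, the one-off token is dropped.
    System.out.println(frequentFeatures(docs, 2));
  }
}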
On Wed, Jul 22, 2009 at 11:19 AM, Robin Anil <[email protected]>
wrote:
Yes, I agree. Maybe we can add a prune step or a minSupport parameter to prune. But then again, a lot depends on the tokenizer used. Numeral-plus-string combinations like, say, 100-sanfrancisco-ugs are found in the Wikipedia data a lot, and they add more to the feature count than English words do.
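(For the numeral-plus-string combinations Robin mentions, a simple check along these lines, again just a sketch and not anything in Mahout, would catch most of them before they ever become features.)

public class DigitTokenFilter {
  // Returns true for tokens like "100-sanfrancisco-ugs" that mix digits and
  // letters; such tokens could be dropped (or normalized) before counting.
  static boolean looksLikeNumeralCombo(String token) {
    boolean hasDigit = false;
    boolean hasLetter = false;
    for (int i = 0; i < token.length(); i++) {
      char c = token.charAt(i);
      if (Character.isDigit(c)) hasDigit = true;
      if (Character.isLetter(c)) hasLetter = true;
    }
    return hasDigit && hasLetter;
  }

  public static void main(String[] args) {
    System.out.println(looksLikeNumeralCombo("100-sanfrancisco-ugs")); // true
    System.out.println(looksLikeNumeralCombo("wikipedia"));            // false
  }
}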
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search