I'm taking a pretty naive (pun intended) approach here, from the viewpoint of someone coming to Mahout, and to ML for that matter, for the first time. (I'll also admit I haven't done a lot of practical classification myself, even if I've read many of the papers, so that viewpoint isn't a stretch for me.) I just want to get some basic classification working reasonably well to demonstrate the idea.
The code is all publicly available in Mahout. The Wikipedia data set I'm using is at http://people.apache.org/~gsingers/wikipedia/ (ignore the small files; the big bz2 file is the one I used).
I'm happy to share the commands I used:
1. WikipediaDataSetCreatorDriver: --input PATH/wikipedia/chunks/ --output PATH/wikipedia/subjects/out --categories PATH TO MAHOUT CODE/examples/src/test/resources/subjects.txt
2. TrainClassifier: --input PATH/wikipedia/subjects/out --output PATH/wikipedia/subjects/model --gramSize 3 --classifierType bayes
3. TestClassifier: --model PATH/wikipedia/subjects/model --testDir PATH/wikipedia/subjects/test --gramSize 3 --classifierType bayes
The training data was produced by the Wikipedia splitter (the first 60 chunks), and the test data was other chunks not in the first 60. (I haven't successfully completed a test run yet, or at least not one that produced even decent results.)
I suspect the explosion in the number of features, Ted, is due to the use of n-grams producing a lot of unique terms. I can try with gramSize = 1; that will likely reduce the feature set quite a bit.
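To make that concrete, here is a rough, generic sketch (plain Java, not Mahout's actual feature extraction; I'm also assuming gramSize means "all n-grams up to n") of how quickly n-grams multiply the unique feature count:

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NGramFeatureCount {
  // Collect the unique word n-grams of size 1..maxGram, roughly what a
  // gramSize setting of maxGram implies for the feature space.
  static Set<String> features(List<String> tokens, int maxGram) {
    Set<String> features = new HashSet<>();
    for (int n = 1; n <= maxGram; n++) {
      for (int i = 0; i + n <= tokens.size(); i++) {
        features.add(String.join("_", tokens.subList(i, i + n)));
      }
    }
    return features;
  }

  public static void main(String[] args) {
    List<String> tokens = Arrays.asList(
        "naive", "bayes", "works", "well", "on", "text", "like", "wikipedia", "text");
    // Even on nine tokens the jump is visible; on a Wikipedia dump it is huge.
    System.out.println("gramSize=1 features: " + features(tokens, 1).size());
    System.out.println("gramSize=3 features: " + features(tokens, 3).size());
  }
}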
I am using the WikipediaTokenizer from Lucene which does a better job
of removing cruft from Wikipedia than StandardAnalyzer.
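For reference, using that tokenizer standalone looks roughly like the sketch below. The package and the tokenizer/attribute API have shifted between Lucene releases, so treat this as an approximation against a 3.x-style API rather than something to paste in verbatim:

import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.wikipedia.WikipediaTokenizer;

public class WikiTokenDemo {
  public static void main(String[] args) throws Exception {
    // The tokenizer understands wiki markup ([[links]], '''bold''', categories, etc.)
    // and emits cleaner terms than a general-purpose tokenizer would.
    String wikiText = "'''Naive Bayes''' is a [[statistical classification|classifier]].";
    // NOTE: the package and constructor vary by Lucene release; newer versions
    // take no Reader in the constructor and use setReader() instead.
    Tokenizer tok = new WikipediaTokenizer(new StringReader(wikiText));
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    tok.reset();
    while (tok.incrementToken()) {
      System.out.println(term.toString());
    }
    tok.end();
    tok.close();
  }
}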
This is all based on my piecing things together from the wiki and the code, not on any great insight on my end.
-Grant
On Jul 22, 2009, at 2:24 PM, Ted Dunning wrote:
It is common to have more features than there are plausible words.
If these features are common enough to provide some support for the statistical inferences, then they are fine to use as long as they aren't target leaks. If they are rare (page URL, for instance), then they have little utility and should be pruned.
Pruning will generally improve accuracy as well as speed and memory use.
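(A minimal, generic sketch of the kind of pruning Ted describes: keep only features whose document frequency meets a minimum support threshold. This is just an illustration, not Mahout's implementation.)

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MinSupportPruner {
  // Keep only features that occur in at least minSupport documents.
  static Set<String> frequentFeatures(List<Set<String>> docs, int minSupport) {
    Map<String, Integer> docFreq = new HashMap<>();
    for (Set<String> doc : docs) {
      for (String feature : doc) {
        docFreq.merge(feature, 1, Integer::sum);
      }
    }
    Set<String> kept = new HashSet<>();
    for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
      if (e.getValue() >= minSupport) {
        kept.add(e.getKey());
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<Set<String>> docs = Arrays.asList(
        new HashSet<>(Arrays.asList("naive", "bayes", "text")),
        new HashSet<>(Arrays.asList("naive", "bayes", "100-sanfrancisco-ugs")),
        new HashSet<>(Arrays.asList("bayes", "wikipedia", "text")));
    // With minSupport = 2, the one-off token is dropped.
    System.out.println(frequentFeatures(docs, 2));
  }
}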
On Wed, Jul 22, 2009 at 11:19 AM, Robin Anil <[email protected]>
wrote:
Yes, I agree. Maybe we can add a prune step or a minSupport parameter to prune. But then again, a lot depends on the tokenizer used. Numeral-plus-string combinations like, say, 100-sanfrancisco-ugs are found in the Wikipedia data a lot, and they add more to the feature count than English words do.
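(For the numeral-plus-string combinations Robin mentions, a simple check along these lines, again just a sketch and not anything in Mahout, would catch most of them before they ever become features.)

public class DigitTokenFilter {
  // Returns true for tokens like "100-sanfrancisco-ugs" that mix digits and
  // letters; such tokens could be dropped (or normalized) before counting.
  static boolean looksLikeNumeralCombo(String token) {
    boolean hasDigit = false;
    boolean hasLetter = false;
    for (int i = 0; i < token.length(); i++) {
      char c = token.charAt(i);
      if (Character.isDigit(c)) hasDigit = true;
      if (Character.isLetter(c)) hasLetter = true;
    }
    return hasDigit && hasLetter;
  }

  public static void main(String[] args) {
    System.out.println(looksLikeNumeralCombo("100-sanfrancisco-ugs")); // true
    System.out.println(looksLikeNumeralCombo("wikipedia"));            // false
  }
}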
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search