On Jul 22, 2009, at 4:13 PM, Miles Osborne wrote:
It is probably good to benchmark against standard datasets; for text
classification this tends to be the Reuters set:
http://www.daviddlewis.com/resources/testcollections/
That way you know if you are doing a good job.
Yeah, good point. The only problem is that, for my demo, I am doing it all on
Wikipedia, because I want coherent examples and don't want to have to
introduce another dataset. I know there are a few sources of error in the
process: we pick just a single category for a document even though documents
can have several, and we pick the first category that matches even though
multiple input categories might be present, or even combined in one
(e.g. History of Science).
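To make that failure mode concrete, here is a rough, hypothetical sketch (my
illustration, not the actual Mahout dataset-creator code; all names are made
up) of the first-match labeling just described:

import java.util.List;

public class FirstMatchLabel {
  // Assign the first target category found among a page's Wikipedia
  // categories; multi-label pages and compound categories such as
  // "History of Science" collapse to a single label.
  static String firstMatch(List<String> pageCategories, List<String> targets) {
    for (String category : pageCategories) {
      for (String target : targets) {
        if (category.toLowerCase().contains(target.toLowerCase())) {
          return target; // first match wins, even if other targets also apply
        }
      }
    }
    return "unknown"; // cf. "Default Category: unknown" in the output below
  }
}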
Still, good to try out w/ the Reuters collection as well. Sigh, I'll
put it on the list to do.
Miles
2009/7/22 Grant Ingersoll <[email protected]>
The model size is much smaller with unigrams. :-)
I'm not quite sure what constitutes good just yet, but I can report the
following using the commands I reported earlier, w/ the exception that I am
using unigrams:
I have two categories: History and Science
0. Splitter:
org.apache.mahout.classifier.bayes.WikipediaXmlSplitter --dumpFile PATH/wikipedia/enwiki-20070527-pages-articles.xml --outputDir PATH/wikipedia/chunks -c 64
Then prep:
org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver --input PATH/wikipedia/test-chunks/ --output PATH/wikipedia/subjects/test --categories PATH/mahout-clean/examples/src/test/resources/subjects.txt
(also do this for the training set)
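For reference, subjects.txt here presumably just lists the target category
names, one per line; with the two categories named above it would look
something like the following (the exact casing and format are an assumption
on my part):

History
Science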
1. Train set:
ls ../chunks
chunk-0001.xml through chunk-0039.xml (39 chunks in all)
2. Test Set:
ls
chunk-0101.xml chunk-0102.xml chunk-0103.xml chunk-0104.xml chunk-0105.xml chunk-0107.xml chunk-0108.xml chunk-0109.xml
chunk-0130.xml chunk-0131.xml chunk-0132.xml chunk-0133.xml chunk-0134.xml chunk-0135.xml chunk-0137.xml chunk-0139.xml
3. Run the Trainer on the train set:
--input PATH/wikipedia/subjects/out --output PATH/wikipedia/subjects/model --gramSize 1 --classifierType bayes
4. Run the TestClassifier.
--model PATH/wikipedia/subjects/model --testDir PATH/wikipedia/subjects/test --gramSize 1 --classifierType bayes
Output is:
<snip>
9/07/22 15:55:09 INFO bayes.TestClassifier:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances : 4143 74.0615%
Incorrectly Classified Instances : 1451 25.9385%
Total Classified Instances : 5594
=======================================================
Confusion Matrix
-------------------------------------------------------
a b <--Classified as
3910 186 | 4096 a = history
1265 233 | 1498 b = science
Default Category: unknown: 2
</snip>
At least it's better than 50%, which is presumably a good thing ;-) I have
no clue what the state of the art is these days, but it doesn't seem
_horrendous_ either.
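One caveat on the "better than 50%" reading: the two classes are not
balanced, so the matrix above implies a higher do-nothing baseline. A quick
check, plain Java arithmetic using only the counts reported above:

public class BaselineCheck {
  public static void main(String[] args) {
    int histAsHist = 3910, histAsSci = 186; // row a = history
    int sciAsHist = 1265, sciAsSci = 233;   // row b = science
    int total = histAsHist + histAsSci + sciAsHist + sciAsSci;          // 5594
    double accuracy = 100.0 * (histAsHist + sciAsSci) / total;          // 74.0615%
    double majorityBaseline = 100.0 * (histAsHist + histAsSci) / total; // 73.2213%
    System.out.printf("accuracy = %.4f%%, always-guess-history = %.4f%%%n",
        accuracy, majorityBaseline);
  }
}

So always guessing "history" would already score about 73.2%.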
I'd love to see someone validate what I have done. Let me know if you need
more details. I'd also like to know how I can improve it.
On Jul 22, 2009, at 3:15 PM, Ted Dunning wrote:
Indeed. I hadn't snapped to the fact you were using trigrams.
30 million features is quite plausible for that. To effectively use long
n-grams as features in classification of documents you really need to have
the following:
a) good statistical methods for resolving what is useful and what is not.
Everybody here knows that my preference for a first hack is sparsification
with log-likelihood ratios.
b) some kind of smoothing using smaller n-grams
c) some kind of smoothing over variants of n-grams.
AFAIK, Mahout doesn't have many (or any) of these in place. You are likely
to do better with unigrams as a result.
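For point (a), the score in question is the G^2 statistic from Dunning
(1993). A minimal sketch (my illustration, not Mahout's implementation) for
a 2x2 feature-vs-class contingency table:

public final class Llr {
  // x * ln(x), with the convention 0 * ln(0) = 0
  private static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  // k11: feature & class, k12: feature & other classes,
  // k21: class without the feature, k22: neither.
  // Larger scores mean the feature and class co-occur more than chance.
  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rows = xLogX(k11 + k12) + xLogX(k21 + k22);
    double cols = xLogX(k11 + k21) + xLogX(k12 + k22);
    double cells = xLogX(k11) + xLogX(k12) + xLogX(k21) + xLogX(k22);
    double total = xLogX(k11 + k12 + k21 + k22);
    return 2.0 * (cells - rows - cols + total);
  }
}

Sparsification then just means keeping only the n-grams whose score clears
some threshold.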
On Wed, Jul 22, 2009 at 11:39 AM, Grant Ingersoll <[email protected]> wrote:
I suspect the explosion in the number of features, Ted, is due to the use
of n-grams producing a lot of unique terms. I can try w/ gramSize = 1; that
will likely reduce the feature set quite a bit.
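To see why, a small sketch (illustrative only, not Mahout code) that counts
unique n-gram features over a toy token stream; the vocabulary grows with
every distinct n-token window:

import java.util.*;

public class NGramCount {
  // Collect the distinct n-token windows of a token list.
  static Set<String> ngrams(List<String> tokens, int n) {
    Set<String> grams = new HashSet<>();
    for (int i = 0; i + n <= tokens.size(); i++) {
      grams.add(String.join(" ", tokens.subList(i, i + n)));
    }
    return grams;
  }

  public static void main(String[] args) {
    List<String> tokens = Arrays.asList(
        "the history of science is the study of the development of science".split(" "));
    for (int n = 1; n <= 3; n++) {
      System.out.println(n + "-grams: " + ngrams(tokens, n).size() + " unique features");
    }
  }
}

On a real corpus the gap is enormous: almost every trigram is rare, which is
exactly the 30-million-feature blowup Ted describes above.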
--
Ted Dunning, CTO
DeepDyve
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search