On Jul 22, 2009, at 4:13 PM, Miles Osborne wrote:

It is probably good to benchmark against standard datasets. For text
classification this tends to be the Reuters set:

http://www.daviddlewis.com/resources/testcollections/

This way you know if you are doing a good job.

Yeah, good point. The only problem is that, for my demo, I am doing it all on Wikipedia, because I want coherent examples and don't want to have to introduce another dataset. I know there are a few sources of error in the process: we pick just a single category for a document even though documents can have several, and we pick the first category that matches even though multiple input categories might be present, or even both categories in one (e.g. "History of Science").
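To make that failure mode concrete, here is roughly what such a first-match labeler looks like. This is a hypothetical sketch, not the actual WikipediaDatasetCreatorDriver code; the names are made up:

import java.util.List;
import java.util.Set;

public class FirstMatchLabeler {
  // Pick a single label: the first input category (in list order) that
  // matches any of the document's Wikipedia categories. A page tagged with
  // both "history" and "science" (or with "History of science", which
  // contains both) is credited to whichever input category is checked first.
  public static String label(Set<String> docCategories, List<String> inputCategories) {
    for (String candidate : inputCategories) {
      for (String docCategory : docCategories) {
        if (docCategory.toLowerCase().contains(candidate.toLowerCase())) {
          return candidate;
        }
      }
    }
    return "unknown"; // no match: fall through to the default category
  }
}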

Still, good to try out w/ the Reuters collection as well. Sigh, I'll put it on the list to do.



Miles

2009/7/22 Grant Ingersoll <[email protected]>

The model size is much smaller with unigrams.  :-)

I'm not quite sure what constitutes good just yet, but I can report the following, using the commands I reported earlier with the exception that I am using unigrams:

I have two categories:  History and Science

0. Splitter:
org.apache.mahout.classifier.bayes.WikipediaXmlSplitter
--dumpFile PATH/wikipedia/enwiki-20070527-pages-articles.xml --outputDir PATH/wikipedia/chunks -c 64

Then prep:
org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver
--input PATH/wikipedia/test-chunks/ --output PATH/wikipedia/subjects/test --categories PATH/mahout-clean/examples/src/test/resources/subjects.txt
(also do this for the training set)
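For reference, I'm assuming subjects.txt is just the category names, one per line; that matches the two categories above, but check the file in examples/src/test/resources to be sure:

history
science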

1. Train set:
ls ../chunks
chunk-0001.xml  chunk-0002.xml  chunk-0003.xml  chunk-0004.xml
chunk-0005.xml  chunk-0006.xml  chunk-0007.xml  chunk-0008.xml
chunk-0009.xml  chunk-0010.xml  chunk-0011.xml  chunk-0012.xml
chunk-0013.xml  chunk-0014.xml  chunk-0015.xml  chunk-0016.xml
chunk-0017.xml  chunk-0018.xml  chunk-0019.xml  chunk-0020.xml
chunk-0021.xml  chunk-0022.xml  chunk-0023.xml  chunk-0024.xml
chunk-0025.xml  chunk-0026.xml  chunk-0027.xml  chunk-0028.xml
chunk-0029.xml  chunk-0030.xml  chunk-0031.xml  chunk-0032.xml
chunk-0033.xml  chunk-0034.xml  chunk-0035.xml  chunk-0036.xml
chunk-0037.xml  chunk-0038.xml  chunk-0039.xml

2. Test Set:
ls
chunk-0101.xml  chunk-0102.xml  chunk-0103.xml  chunk-0104.xml
chunk-0105.xml  chunk-0107.xml  chunk-0108.xml  chunk-0109.xml
chunk-0130.xml  chunk-0131.xml  chunk-0132.xml  chunk-0133.xml
chunk-0134.xml  chunk-0135.xml  chunk-0137.xml  chunk-0139.xml

3. Run the Trainer on the train set:
--input PATH/wikipedia/subjects/out --output PATH/wikipedia/subjects/model
--gramSize 1 --classifierType bayes

4. Run the TestClassifier.

--model PATH/wikipedia/subjects/model --testDir
PATH/wikipedia/subjects/test --gramSize 1 --classifierType bayes

Output is:

<snip>
09/07/22 15:55:09 INFO bayes.TestClassifier:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :       4143       74.0615%
Incorrectly Classified Instances        :       1451       25.9385%
Total Classified Instances              :       5594

=======================================================
Confusion Matrix
-------------------------------------------------------
a       b       <--Classified as
3910    186      |  4096        a     = history
1265    233      |  1498        b     = science
Default Category: unknown: 2
</snip>
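A quick arithmetic restatement of that matrix, in case anyone wants to poke at the per-class numbers. This is just the figures above, nothing new:

public class ConfusionStats {
  public static void main(String[] args) {
    // Row "history": 3910 classified as history, 186 as science (4096 total).
    // Row "science": 1265 classified as history, 233 as science (1498 total).
    int hh = 3910, hs = 186, sh = 1265, ss = 233;
    int total = hh + hs + sh + ss;                                            // 5594
    System.out.printf("accuracy       = %.4f%n", (hh + ss) / (double) total); // 0.7406
    System.out.printf("history recall = %.4f%n", hh / (double) (hh + hs));    // ~0.9546
    System.out.printf("science recall = %.4f%n", ss / (double) (sh + ss));    // ~0.1555
  }
}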

At least it's better than 50%, which is presumably a good thing ;-) I have
no clue what the state of the art is these days, but it doesn't seem
_horrendous_ either.

I'd love to see someone validate what I have done. Let me know if you need
more details.  I'd also like to know how I can improve it.

On Jul 22, 2009, at 3:15 PM, Ted Dunning wrote:

Indeed.  I hadn't snapped to the fact you were using trigrams.

30 million features is quite plausible for that. To effectively use long n-grams as features in classification of documents you really need to have
the following:

a) good statistical methods for resolving what is useful and what is not. Everybody here knows that my preference for a first hack is sparsification with log-likelihood ratios (see the sketch after this list).

b) some kind of smoothing using smaller n-grams

c) some kind of smoothing over variants of n-grams.
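For (a), here is a minimal sketch of the G^2 log-likelihood ratio over a 2x2 contingency table, in the standard Dunning (1993) formulation. It is illustrative only, not any particular Mahout class:

public class Llr {
  private static double xLogX(long x) {
    return x == 0L ? 0.0 : x * Math.log(x);
  }

  // k11: feature in class, k12: feature out of class,
  // k21: other features in class, k22: other features out of class.
  // High scores flag features worth keeping; low scorers get sparsified away.
  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rowEntropy = xLogX(k11 + k12) + xLogX(k21 + k22);
    double colEntropy = xLogX(k11 + k21) + xLogX(k12 + k22);
    double matEntropy = xLogX(k11) + xLogX(k12) + xLogX(k21) + xLogX(k22);
    return 2.0 * (matEntropy - rowEntropy - colEntropy + xLogX(k11 + k12 + k21 + k22));
  }
}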

AFAIK, Mahout doesn't have many or any of these in place. You are likely to do better with unigrams as a result.

On Wed, Jul 22, 2009 at 11:39 AM, Grant Ingersoll <[email protected]> wrote:

I suspect the explosion in the number of features, Ted, is due to the use of n-grams producing a lot of unique terms. I can try w/ gramSize = 1; that will likely reduce the feature set quite a bit.
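
To make that feature-count point concrete, here is a toy count of unique unigrams vs. trigrams. This is a hypothetical sketch, not Mahout code:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class NgramCount {
  // Every window of n consecutive tokens is a distinct feature, so the
  // number of unique n-grams keeps growing with the corpus long after the
  // unigram vocabulary has leveled off.
  public static Set<String> ngrams(String[] tokens, int n) {
    Set<String> grams = new HashSet<>();
    for (int i = 0; i + n <= tokens.length; i++) {
      grams.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
    }
    return grams;
  }

  public static void main(String[] args) {
    String[] tokens = "the cat sat on the mat and the dog sat on the rug".split(" ");
    System.out.println(ngrams(tokens, 1).size()); // 8 unique unigrams
    System.out.println(ngrams(tokens, 3).size()); // 10 unique trigrams
  }
}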




--
Ted Dunning, CTO
DeepDyve

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
