On Jul 22, 2009, at 4:13 PM, Miles Osborne wrote:
It is probably good to benchmark against standard datasets; for text
classification this tends to be the Reuters set:
http://www.daviddlewis.com/resources/testcollections/
That way you know if you are doing a good job.
Yeah, good point. The only problem is that, for my demo, I am doing it all on
Wikipedia, because I want coherent examples and don't want to have to
introduce another dataset. I know there are a few sources of error in the
process: we pick just a single category for a document even though documents
can have several, and we pick the first category that matches even though
multiple input categories might be present, or even combined in one
(e.g. History of Science).
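To make that failure mode concrete, here is a rough, hypothetical sketch (my
illustration, not the actual Mahout dataset-creator code; all names are made
up) of the first-match labeling just described:

import java.util.List;

public class FirstMatchLabel {
  // Assign the first target category found among a page's Wikipedia
  // categories; multi-label pages and compound categories such as
  // "History of Science" collapse to a single label.
  static String firstMatch(List<String> pageCategories, List<String> targets) {
    for (String category : pageCategories) {
      for (String target : targets) {
        if (category.toLowerCase().contains(target.toLowerCase())) {
          return target; // first match wins, even if other targets also apply
        }
      }
    }
    return "unknown"; // cf. "Default Category: unknown" in the output below
  }
}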
Still, good to try out w/ the Reuters collection as well. Sigh, I'll
put it on the list to do.
Miles
2009/7/22 Grant Ingersoll <[email protected]>
The model size is much smaller with unigrams. :-)
I'm not quite sure what constitutes good just yet, but I can report the
following using the commands I reported earlier, w/ the exception that I am
using unigrams:
I have two categories: History and Science
0. Splitter:
org.apache.mahout.classifier.bayes.WikipediaXmlSplitter --dumpFile PATH/wikipedia/enwiki-20070527-pages-articles.xml --outputDir PATH/wikipedia/chunks -c 64
Then prep:
org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver --input PATH/wikipedia/test-chunks/ --output PATH/wikipedia/subjects/test --categories PATH/mahout-clean/examples/src/test/resources/subjects.txt
(also do this for the training set)
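For reference, subjects.txt here presumably just lists the target category
names, one per line; with the two categories named above it would look
something like the following (the exact casing and format are an assumption
on my part):

History
Science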
1. Train set:
ls ../chunks
chunk-0001.xml through chunk-0039.xml (39 chunks in all)
2. Test Set:
ls
chunk-0101.xml chunk-0102.xml chunk-0103.xml chunk-0104.xml chunk-0105.xml chunk-0107.xml chunk-0108.xml chunk-0109.xml
chunk-0130.xml chunk-0131.xml chunk-0132.xml chunk-0133.xml chunk-0134.xml chunk-0135.xml chunk-0137.xml chunk-0139.xml
3. Run the Trainer on the train set:
--input PATH/wikipedia/subjects/out --output PATH/wikipedia/subjects/model --gramSize 1 --classifierType bayes
4. Run the TestClassifier.
--model PATH/wikipedia/subjects/model --testDir PATH/wikipedia/subjects/test --gramSize 1 --classifierType bayes
Output is:
<snip>
9/07/22 15:55:09 INFO bayes.TestClassifier:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances : 4143 74.0615%
Incorrectly Classified Instances : 1451 25.9385%
Total Classified Instances : 5594
=======================================================
Confusion Matrix
-------------------------------------------------------
a b <--Classified as
3910 186 | 4096 a = history
1265 233 | 1498 b = science
Default Category: unknown: 2
</snip>
At least it's better than 50%, which is presumably a good thing ;-) I have
no clue what the state of the art is these days, but it doesn't seem
_horrendous_ either.
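One caveat on the "better than 50%" reading: the two classes are not
balanced, so the matrix above implies a higher do-nothing baseline. A quick
check, plain Java arithmetic using only the counts reported above:

public class BaselineCheck {
  public static void main(String[] args) {
    int histAsHist = 3910, histAsSci = 186; // row a = history
    int sciAsHist = 1265, sciAsSci = 233;   // row b = science
    int total = histAsHist + histAsSci + sciAsHist + sciAsSci;          // 5594
    double accuracy = 100.0 * (histAsHist + sciAsSci) / total;          // 74.0615%
    double majorityBaseline = 100.0 * (histAsHist + histAsSci) / total; // 73.2213%
    System.out.printf("accuracy = %.4f%%, always-guess-history = %.4f%%%n",
        accuracy, majorityBaseline);
  }
}

So always guessing "history" would already score about 73.2%.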
I'd love to see someone validate what I have done. Let me know if you need
more details. I'd also like to know how I can improve it.
On Jul 22, 2009, at 3:15 PM, Ted Dunning wrote:
Indeed. I hadn't snapped to the fact you were using trigrams.
30 million features is quite plausible for that. To effectively use long
n-grams as features in classification of documents you really need to have
the following:
a) good statistical methods for resolving what is useful and what is not.
Everybody here knows that my preference for a first hack is sparsification
with log-likelihood ratios.
b) some kind of smoothing using smaller n-grams
c) some kind of smoothing over variants of n-grams.
AFAIK, Mahout doesn't have many (or any) of these in place. You are likely
to do better with unigrams as a result.
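For point (a), the score in question is the G^2 statistic from Dunning
(1993). A minimal sketch (my illustration, not Mahout's implementation) for
a 2x2 feature-vs-class contingency table:

public final class Llr {
  // x * ln(x), with the convention 0 * ln(0) = 0
  private static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  // k11: feature & class, k12: feature & other classes,
  // k21: class without the feature, k22: neither.
  // Larger scores mean the feature and class co-occur more than chance.
  public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
    double rows = xLogX(k11 + k12) + xLogX(k21 + k22);
    double cols = xLogX(k11 + k21) + xLogX(k12 + k22);
    double cells = xLogX(k11) + xLogX(k12) + xLogX(k21) + xLogX(k22);
    double total = xLogX(k11 + k12 + k21 + k22);
    return 2.0 * (cells - rows - cols + total);
  }
}

Sparsification then just means keeping only the n-grams whose score clears
some threshold.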
On Wed, Jul 22, 2009 at 11:39 AM, Grant Ingersoll <[email protected]> wrote:
I suspect the explosion in the number of features, Ted, is due to the use
of n-grams producing a lot of unique terms. I can try w/ gramSize = 1; that
will likely reduce the feature set quite a bit.
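To see why, a small sketch (illustrative only, not Mahout code) that counts
unique n-gram features over a toy token stream; the vocabulary grows with
every distinct n-token window:

import java.util.*;

public class NGramCount {
  // Collect the distinct n-token windows of a token list.
  static Set<String> ngrams(List<String> tokens, int n) {
    Set<String> grams = new HashSet<>();
    for (int i = 0; i + n <= tokens.size(); i++) {
      grams.add(String.join(" ", tokens.subList(i, i + n)));
    }
    return grams;
  }

  public static void main(String[] args) {
    List<String> tokens = Arrays.asList(
        "the history of science is the study of the development of science".split(" "));
    for (int n = 1; n <= 3; n++) {
      System.out.println(n + "-grams: " + ngrams(tokens, n).size() + " unique features");
    }
  }
}

On a real corpus the gap is enormous: almost every trigram is rare, which is
exactly the 30-million-feature blowup Ted describes above.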
--
Ted Dunning, CTO
DeepDyve
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search