Hi Philippe,
Sorry I couldn't reply earlier this week; it has been hectic. I ran the
Mahout classifier on your data split, and the results are below: per-class
accuracy (in percent), then the summary and the confusion matrix. It's
giving 84.4% accuracy, which is close to what MALLET gave you. But then
again, I am using the latest changes in the CNB classifier. I would really
like to know which version of the CNB classifier patch (
https://issues.apache.org/jira/browse/MAHOUT-60) you used, so I can figure
out what went wrong.
sci.med 92.39905
talk.politics.guns 93.5867
talk.politics.mideast 96.67458
rec.sport.baseball 76.47059
comp.os.ms-windows.misc 70.07126
comp.sys.mac.hardware 78.14727
talk.religion.misc 78.26087
sci.electronics 78.645836
comp.graphics 81.23515
sci.crypt 92.87411
comp.sys.ibm.pc.hardware 70.54632
rec.motorcycles 94.299286
rec.sport.hockey 70.17544
talk.politics.misc 90.67796
alt.atheism 82.24299
soc.religion.christian 86.93587
sci.space 93.03797
misc.forsale 33.884296
comp.windows.x 85.27316
rec.autos 88.12351
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances : 5440 84.472%
Incorrectly Classified Instances : 1000 15.528%
Total Classified Instances : 6440
=======================================================
Confusion Matrix
-------------------------------------------------------
   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t  <--Classified as
 397   1   1   2   0   0   1   0   0  14   1   0   0   2   0   0   1   0   1   0  |  421  a = rec.motorcycles
   5  41   5   4   3   0   5   0   1   7   3  11  12   3   0   1   7  10   2   1  |  121  b = misc.forsale
   3   2 295   4  15   1   5   0   3   0   2   6  31   0   0   2   2  44   5   1  |  421  c = comp.os.ms-windows.misc
   0   0   1 321   0   0   1   1   5   0   2   0   0  12   0   7   0   0   2   2  |  354  d = talk.politics.misc
   2   0   4   0 359   0   3   0   1   1   5   1   1   1   0   2   0  36   5   0  |  421  e = comp.windows.x
   0   0   0   6   0 366   2   0  31   0   0   0   0   3   0   4   0   1   1   7  |  421  f = soc.religion.christian
   0   0   0   6   0   2 389   0   0   0   6   0   0   2   0   5   4   4   3   0  |  421  g = sci.med
   0   0   0   9   1   0   6  65   0   0   1   0   0   1   0   1   0   0   0   1  |   85  h = rec.sport.baseball
   0   0   0  11   1  17   2   0 162   0   0   0   0   3   0   3   0   0   0   8  |  207  i = talk.religion.misc
  10   1   1   7   0   0   3   0   3 371   2   2   3   3   0   4   5   3   3   0  |  421  j = rec.autos
   1   0   0   1   0   0   0   0   3   1 147   0   0   1   0   1   1   1   1   0  |  158  k = sci.space
   0   0  13   1  10   0   5   0   0   0   4 329  11   1   0   0  14  26   7   0  |  421  l = comp.sys.mac.hardware
   0   2  30   1  11   0   2   1   0   1   3  25 297   0   0   3  13  22  10   0  |  421  m = comp.sys.ibm.pc.hardware
   1   0   1  15   0   0   1   0   3   0   1   0   0 394   0   2   0   0   2   1  |  421  n = talk.politics.guns
   1   0   0   8   0   0   1   1   0   0   0   0   0   1  40   2   0   3   0   0  |   57  o = rec.sport.hockey
   0   0   0   7   2   1   0   0   0   0   1   0   0   3   0 407   0   0   0   0  |  421  p = talk.politics.mideast
   0   3   3   0   1   0   6   0   0   3   2   6   4   1   0   0 151   9   3   0  |  192  q = sci.electronics
   1   0   7   0  34   2   6   0   1   0   2   3   8   3   0   1   5 342   5   1  |  421  r = comp.graphics
   0   0   1   7   3   0   6   0   2   0   0   2   0   5   0   0   0   4 391   0  |  421  s = sci.crypt
   0   0   0   3   1   8   0   0  20   0   0   0   0   0   1   4   0   0   1 176  |  214  t = alt.atheism
real 0m41.406s
user 0m39.674s
sys 0m0.612s
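In case it helps with cross-checking against your MALLET run, here is a minimal Python sketch of how the per-class and overall accuracy figures follow from the confusion matrix: each per-class figure is the diagonal count divided by the row total, and the overall figure is the sum of the diagonal divided by the total number of instances. This is only an illustration and not Mahout code; the names RESULTS, per_class_accuracy, overall_accuracy and macro_accuracy are made up for the example. The counts are copied from the matrix above.

```python
# Illustrative sketch only (not part of Mahout): recompute accuracy figures
# from the confusion matrix. Each entry is (diagonal count, row total).
RESULTS = {
    "rec.motorcycles": (397, 421),
    "misc.forsale": (41, 121),
    "comp.os.ms-windows.misc": (295, 421),
    "talk.politics.misc": (321, 354),
    "comp.windows.x": (359, 421),
    "soc.religion.christian": (366, 421),
    "sci.med": (389, 421),
    "rec.sport.baseball": (65, 85),
    "talk.religion.misc": (162, 207),
    "rec.autos": (371, 421),
    "sci.space": (147, 158),
    "comp.sys.mac.hardware": (329, 421),
    "comp.sys.ibm.pc.hardware": (297, 421),
    "talk.politics.guns": (394, 421),
    "rec.sport.hockey": (40, 57),
    "talk.politics.mideast": (407, 421),
    "sci.electronics": (151, 192),
    "comp.graphics": (342, 421),
    "sci.crypt": (391, 421),
    "alt.atheism": (176, 214),
}


def per_class_accuracy(results):
    """Percentage of each class's test instances that were classified correctly."""
    return {label: 100.0 * correct / total
            for label, (correct, total) in results.items()}


def overall_accuracy(results):
    """Percentage of all test instances classified correctly (weighted by class size)."""
    correct = sum(c for c, _ in results.values())
    total = sum(t for _, t in results.values())
    return 100.0 * correct / total


def macro_accuracy(results):
    """Unweighted mean of the per-class accuracies."""
    acc = per_class_accuracy(results)
    return sum(acc.values()) / len(acc)


if __name__ == "__main__":
    # Print classes from worst to best, then the two aggregate figures.
    for label, acc in sorted(per_class_accuracy(RESULTS).items(), key=lambda kv: kv[1]):
        print(f"{label:28s} {acc:8.3f}")
    print(f"overall accuracy: {overall_accuracy(RESULTS):.3f}%")
    print(f"macro accuracy:   {macro_accuracy(RESULTS):.3f}%")
```

Because the test classes in this split are not the same size, the unweighted mean of the per-class figures is not the same as the overall accuracy in the summary, so it is worth comparing like with like when looking at the MALLET number.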
On Tue, Jul 22, 2008 at 5:12 AM, Philippe Lamarche <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I just tried it with Mallet.
>
> http://mallet.cs.umass.edu/index.php/Main_Page
>
> I used the same training and testing files (on the 20News corpus) and
> got an 85% prediction accuracy.
>
> However, I also tried it on Mallet with my usual Enron corpus and only
> got 50% accuracy.
>
> I would say that there is probably something wrong with the Mahout
> classifier implementation. Also, the training data that I use with the
> Enron data set is probably not distinct enough to be used with a
> Bayesian classifier.
>
> Any ideas?
>
> Thanks,
>
> Philippe.
>
>
> On Sun, Jul 20, 2008 at 11:23 AM, Philippe Lamarche
> <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > I uploaded my split here:
> >
> > http://www.2shared.com/file/3624998/e9330a64/news-train-testtar.html
> >
> > (the download link is after all the ads, at the bottom of the page)
> >
> > The file contains the "news_test_1" and "news_train_1" folders, with
> > the original file/folder structure. The "news_ha_train_1" folder
> > contains the collapsed version of "news_train_1".
> >
> > The training files are not evenly distributed across the classes (some
> > classes contain fewer training files than others). This was done to
> > reflect the UC Berkeley Enron corpus.
> >
> > Thanks,
> > Philippe.
> >
> >
> > On Sun, Jul 20, 2008 at 10:08 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> >> I haven't done a lot of testing w/ M-9 yet, so it is more than likely
> >> there are bugs ;-)
> >>
> >> -Grant
> >>
> >> On Jul 20, 2008, at 6:21 AM, Miles Osborne wrote:
> >>
> >>> i think it would also be useful to cross-check your results against a
> >>> text classification system which is known to work. look at rainbow:
> >>>
> >>> http://www.cs.cmu.edu/~mccallum/bow/rainbow/
> >>>
> >>> if you get the correct results here then either you have somehow
> >>> messed-up with Mahout or else there really is a bug
> >>>
> >>> Miles
> >>>
> >>> 2008/7/20 Robin Anil <[EMAIL PROTECTED]>:
> >>>
> >>>> Can you upload your split somewhere?
> >>>>
> >>>> On Sun, Jul 20, 2008 at 6:46 AM, Philippe Lamarche <
> >>>> [EMAIL PROTECTED]> wrote:
> >>>>
> >>>>> Now, with the attachment.
> >>>>> Sorry.
> >>>>>
> >>>>> On Sat, Jul 19, 2008 at 9:13 PM, Philippe Lamarche
> >>>>> <[EMAIL PROTECTED]> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> I have been working for a little while with Mahout and the Bayesian
> >>>>>> classifier for a school project.
> >>>>>>
> >>>>>> I am using the Enron email corpus and the UC Berkeley classified
> >>>>>> emails (http://www.cs.cmu.edu/~enron/). I did a few tests and I can't
> >>>>>> seem to make it work. I wonder if I am doing something wrong.
> >>>>>>
> >>>>>> For example, I am getting under 10% correct predictions with Bayes
> >>>>>> and around 1% with CBayes. The problem seems to lie in the fact that
> >>>>>> all instances of a class will be predicted as another class, or that
> >>>>>> they will all be predicted as the class containing the most features.
> >>>>>>
> >>>>>> I also tested with the 20News corpus and I get a similar result, where
> >>>>>> all instances of a class will be predicted as another class (e.g.
> >>>>>> all 421 "rec.motorcycles" get predicted as "talk.politics.mideast").
> >>>>>> Attached are two confusion matrices displaying results for bayes and
> >>>>>> cbayes. Both used the same division into training and testing sets.
> >>>>>>
> >>>>>> Am I doing something wrong?
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Philippe Lamarche.
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>> Thanks
> >>>> Robin
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> The University of Edinburgh is a charitable body, registered in
> >>> Scotland, with registration number SC005336.
> >>
> >> --------------------------
> >> Grant Ingersoll
> >> http://www.lucidimagination.com
> >>
> >> Lucene Helpful Hints:
> >> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> >> http://wiki.apache.org/lucene-java/LuceneFAQ
> >>
> >>
> >
>
--
Robin Anil
Senior Dual Degree Student
Department of Computer Science & Engineering
IIT Kharagpur
--------------------------------------------------------------------------------------------
techdigger.wordpress.com
A discursive take on the world around us