Re: Problems with the Bayesian classifiers.

Robin Anil Sun, 27 Jul 2008 08:06:06 -0700

Hi Philippe,

I could only reply this week; it has been hectic. I ran the Mahout
classifier on your data split and here is the result. It's giving 84.47%
accuracy, which is close to what MALLET gave you. But then again, I am using
the latest changes in the CNB classifier. I would really like to know which
version of the CNB classifier patch
(https://issues.apache.org/jira/browse/MAHOUT-60) you used, so I can figure
out what went wrong.


sci.med    92.39905
talk.politics.guns    93.5867
talk.politics.mideast    96.67458
rec.sport.baseball    76.47059
comp.os.ms-windows.misc    70.07126
comp.sys.mac.hardware    78.14727
talk.religion.misc    78.26087
sci.electronics    78.645836
comp.graphics    81.23515
sci.crypt    92.87411
comp.sys.ibm.pc.hardware    70.54632
rec.motorcycles    94.299286
rec.sport.hockey    70.17544
talk.politics.misc    90.67796
alt.atheism    82.24299
soc.religion.christian    86.93587
sci.space    93.03797
misc.forsale    33.884296
comp.windows.x    85.27316
rec.autos    88.12351
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances          :       5440        84.472%
Incorrectly Classified Instances        :       1000        15.528%
Total Classified Instances              :       6440
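
A note on reading these numbers: each per-class figure is a percentage of
that class's own test documents, so the plain average of the twenty figures
need not equal the overall figure when the classes differ in size, as they do
in this split. Below is a minimal Python sketch of the two averages. It uses
only the values printed above; the function names are my own and are not part
of Mahout.

```python
# Per-class accuracies (percent), copied from the listing above.
per_class = {
    "sci.med": 92.39905,
    "talk.politics.guns": 93.5867,
    "talk.politics.mideast": 96.67458,
    "rec.sport.baseball": 76.47059,
    "comp.os.ms-windows.misc": 70.07126,
    "comp.sys.mac.hardware": 78.14727,
    "talk.religion.misc": 78.26087,
    "sci.electronics": 78.645836,
    "comp.graphics": 81.23515,
    "sci.crypt": 92.87411,
    "comp.sys.ibm.pc.hardware": 70.54632,
    "rec.motorcycles": 94.299286,
    "rec.sport.hockey": 70.17544,
    "talk.politics.misc": 90.67796,
    "alt.atheism": 82.24299,
    "soc.religion.christian": 86.93587,
    "sci.space": 93.03797,
    "misc.forsale": 33.884296,
    "comp.windows.x": 85.27316,
    "rec.autos": 88.12351,
}

# Counts copied from the Summary block.
correct, total = 5440, 6440


def micro_accuracy(correct, total):
    """Overall accuracy: every test document counts equally."""
    return 100.0 * correct / total


def macro_accuracy(per_class):
    """Unweighted mean of the per-class accuracies: every class counts equally."""
    return sum(per_class.values()) / len(per_class)


print(f"overall (micro):         {micro_accuracy(correct, total):.3f}%")
print(f"unweighted mean (macro): {macro_accuracy(per_class):.3f}%")
```

The unweighted mean comes out lower than the overall figure, mostly because
misc.forsale is both small and poorly classified.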

=======================================================
Confusion Matrix
-------------------------------------------------------
a    b    c    d    e    f    g    h    i    j    k    l    m    n    o    p    q    r    s    t    <--Classified as
397  1    1    2    0    0    1    0    0    14   1    0    0    2    0    0    1    0    1    0    |  421  a = rec.motorcycles
5    41   5    4    3    0    5    0    1    7    3    11   12   3    0    1    7    10   2    1    |  121  b = misc.forsale
3    2    295  4    15   1    5    0    3    0    2    6    31   0    0    2    2    44   5    1    |  421  c = comp.os.ms-windows.misc
0    0    1    321  0    0    1    1    5    0    2    0    0    12   0    7    0    0    2    2    |  354  d = talk.politics.misc
2    0    4    0    359  0    3    0    1    1    5    1    1    1    0    2    0    36   5    0    |  421  e = comp.windows.x
0    0    0    6    0    366  2    0    31   0    0    0    0    3    0    4    0    1    1    7    |  421  f = soc.religion.christian
0    0    0    6    0    2    389  0    0    0    6    0    0    2    0    5    4    4    3    0    |  421  g = sci.med
0    0    0    9    1    0    6    65   0    0    1    0    0    1    0    1    0    0    0    1    |  85   h = rec.sport.baseball
0    0    0    11   1    17   2    0    162  0    0    0    0    3    0    3    0    0    0    8    |  207  i = talk.religion.misc
10   1    1    7    0    0    3    0    3    371  2    2    3    3    0    4    5    3    3    0    |  421  j = rec.autos
1    0    0    1    0    0    0    0    3    1    147  0    0    1    0    1    1    1    1    0    |  158  k = sci.space
0    0    13   1    10   0    5    0    0    0    4    329  11   1    0    0    14   26   7    0    |  421  l = comp.sys.mac.hardware
0    2    30   1    11   0    2    1    0    1    3    25   297  0    0    3    13   22   10   0    |  421  m = comp.sys.ibm.pc.hardware
1    0    1    15   0    0    1    0    3    0    1    0    0    394  0    2    0    0    2    1    |  421  n = talk.politics.guns
1    0    0    8    0    0    1    1    0    0    0    0    0    1    40   2    0    3    0    0    |  57   o = rec.sport.hockey
0    0    0    7    2    1    0    0    0    0    1    0    0    3    0    407  0    0    0    0    |  421  p = talk.politics.mideast
0    3    3    0    1    0    6    0    0    3    2    6    4    1    0    0    151  9    3    0    |  192  q = sci.electronics
1    0    7    0    34   2    6    0    1    0    2    3    8    3    0    1    5    342  5    1    |  421  r = comp.graphics
0    0    1    7    3    0    6    0    2    0    0    2    0    5    0    0    0    4    391  0    |  421  s = sci.crypt
0    0    0    3    1    8    0    0    20   0    0    0    0    0    1    4    0    0    1    176  |  214  t = alt.atheism
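
For anyone who wants to cross-check the output, here is a rough Python sketch
that recomputes the per-class and overall figures from the confusion matrix
above. It assumes rows are the true classes and columns are the predicted
classes, which is consistent with each row summing to the total printed
beside it. The helper names are illustrative only and are not part of Mahout.

```python
labels = [
    "rec.motorcycles", "misc.forsale", "comp.os.ms-windows.misc",
    "talk.politics.misc", "comp.windows.x", "soc.religion.christian",
    "sci.med", "rec.sport.baseball", "talk.religion.misc", "rec.autos",
    "sci.space", "comp.sys.mac.hardware", "comp.sys.ibm.pc.hardware",
    "talk.politics.guns", "rec.sport.hockey", "talk.politics.mideast",
    "sci.electronics", "comp.graphics", "sci.crypt", "alt.atheism",
]

# Rows: true class a..t. Columns: predicted class a..t (copied from above).
matrix = [
    [397, 1, 1, 2, 0, 0, 1, 0, 0, 14, 1, 0, 0, 2, 0, 0, 1, 0, 1, 0],
    [5, 41, 5, 4, 3, 0, 5, 0, 1, 7, 3, 11, 12, 3, 0, 1, 7, 10, 2, 1],
    [3, 2, 295, 4, 15, 1, 5, 0, 3, 0, 2, 6, 31, 0, 0, 2, 2, 44, 5, 1],
    [0, 0, 1, 321, 0, 0, 1, 1, 5, 0, 2, 0, 0, 12, 0, 7, 0, 0, 2, 2],
    [2, 0, 4, 0, 359, 0, 3, 0, 1, 1, 5, 1, 1, 1, 0, 2, 0, 36, 5, 0],
    [0, 0, 0, 6, 0, 366, 2, 0, 31, 0, 0, 0, 0, 3, 0, 4, 0, 1, 1, 7],
    [0, 0, 0, 6, 0, 2, 389, 0, 0, 0, 6, 0, 0, 2, 0, 5, 4, 4, 3, 0],
    [0, 0, 0, 9, 1, 0, 6, 65, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1],
    [0, 0, 0, 11, 1, 17, 2, 0, 162, 0, 0, 0, 0, 3, 0, 3, 0, 0, 0, 8],
    [10, 1, 1, 7, 0, 0, 3, 0, 3, 371, 2, 2, 3, 3, 0, 4, 5, 3, 3, 0],
    [1, 0, 0, 1, 0, 0, 0, 0, 3, 1, 147, 0, 0, 1, 0, 1, 1, 1, 1, 0],
    [0, 0, 13, 1, 10, 0, 5, 0, 0, 0, 4, 329, 11, 1, 0, 0, 14, 26, 7, 0],
    [0, 2, 30, 1, 11, 0, 2, 1, 0, 1, 3, 25, 297, 0, 0, 3, 13, 22, 10, 0],
    [1, 0, 1, 15, 0, 0, 1, 0, 3, 0, 1, 0, 0, 394, 0, 2, 0, 0, 2, 1],
    [1, 0, 0, 8, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 40, 2, 0, 3, 0, 0],
    [0, 0, 0, 7, 2, 1, 0, 0, 0, 0, 1, 0, 0, 3, 0, 407, 0, 0, 0, 0],
    [0, 3, 3, 0, 1, 0, 6, 0, 0, 3, 2, 6, 4, 1, 0, 0, 151, 9, 3, 0],
    [1, 0, 7, 0, 34, 2, 6, 0, 1, 0, 2, 3, 8, 3, 0, 1, 5, 342, 5, 1],
    [0, 0, 1, 7, 3, 0, 6, 0, 2, 0, 0, 2, 0, 5, 0, 0, 0, 4, 391, 0],
    [0, 0, 0, 3, 1, 8, 0, 0, 20, 0, 0, 0, 0, 0, 1, 4, 0, 0, 1, 176],
]


def per_class_accuracy(matrix):
    """Diagonal cell divided by the row total, in percent, for each class."""
    return [100.0 * row[i] / sum(row) for i, row in enumerate(matrix)]


def overall_accuracy(matrix):
    """Return (correct, total, percent) over the whole matrix."""
    correct = sum(row[i] for i, row in enumerate(matrix))
    total = sum(sum(row) for row in matrix)
    return correct, total, 100.0 * correct / total


def worst_confusion(matrix):
    """Return (count, true_index, predicted_index) of the largest off-diagonal cell."""
    return max(
        (count, i, j)
        for i, row in enumerate(matrix)
        for j, count in enumerate(row)
        if i != j
    )


for label, acc in zip(labels, per_class_accuracy(matrix)):
    print(f"{label:28s}{acc:9.3f}")

correct, total, percent = overall_accuracy(matrix)
print(f"correct={correct} total={total} accuracy={percent:.3f}%")

count, i, j = worst_confusion(matrix)
print(f"most frequent mistake: {labels[i]} -> {labels[j]} ({count} documents)")
```

The diagonal sums to 5440 out of 6440, matching the Summary block, and the
largest single off-diagonal cell is comp.os.ms-windows.misc classified as
comp.graphics.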



real    0m41.406s
user    0m39.674s
sys    0m0.612s


On Tue, Jul 22, 2008 at 5:12 AM, Philippe Lamarche <[EMAIL PROTECTED]> wrote:

>  Hi,
>
> I just tried it with Mallet.
>
> http://mallet.cs.umass.edu/index.php/Main_Page
>
> I used the same training and testing files (on the 20News corpus) and
> got an 85% prediction accuracy.
>
> However, I also tried it on Mallet with my usual Enron corpus and only
> got a 50% accuracy.
>
> I would say that there is probably something wrong with the Mahout
> classifier implementation. Also, the training data that I use with the
> Enron data set is probably not distinct enough to be used with a
> Bayesian classifier.
>
> Any ideas?
>
> Thanks,
>
> Philippe.
>
>
> On Sun, Jul 20, 2008 at 11:23 AM, Philippe Lamarche
> <[EMAIL PROTECTED]> wrote:
> >  Hi,
> >
> > I uploaded my split here:
> >
> > http://www.2shared.com/file/3624998/e9330a64/news-train-testtar.html
> >
> > (the download link is after all the ads, at the bottom of the page)
> >
> > The file contains the "news_test_1" and "news_train_1" folders, with
> > the original file/folder structure. The "news_ha_train_1" folder
> > contains the collapsed version of "news_train_1".
> >
> > The training files are not evenly distributed across classes (some
> > classes contain fewer training files than others). This was done to
> > reflect the UC Berkeley Enron corpus.
> >
> > Thanks,
> > Philippe.
> >
> >
> > On Sun, Jul 20, 2008 at 10:08 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> >> I haven't done a lot of testing w/ M-9 yet, so it is more than likely
> >> there are bugs ;-)
> >>
> >> -Grant
> >>
> >> On Jul 20, 2008, at 6:21 AM, Miles Osborne wrote:
> >>
> >>> I think it would also be useful to cross-check your results against a
> >>> text classification system which is known to work. Look at rainbow:
> >>>
> >>> http://www.cs.cmu.edu/~mccallum/bow/rainbow/
> >>>
> >>> If you get the correct results there, then either you have somehow
> >>> messed up with Mahout or else there really is a bug.
> >>>
> >>> Miles
> >>>
> >>> 2008/7/20 Robin Anil <[EMAIL PROTECTED]>:
> >>>
> >>>> Can you upload your split somewhere?
> >>>>
> >>>> On Sun, Jul 20, 2008 at 6:46 AM, Philippe Lamarche <
> >>>> [EMAIL PROTECTED]> wrote:
> >>>>
> >>>>> Now, with the attachment.
> >>>>> Sorry.
> >>>>>
> >>>>> On Sat, Jul 19, 2008 at 9:13 PM, Philippe Lamarche
> >>>>> <[EMAIL PROTECTED]> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> I have been working for a little while with Mahout and the Bayesian
> >>>>>> classifier for a school project.
> >>>>>>
> >>>>>> I am using the Enron email corpus and the UC Berkeley classified
> >>>>>> emails (http://www.cs.cmu.edu/~enron/). I did a few tests and I can't
> >>>>>> seem to make it work. I wonder if I am doing something wrong.
> >>>>>>
> >>>>>> For example, I am getting correct predictions under 10% with Bayes
> >>>>>> and around 1% with CBayes. The problem seems to lie in the fact that
> >>>>>> all instances of a class will be predicted as another class, or that
> >>>>>> they will all be predicted as the class containing the most features.
> >>>>>>
> >>>>>> I also tested with the 20News corpus and I get a similar result, where
> >>>>>> all instances of a class will be predicted as another class (e.g. all
> >>>>>> 421 "rec.motorcycles" get predicted as "talk.politics.mideast").
> >>>>>> Attached are two confusion matrices displaying results for bayes and
> >>>>>> cbayes. Both used the same division into training and testing sets.
> >>>>>>
> >>>>>> Am I doing something wrong?
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Philippe Lamarche.
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>> Thanks
> >>>> Robin
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> The University of Edinburgh is a charitable body, registered in
> >>> Scotland, with registration number SC005336.
> >>
> >> --------------------------
> >> Grant Ingersoll
> >> http://www.lucidimagination.com
> >>
> >> Lucene Helpful Hints:
> >> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> >> http://wiki.apache.org/lucene-java/LuceneFAQ
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >
>



-- 
Robin Anil
Senior Dual Degree Student
Department of Computer Science & Engineering
IIT Kharagpur

--------------------------------------------------------------------------------------------
techdigger.wordpress.com
A discursive take on the world around us

www.minekey.com
You Might Like This

www.ithink.com
Express Yourself
