Hi,

I just tried it with Mallet.

http://mallet.cs.umass.edu/index.php/Main_Page

I used the same training and testing files (on the 20News corpus) and
got an 85% prediction accuracy.

However, I also tried it on Mallet with my usual Enron corpus and only
got a 50% accuracy.

Since Mallet does well on the same 20News split, I would say that
there is probably something wrong with the Mahout classifier
implementation. It also looks like the training data that I use with
the Enron data set is not distinct enough to be used with a Bayesian
classifier.
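
To tell a classifier bug apart from a hard data set, it may help to
compare each accuracy against a majority-class baseline and to look for
classes that collapse entirely into another label, as in the 20News run
where all 421 "rec.motorcycles" documents were predicted as
"talk.politics.mideast". Below is a minimal Python sketch, not Mahout or
Mallet code. The nested-dict confusion-matrix layout, the function
names, and the 0.9 threshold are assumptions made for illustration, and
every count in the example except the 421 is invented.

```python
def accuracy(confusion):
    """Fraction of documents whose predicted label equals the true label.

    confusion[true_label][predicted_label] = number of documents.
    """
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(row.get(label, 0) for label, row in confusion.items())
    return correct / total if total else 0.0


def majority_baseline(confusion):
    """Accuracy of always predicting the largest class in the test set."""
    sizes = [sum(row.values()) for row in confusion.values()]
    total = sum(sizes)
    return max(sizes) / total if total else 0.0


def collapsed_classes(confusion, threshold=0.9):
    """List (true, predicted, fraction) for classes sent mostly to one wrong label."""
    collapsed = []
    for label, row in confusion.items():
        size = sum(row.values())
        if size == 0:
            continue
        predicted, count = max(row.items(), key=lambda item: item[1])
        if predicted != label and count / size >= threshold:
            collapsed.append((label, predicted, count / size))
    return collapsed


# Illustrative matrix: only the 421 comes from the thread, the rest is made up.
example = {
    "rec.motorcycles": {"talk.politics.mideast": 421},
    "talk.politics.mideast": {"talk.politics.mideast": 300, "rec.motorcycles": 20},
}

print("accuracy:", accuracy(example))
print("majority baseline:", majority_baseline(example))
print("collapsed classes:", collapsed_classes(example))
```

An accuracy below the majority baseline together with whole classes
collapsing into one label would point more toward a bug than toward
data that is merely hard to separate.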

Any ideas?

Thanks,

Philippe.


On Sun, Jul 20, 2008 at 11:23 AM, Philippe Lamarche
<[EMAIL PROTECTED]> wrote:
>  Hi,
>
> I uploaded my split here:
>
> http://www.2shared.com/file/3624998/e9330a64/news-train-testtar.html
>
> (the download link is after all the ads, at the bottom of the page)
>
> The file contains the "news_test_1" and "news_train_1" folders, with
> the original file/folder structure. The "news_ha_train_1" folder
> contains the collapsed version of "news_train_1".
>
> The training files are not evenly distributed across classes (some
> classes contain fewer training files than others). This was done to
> reflect the UC Berkeley Enron corpus.
>
> Thanks,
> Philippe.
>
>
> On Sun, Jul 20, 2008 at 10:08 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>> I haven't done a lot of testing w/ M-9 yet, so it is more than likely there
>> are bugs ;-)
>>
>> -Grant
>>
>> On Jul 20, 2008, at 6:21 AM, Miles Osborne wrote:
>>
>>> I think it would also be useful to cross-check your results against a text
>>> classification system which is known to work. Look at Rainbow:
>>>
>>> http://www.cs.cmu.edu/~mccallum/bow/rainbow/
>>>
>>> If you get the correct results there, then either you have somehow made a
>>> mistake with Mahout or else there really is a bug.
>>>
>>> Miles
>>>
>>> 2008/7/20 Robin Anil <[EMAIL PROTECTED]>:
>>>
>>>> Can you upload your split somewhere?
>>>>
>>>> On Sun, Jul 20, 2008 at 6:46 AM, Philippe Lamarche <
>>>> [EMAIL PROTECTED]> wrote:
>>>>
>>>>> Now, with the attachment.
>>>>> Sorry.
>>>>>
>>>>> On Sat, Jul 19, 2008 at 9:13 PM, Philippe Lamarche
>>>>> <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have been working for a little while with Mahout and the Bayesian
>>>>>> classifier for a school project.
>>>>>>
>>>>>> I am using the Enron email corpus and the UC Berkeley classified
>>>>>> emails (http://www.cs.cmu.edu/~enron/). I did a few tests and I can't
>>>>>> seem to make it work. I wonder if I am doing something wrong.
>>>>>>
>>>>>> For example, I am getting under 10% correct predictions with Bayes and
>>>>>> around 1% with CBayes. The problem seems to be that all instances of a
>>>>>> class are predicted as another class, or that they are all predicted
>>>>>> as the class containing the most features.
>>>>>>
>>>>>> I also tested with the 20News corpus and I get a similar result, where
>>>>>> all instances of a class are predicted as another class (e.g. all
>>>>>> 421 "rec.motorcycles" get predicted as "talk.politics.mideast").
>>>>>> Attached are two confusion matrices showing the results for Bayes and
>>>>>> CBayes. Both used the same split into training and testing sets.
>>>>>>
>>>>>> Am I doing something wrong?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Philippe Lamarche.
>>>>>>
>>>>>
>>>>
>>>>
>>>> Thanks
>>>> Robin
>>>>
>>>
>>>
>>>
>>> --
>>> The University of Edinburgh is a charitable body, registered in Scotland,
>>> with registration number SC005336.
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com
>>
>> Lucene Helpful Hints:
>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>> http://wiki.apache.org/lucene-java/LuceneFAQ
>>
>
