Re: CNB: Learning from Huge Datasets

Grant Ingersoll Tue, 29 Jul 2008 06:59:24 -0700

Hi Robin,

Haven't looked at the patch to see if it is in there already, but
could you share your test code? I think it would make for a good demo
if people could just be pointed at the code plus a version of
Wikipedia (that's the data set you used, right?) and could then make
the run themselves. Would also be good to "wikify" it as docs.


-Grant

On Jul 28, 2008, at 6:26 AM, Robin Anil wrote:

Apparently it was overfitting. I used the Test-Train split given by
Phillipe on the mahout-user list.

When the algorithm was storing the weights of all the words in the
Complementary Class, the accuracy over the Test set was 90.2% and that
over the Train set itself was 99.32%. But the size of the model ~= number
of features x number of labels.

When the algorithm was storing the weights of just the words in the
Non-Complementary Class, the accuracy over the Test set was 84.47% and that
over the Train set was 99.90%. The model becomes a sparse matrix.
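
The trade-off between the two variants can be illustrated with a minimal
Python sketch. This is not the Mahout code: the toy labels and words are
hypothetical, and reading the sparse variant as "each label keeps only the
words seen in its own documents" is an assumption. Only the four accuracy
figures come from this message.

```python
def dense_model_size(num_features, num_labels):
    """Weights stored when every word has a weight for every label."""
    return num_features * num_labels


def sparse_model_size(words_per_label):
    """Weights stored when each label keeps only its own words (assumed reading)."""
    return sum(len(words) for words in words_per_label.values())


def generalization_gap(train_accuracy, test_accuracy):
    """Train accuracy minus test accuracy, in percentage points."""
    return round(train_accuracy - test_accuracy, 2)


# Hypothetical toy data: two labels sharing one word ("launch").
words_per_label = {
    "sci.space": {"orbit", "launch", "nasa"},
    "rec.autos": {"engine", "launch"},
}
vocabulary = set().union(*words_per_label.values())

dense = dense_model_size(len(vocabulary), len(words_per_label))
sparse = sparse_model_size(words_per_label)

# Accuracy figures reported above (percent).
gap_dense = generalization_gap(99.32, 90.2)
gap_sparse = generalization_gap(99.90, 84.47)

print("dense weights:", dense, "sparse weights:", sparse)
print("train/test gap, dense model:", gap_dense)
print("train/test gap, sparse model:", gap_sparse)
```

The sparse variant stores fewer weights but shows the wider gap between
Train and Test accuracy, which is consistent with the overfitting
described here.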

So I guess I will have to go back to the earlier method.

On Sat, Jul 12, 2008 at 11:54 AM, Robin Anil <[EMAIL PROTECTED]> wrote:
It's too soon for celebrations. This quick hack might have increased
overfitting. Keep fingers crossed.

Robin


On Sat, Jul 12, 2008 at 11:51 AM, Ted Dunning <[EMAIL PROTECTED]>
wrote:
Well done!

On Fri, Jul 11, 2008 at 11:18 PM, Robin Anil <[EMAIL PROTECTED]>
wrote:
The self-classification accuracy on the 20Newsgroups jumped from 98.2 to
99.87. And it solved the dense matrix problem also.


--------------------------
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
