Re: CNB: Learning from Huge Datasets

Robin Anil Thu, 31 Jul 2008 07:15:47 -0700

Hi Grant,
                For the past few days I have been puzzled by the values
Hadoop is calculating.  I compute Math.log( (Sigma_j + 1) / (Sigma_kSigma_j -
Sigma_k + Vocab Count) ) using multiple Map-Reduce jobs. The value I get is
0.5 lower than the one I get when I calculate it directly in a simple
single process. Do you know of any errors that are introduced when a
FloatWritable is saved to a file and read back?
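
A minimal sketch for narrowing this down, not a diagnosis. It assumes that
FloatWritable serializes its value with DataOutput.writeFloat and reads it
back with DataInput.readFloat, so plain java.io streams stand in for the
Hadoop classes here. The class name, method names, and counts are all
hypothetical. It checks two things: whether a float survives a write/read
round trip bit for bit, and whether a float accumulator can keep up when
many small counts are summed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper class; not part of Mahout or Hadoop.
class FloatPrecisionSketch {

    // Write a float with writeFloat and read it back with readFloat.
    static float roundTrip(float value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeFloat(value);
            out.flush();
            DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray()));
            return in.readFloat();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Add 1.0f to a float accumulator `count` times.
    static float floatSum(int count) {
        float sum = 0f;
        for (int i = 0; i < count; i++) {
            sum += 1.0f;
        }
        return sum;
    }

    // Same accumulation, but in a double.
    static double doubleSum(int count) {
        double sum = 0.0;
        for (int i = 0; i < count; i++) {
            sum += 1.0;
        }
        return sum;
    }

    // The expression from the message, evaluated in double precision.
    static double weight(double sigmaJ, double sigmaKSigmaJ,
                         double sigmaK, double vocabCount) {
        return Math.log((sigmaJ + 1.0) / (sigmaKSigmaJ - sigmaK + vocabCount));
    }

    public static void main(String[] args) {
        float original = 0.1f;
        boolean identical = Float.floatToIntBits(original)
            == Float.floatToIntBits(roundTrip(original));
        System.out.println("round trip bit-identical: " + identical);

        int count = 20000000; // made-up number of unit increments
        System.out.println("float sum:  " + floatSum(count));
        System.out.println("double sum: " + doubleSum(count));

        // Made-up counts: the same weight with an exact and a float-summed total.
        System.out.println("weight (double total): "
            + weight(10.0, doubleSum(count), 1000.0, 50000.0));
        System.out.println("weight (float total):  "
            + weight(10.0, floatSum(count), 1000.0, 50000.0));
    }
}
```

If the round trip is bit-identical, the file I/O itself is unlikely to be
the culprit. A float carries only 24 bits of mantissa, so a float sum of
unit increments stops growing at 16777216; accumulating the sigma terms in
float across reducers is one place where the two code paths could drift
apart. Whether that explains a gap of 0.5 depends on the actual counts.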


Robin


On Tue, Jul 29, 2008 at 7:28 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

> Hi Robin,
>
> Haven't looked at the patch to see if it is in there already, but could you
> share your test code?  I think it would make for a good demo if people could
> just be pointed at the code plus a version of Wikipedia (that's the data set
> you used, right?) and could then make the run themselves.  Would also be
> good to "wikify" it as docs.
>
> -Grant
>
>
> On Jul 28, 2008, at 6:26 AM, Robin Anil wrote:
>
>> Apparently, it was overfitting. I used the Test-Train split given by
>> Phillipe in the mahout-user list.
>>
>> When the algorithm was storing the weights of all the words in the
>> Complementary Class - The Accuracy over the Test set was 90.2% and that
>> over the Train set itself was 99.32%. But the Size of the Model ~=
>> Number of features x Number of labels
>>
>> When the algorithm was storing the weights of just the words in the
>> Non-Complementary Class - The Accuracy over the Test set was 84.47% and
>> that over the Train set was 99.90%.  The Model becomes a sparse Matrix.
>>
>> So I guess I will have to go back to the earlier method.
>>
>>
>>
>> On Sat, Jul 12, 2008 at 11:54 AM, Robin Anil <[EMAIL PROTECTED]>
>> wrote:
>>
>>> It is too soon for celebrations. This quick hack might have increased
>>> overfitting. Keep fingers crossed.
>>>
>>> Robin
>>>
>>>
>>> On Sat, Jul 12, 2008 at 11:51 AM, Ted Dunning <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> Well done!
>>>>
>>>> On Fri, Jul 11, 2008 at 11:18 PM, Robin Anil <[EMAIL PROTECTED]>
>>>> wrote:
>>>>
>>>>
>>>>>
>>>>> The self-classification accuracy on 20Newsgroups jumped from 98.2
>>>>> to 99.87. It also solved the dense matrix problem.
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>


-- 
Robin Anil
Senior Dual Degree Student
Department of Computer Science & Engineering
IIT Kharagpur

