On Wed, Jul 22, 2009 at 8:55 PM, Grant Ingersoll <[email protected]> wrote:

> On Jul 22, 2009, at 10:38 AM, Robin Anil wrote:
>
>> Dear Grant, Could you post some stats like the number of labels and
>> features that you have and the number of unique <label, feature> pairs?
>
> labels: history and science
> Docs trained on: chunks 1 - 60 generated using the Wikipedia Splitter with
> the WikipediaAnalyzer (MAHOUT-146) with chunk size set to 64
>
> Where are the <label, feature> values stored?

tf-Idf folder, part-****

>> Both Naive Bayes and Complementary Naive Bayes use the same data except
>> the Sigma_j set.
>
> So, why do I need to load it or even calculate it if I am using Bayes? I
> think I would like to have the choice. That is, if I plan on using both,
> then I can calculate/load both. At a minimum, when classifying with Bayes,
> we should not be loading it, even if we did calculate it.
>
> Could you add some writeup on http://cwiki.apache.org/MAHOUT/bayesian.html
> about the steps that are taken? Also, I've read the CNB paper, but do you
> have a reference for the NB part using many of these values?

Sure, I will ASAP.
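For context on why Sigma_j matters only for the complementary model: in the CNB formulation (Rennie et al., "Tackling the Poor Assumptions of Naive Bayes Text Classifiers"), the weight of feature i for label c is built from the counts of i in every label other than c, and the cheapest way to get that is (total weight of i over all labels) minus (weight of i in c); standard NB needs only the per-label counts. A rough sketch of the two weight computations follows, with invented variable names (sigmaJ here stands for the per-feature totals, not Mahout's actual data structures):

// Hypothetical sketch only -- names and layout are illustrative, not Mahout's classes.
public class BayesWeightSketch {

  /** Standard NB weight for feature i in label c: needs only per-label counts. */
  static double nbWeight(double[][] nCI, double[] nC, int c, int i,
                         double alpha, int numFeatures) {
    return Math.log((nCI[c][i] + alpha) / (nC[c] + alpha * numFeatures));
  }

  /**
   * Complementary NB weight for feature i in label c: built from the counts of
   * i in every label *other than* c, obtained here as the per-feature total
   * (the Sigma_j-style data) minus the in-label count.
   */
  static double cnbWeight(double[][] nCI, double[] nC, double[] sigmaJ,
                          double sigmaTotal, int c, int i,
                          double alpha, int numFeatures) {
    double complementCount = sigmaJ[i] - nCI[c][i];
    double complementTotal = sigmaTotal - nC[c];
    return -Math.log((complementCount + alpha)
        / (complementTotal + alpha * numFeatures));
  }
}

If that reading is right, the per-feature totals never enter the plain NB weight, which is consistent with loading them only when the complementary model is requested.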
>> But regardless, the matrix stored is sparse. I am not surprised that with
>> a larger set like the one you have taken, the memory limit was crossed.
>> Another thing: the number of unique terms in Wikipedia is quite large. So
>> the best choice for you right now is to use the HBase solution. The large
>> matrix is stored easily on it. I am currently writing the distributed
>> version of the HBase classification for parallelizing.
>
> HBase isn't an option right now, as it isn't committed and I'm putting
> together a demo on current capabilities.
>
>> Robin
>>
>> On Wed, Jul 22, 2009 at 4:53 PM, Grant Ingersoll <[email protected]> wrote:
>>
>>> The other thing is, I don't think Sigma_J is even used for Bayes, only
>>> Complementary Bayes.
>>>
>>> On Jul 22, 2009, at 7:16 AM, Grant Ingersoll wrote:
>>>
>>>> AFAICT, it is loading the Sum Feature Weights, stored in the Sigma_J
>>>> directory under the model. For me, this file is 1.04 GB. The values in
>>>> this file are loaded into a List of Doubles (which brings with it a
>>>> whole lot of auto-boxing, too). It seems like that should fit in memory,
>>>> especially since it is the first thing loaded, AFAICT. I have not looked
>>>> yet into the structure of the file itself.
>>>>
>>>> I guess I will have to dig deeper; this code has changed a lot from when
>>>> I first wrote it as a very simple naive bayes model to one that now
>>>> appears to be weighted by TF-IDF, normalization, etc., and I need to
>>>> understand it better.
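On the List-of-Doubles point: a boxed java.lang.Double carries an object header plus the 8-byte value (roughly 16 bytes, plus the reference that points at it), so a gigabyte-scale file of doubles can grow severalfold in memory once boxed, quite apart from the GC pressure. A minimal sketch of reading such values into a primitive double[] instead; the count-prefixed flat-file format here is an assumption for illustration, not the actual layout of the Sigma_j output:

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

// Hypothetical sketch: assumes a flat file holding a count followed by that
// many 8-byte doubles. It only demonstrates the primitive-array idea, not the
// real Sigma_j file format.
public class PrimitiveLoadSketch {
  static double[] loadDoubles(String path) throws IOException {
    try (DataInputStream in = new DataInputStream(
             new BufferedInputStream(new FileInputStream(path)))) {
      int n = in.readInt();              // number of values
      double[] values = new double[n];   // no per-value Double objects, no boxing
      for (int i = 0; i < n; i++) {
        values[i] = in.readDouble();
      }
      return values;
    }
  }
}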
>>>> On Jul 22, 2009, at 12:26 AM, Ted Dunning wrote:
>>>>
>>>>> This is kind of surprising. It would seem that this model shouldn't
>>>>> have more than a few doubles per unique term, and there should be
>>>>> <half a million terms. Even with pretty evil data structures, this
>>>>> really shouldn't be more than a few hundred megs for the model alone.
>>>>>
>>>>> Sparsity *is* a virtue with these models and I always try to eliminate
>>>>> terms that might as well have zero value, but that doesn't sound like
>>>>> the root problem here.
>>>>>
>>>>> Regarding strings or Writables: strings have the wonderful
>>>>> characteristic that they cache their hashed value. This means that
>>>>> hash maps are nearly as fast as arrays, because you wind up indexing
>>>>> to nearly the right place and then do a few (or one) integer compares
>>>>> to find the right value. Custom data types rarely do this and thus
>>>>> wind up slow.
>>>>>
>>>>> On Tue, Jul 21, 2009 at 7:41 PM, Grant Ingersoll <[email protected]> wrote:
>>>>>
>>>>>> I trained on a couple of categories (history and science) on quite a
>>>>>> few docs, but now the model is so big, I can't load it, even with
>>>>>> almost 3 GB of memory.
>>>>>
>>>>> --
>>>>> Ted Dunning, CTO
>>>>> DeepDyve

> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene: http://www.lucidimagination.com/search
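A small illustration of Ted's point about hashing: java.lang.String computes its hashCode() lazily on first use and caches it in a field, so repeated HashMap lookups with the same key object skip the recomputation; a custom key type gets the same benefit only if it caches its own hash, as the invented CachedKey below does:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustration only: String caches its hash after the first hashCode() call;
// a naive custom key recomputes it on every lookup. CachedKey is made up for
// this example and mimics String's lazy caching.
public class HashCachingSketch {

  static final class CachedKey {
    private final int[] termIds;
    private int hash;                      // 0 means "not yet computed", like String

    CachedKey(int[] termIds) { this.termIds = termIds.clone(); }

    @Override public int hashCode() {
      int h = hash;
      if (h == 0) {
        for (int id : termIds) {
          h = 31 * h + id;
        }
        hash = h;                          // cache for subsequent lookups
      }
      return h;
    }

    @Override public boolean equals(Object o) {
      return o instanceof CachedKey
          && Arrays.equals(termIds, ((CachedKey) o).termIds);
    }
  }

  public static void main(String[] args) {
    Map<String, Double> weights = new HashMap<>();
    weights.put("science", 1.23);
    String key = "science";
    key.hashCode();                        // first call computes and caches the hash
    // Subsequent lookups with the same String reuse the cached value.
    System.out.println(weights.get(key));
  }
}

The same trick applies to any immutable key, which is one reason plain Strings often hold up surprisingly well against custom key types in lookup-heavy code.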
