On Wed, Jul 22, 2009 at 8:55 PM, Grant Ingersoll <[email protected]> wrote:

> On Jul 22, 2009, at 10:38 AM, Robin Anil wrote:
>
>> Dear Grant, Could you post some stats like the number of labels and
>> features that you have and the number of unique <label, feature> pairs?
>
> labels: history and science
> Docs trained on: chunks 1 - 60 generated using the Wikipedia Splitter with
> the WikipediaAnalyzer (MAHOUT-146) with chunk size set to 64
>
> Where are the <label, feature> values stored?

tf-Idf folder, part-****

>> Both Naive Bayes and Complementary Naive Bayes use the same data except
>> the Sigma_j set.
>
> So, why do I need to load it or even calculate it if I am using Bayes? I
> think I would like to have the choice. That is, if I plan on using both,
> then I can calculate/load both. At a minimum, when classifying with Bayes,
> we should not be loading it, even if we did calculate it.
>
> Could you add some writeup on http://cwiki.apache.org/MAHOUT/bayesian.html
> about the steps that are taken? Also, I've read the CNB paper, but do you
> have a reference for the NB part using many of these values?

Sure, I will ASAP.
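For context on why Sigma_j matters only for the complementary model: in the CNB formulation (Rennie et al., "Tackling the Poor Assumptions of Naive Bayes Text Classifiers"), the weight of feature i for label c is built from the counts of i in every label other than c, and the cheapest way to get that is (total weight of i over all labels) minus (weight of i in c); standard NB needs only the per-label counts. A rough sketch of the two weight computations follows, with invented variable names (sigmaJ here stands for the per-feature totals, not Mahout's actual data structures):

// Hypothetical sketch only -- names and layout are illustrative, not Mahout's classes.
public class BayesWeightSketch {

  /** Standard NB weight for feature i in label c: needs only per-label counts. */
  static double nbWeight(double[][] nCI, double[] nC, int c, int i,
                         double alpha, int numFeatures) {
    return Math.log((nCI[c][i] + alpha) / (nC[c] + alpha * numFeatures));
  }

  /**
   * Complementary NB weight for feature i in label c: built from the counts of
   * i in every label *other than* c, obtained here as the per-feature total
   * (the Sigma_j-style data) minus the in-label count.
   */
  static double cnbWeight(double[][] nCI, double[] nC, double[] sigmaJ,
                          double sigmaTotal, int c, int i,
                          double alpha, int numFeatures) {
    double complementCount = sigmaJ[i] - nCI[c][i];
    double complementTotal = sigmaTotal - nC[c];
    return -Math.log((complementCount + alpha)
        / (complementTotal + alpha * numFeatures));
  }
}

If that reading is right, the per-feature totals never enter the plain NB weight, which is consistent with loading them only when the complementary model is requested.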
>> But regardless, the matrix stored is sparse. I am not surprised that with
>> a larger set like the one you have taken, the memory limit was crossed.
>> Another thing: the number of unique terms in Wikipedia is quite large. So
>> the best choice for you right now is to use the HBase solution. The large
>> matrix is stored easily on it. I am currently writing the distributed
>> version of the HBase classification for parallelizing.
>
> HBase isn't an option right now, as it isn't committed and I'm putting
> together a demo on current capabilities.
>
>> Robin
>>
>> On Wed, Jul 22, 2009 at 4:53 PM, Grant Ingersoll <[email protected]> wrote:
>>
>>> The other thing is, I don't think Sigma_J is even used for Bayes, only
>>> Complementary Bayes.
>>>
>>> On Jul 22, 2009, at 7:16 AM, Grant Ingersoll wrote:
>>>
>>>> AFAICT, it is loading the Sum Feature Weights, stored in the Sigma_J
>>>> directory under the model. For me, this file is 1.04 GB. The values in
>>>> this file are loaded into a List of Doubles (which brings with it a
>>>> whole lot of auto-boxing, too). It seems like that should fit in memory,
>>>> especially since it is the first thing loaded, AFAICT. I have not looked
>>>> yet into the structure of the file itself.
>>>>
>>>> I guess I will have to dig deeper; this code has changed a lot from when
>>>> I first wrote it as a very simple naive bayes model to one that now
>>>> appears to be weighted by TF-IDF, normalization, etc., and I need to
>>>> understand it better.
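On the List-of-Doubles point: a boxed java.lang.Double carries an object header plus the 8-byte value (roughly 16 bytes, plus the reference that points at it), so a gigabyte-scale file of doubles can grow severalfold in memory once boxed, quite apart from the GC pressure. A minimal sketch of reading such values into a primitive double[] instead; the count-prefixed flat-file format here is an assumption for illustration, not the actual layout of the Sigma_j output:

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

// Hypothetical sketch: assumes a flat file holding a count followed by that
// many 8-byte doubles. It only demonstrates the primitive-array idea, not the
// real Sigma_j file format.
public class PrimitiveLoadSketch {
  static double[] loadDoubles(String path) throws IOException {
    try (DataInputStream in = new DataInputStream(
             new BufferedInputStream(new FileInputStream(path)))) {
      int n = in.readInt();              // number of values
      double[] values = new double[n];   // no per-value Double objects, no boxing
      for (int i = 0; i < n; i++) {
        values[i] = in.readDouble();
      }
      return values;
    }
  }
}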
>>>> On Jul 22, 2009, at 12:26 AM, Ted Dunning wrote:
>>>>
>>>>> This is kind of surprising. It would seem that this model shouldn't
>>>>> have more than a few doubles per unique term, and there should be
>>>>> <half a million terms. Even with pretty evil data structures, this
>>>>> really shouldn't be more than a few hundred megs for the model alone.
>>>>>
>>>>> Sparsity *is* a virtue with these models and I always try to eliminate
>>>>> terms that might as well have zero value, but that doesn't sound like
>>>>> the root problem here.
>>>>>
>>>>> Regarding strings or Writables: strings have the wonderful
>>>>> characteristic that they cache their hashed value. This means that
>>>>> hash maps are nearly as fast as arrays, because you wind up indexing
>>>>> to nearly the right place and then do a few (or one) integer compares
>>>>> to find the right value. Custom data types rarely do this and thus
>>>>> wind up slow.
>>>>>
>>>>> On Tue, Jul 21, 2009 at 7:41 PM, Grant Ingersoll <[email protected]> wrote:
>>>>>
>>>>>> I trained on a couple of categories (history and science) on quite a
>>>>>> few docs, but now the model is so big, I can't load it, even with
>>>>>> almost 3 GB of memory.
>>>>>
>>>>> --
>>>>> Ted Dunning, CTO
>>>>> DeepDyve

> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene: http://www.lucidimagination.com/search
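A small illustration of Ted's point about hashing: java.lang.String computes its hashCode() lazily on first use and caches it in a field, so repeated HashMap lookups with the same key object skip the recomputation; a custom key type gets the same benefit only if it caches its own hash, as the invented CachedKey below does:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustration only: String caches its hash after the first hashCode() call;
// a naive custom key recomputes it on every lookup. CachedKey is made up for
// this example and mimics String's lazy caching.
public class HashCachingSketch {

  static final class CachedKey {
    private final int[] termIds;
    private int hash;                      // 0 means "not yet computed", like String

    CachedKey(int[] termIds) { this.termIds = termIds.clone(); }

    @Override public int hashCode() {
      int h = hash;
      if (h == 0) {
        for (int id : termIds) {
          h = 31 * h + id;
        }
        hash = h;                          // cache for subsequent lookups
      }
      return h;
    }

    @Override public boolean equals(Object o) {
      return o instanceof CachedKey
          && Arrays.equals(termIds, ((CachedKey) o).termIds);
    }
  }

  public static void main(String[] args) {
    Map<String, Double> weights = new HashMap<>();
    weights.put("science", 1.23);
    String key = "science";
    key.hashCode();                        // first call computes and caches the hash
    // Subsequent lookups with the same String reuse the cached value.
    System.out.println(weights.get(key));
  }
}

The same trick applies to any immutable key, which is one reason plain Strings often hold up surprisingly well against custom key types in lookup-heavy code.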
