Dear Grant,

Could you post some stats, such as the number of labels and features you have and the number of unique (label, feature) pairs? Both naive Bayes and complementary naive Bayes use the same data except for the Sigma_j set, and in either case the matrix is stored sparsely. With a training set as large as the one you have taken, I am not surprised that the memory limit was exceeded. Another thing to keep in mind is that the number of unique terms in Wikipedia is quite large. So the best choice for you right now is to use the HBase solution; the large matrix is stored easily in it. I am currently writing the distributed version of the HBase classification to parallelize it.
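For a rough sense of scale, here is a back-of-the-envelope sketch in plain Java (not Mahout code; the pair count and per-object overheads are assumptions) of what it costs to hold a boxed sparse label/feature weight matrix entirely in the heap. Once the number of non-zero pairs reaches Wikipedia scale, keeping the matrix out of core in HBase looks much more attractive:

// Back-of-the-envelope sketch, not Mahout code: approximate heap needed to hold
// a sparse label x feature weight matrix as boxed Doubles inside nested maps.
// The pair count and per-entry overheads below are assumptions for illustration.
public class ModelSizeEstimate {
  public static void main(String[] args) {
    long uniquePairs = 10000000L;     // assumed: 10 million non-zero (label, feature) weights
    long bytesPerEntry = 8            // the double value itself
                       + 16           // boxed Double object (typical 64-bit JVM)
                       + 48;          // rough map entry plus key/reference overhead
    double gigabytes = uniquePairs * bytesPerEntry / 1e9;
    System.out.printf("~%.1f GB of heap for %d non-zero weights%n", gigabytes, uniquePairs);
  }
}

Under those assumptions, ten million non-zero weights alone approach a gigabyte of heap before anything else is loaded.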
Robin

On Wed, Jul 22, 2009 at 4:53 PM, Grant Ingersoll <[email protected]> wrote:

> The other thing is, I don't even think Sigma_J is even used for Bayes, only
> Complementary Bayes.
>
> On Jul 22, 2009, at 7:16 AM, Grant Ingersoll wrote:
>
>> AFAICT, it is loading the Sum Feature Weights, stored in the Sigma_J
>> directory under the model. For me, this file is 1.04 GB. The values in
>> this file are loaded into a List of Doubles (which brings with it a whole
>> lot of auto-boxing, too). It seems like that should fit in memory,
>> especially since it is the first thing loaded, AFAICT. I have not looked
>> yet into the structure of the file itself.
>>
>> I guess I will have to dig deeper; this code has changed a lot from when I
>> first wrote it as a very simple naive bayes model to one that now appears
>> to be weighted by TF-IDF, normalization, etc., and I need to understand it
>> better.
>>
>> On Jul 22, 2009, at 12:26 AM, Ted Dunning wrote:
>>
>>> This is kind of surprising. It would seem that this model shouldn't have
>>> more than a few doubles per unique term and there should be <half a
>>> million terms. Even with pretty evil data structures, this really
>>> shouldn't be more than a few hundred megs for the model alone.
>>>
>>> Sparsity *is* a virtue with these models and I always try to eliminate
>>> terms that might as well have zero value, but that doesn't sound like the
>>> root problem here.
>>>
>>> Regarding strings or Writables, strings have the wonderful characteristic
>>> that they cache their hashed value. This means that hash maps are nearly
>>> as fast as arrays because you wind up indexing to nearly the right place
>>> and then do a few (or one) integer compares to find the right value.
>>> Custom data types rarely do this and thus wind up slow.
>>>
>>> On Tue, Jul 21, 2009 at 7:41 PM, Grant Ingersoll <[email protected]> wrote:
>>>
>>>> I trained on a couple of categories (history and science) on quite a few
>>>> docs, but now the model is so big, I can't load it, even with almost 3
>>>> GB of memory.
>>>
>>> --
>>> Ted Dunning, CTO
>>> DeepDyve
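To put some numbers next to the auto-boxing and model-size points quoted above, here is a minimal sketch (plain Java, not the actual Mahout loader) comparing the heap footprint of roughly half a million term weights held in a primitive double[] versus a List<Double>. The term count follows Ted's estimate; the per-reference and per-Double sizes are typical 64-bit JVM assumptions:

// Illustrative sketch only, not Mahout's Sigma_J loader.
// Compares a primitive double[] against a boxed List<Double> representation
// for roughly half a million term weights (Ted's "<half a million terms").
import java.util.ArrayList;
import java.util.List;

public class SigmaJFootprintSketch {
  public static void main(String[] args) {
    int terms = 500000;

    double[] primitiveWeights = new double[terms];        // 8 bytes per entry, one flat array
    List<Double> boxedWeights = new ArrayList<Double>(terms);
    for (int i = 0; i < terms; i++) {
      boxedWeights.add((double) i);                       // each add auto-boxes into a separate Double object
    }

    long primitiveBytes = terms * 8L;                      // raw double payload only
    long boxedBytes = terms * (8L /* reference */ + 16L /* Double object, assumed 64-bit JVM */);
    System.out.printf("double[]     ~%d MB of heap%n", primitiveBytes / (1024 * 1024));
    System.out.printf("List<Double> ~%d MB of heap%n", boxedBytes / (1024 * 1024));
  }
}

Even boxed, half a million weights come to only a few tens of megabytes, which is consistent with Ted's point that the model alone should not need multiple gigabytes of heap.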
