The other thing is, I don't think Sigma_J is even used for Bayes, only for Complementary Bayes.

On Jul 22, 2009, at 7:16 AM, Grant Ingersoll wrote:

AFAICT, it is loading the Sum Feature Weights, stored in the Sigma_J directory under the model. For me, this file is 1.04 GB. The values in this file are loaded into a List of Doubles (which brings with it a whole lot of auto-boxing, too). It seems like that should fit in memory, especially since it is the first thing loaded, AFAICT. I haven't looked yet into the structure of the file itself.
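
For illustration, here's a rough sketch of the boxed-vs-primitive difference I mean. This is not the actual Mahout loading code, and it assumes (hypothetically) that the sigma_j values arrive as a flat run of doubles; the real on-disk layout is more involved:

    import java.io.DataInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;

    public class SigmaJLoadSketch {

      // With a List<Double>, every value gets boxed: object header plus the
      // 8-byte double plus a slot in the list's Object[], so roughly 24-32
      // bytes per entry. A primitive double[] holds the same data at 8 bytes
      // per entry.
      static double[] load(String path, int count) throws IOException {
        double[] sigmaJ = new double[count];
        DataInputStream in = new DataInputStream(new FileInputStream(path));
        try {
          for (int i = 0; i < count; i++) {
            sigmaJ[i] = in.readDouble();
          }
        } finally {
          in.close();
        }
        return sigmaJ;
      }
    }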

I guess I will have to dig deeper. This code has changed a lot from when I first wrote it as a very simple naive Bayes model to one that now appears to be weighted by TF-IDF, normalization, etc., and I need to understand it better.

On Jul 22, 2009, at 12:26 AM, Ted Dunning wrote:

This is kind of surprising. It would seem that this model shouldn't have more than a few doubles per unique term, and there should be fewer than half a million terms. Even with pretty evil data structures, this really shouldn't be more than a few hundred megs for the model alone.
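
To make that concrete (my own back-of-envelope, not a measurement):

    500,000 terms x 3 doubles/term x 8 bytes/double  ~= 12 MB of raw values
    even at ~40 bytes per value (boxed Double + references) ~= 60 MB

which is why a few hundred megs should be a very comfortable ceiling for the model proper.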

Sparsity *is* a virtue with these models and I always try to eliminate terms that might as well have zero value, but that doesn't sound like the root
problem here.
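
By eliminating terms, I mean nothing fancier than dropping anything whose weight is effectively zero before the model is written out. A sketch (generic, not from the Mahout code, and the epsilon threshold is made up):

    import java.util.HashMap;
    import java.util.Map;

    public class PruneSketch {
      // Keep only terms whose weight is meaningfully different from zero.
      static Map<String, Double> prune(Map<String, Double> weights, double epsilon) {
        Map<String, Double> kept = new HashMap<String, Double>();
        for (Map.Entry<String, Double> e : weights.entrySet()) {
          if (Math.abs(e.getValue()) > epsilon) {
            kept.put(e.getKey(), e.getValue());
          }
        }
        return kept;
      }
    }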

Regarding strings or Writables, strings have the wonderful characteristic that they cache their hashed value. This means that hash maps are nearly as fast as arrays, because you wind up indexing to nearly the right place and then do one or a few integer compares to find the right value. Custom data types rarely do this and thus wind up slow.
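
To illustrate the caching point (a generic sketch, not anything from the Mahout source): String computes its hash once and stores it in a field, so repeated map lookups skip the recomputation, and a custom key type can copy the same trick:

    import java.util.Arrays;

    public final class TermKey {
      private final byte[] bytes;
      private int hash;  // 0 = not yet computed -- the same trick String uses

      public TermKey(byte[] bytes) {
        this.bytes = bytes;
      }

      @Override
      public int hashCode() {
        int h = hash;
        if (h == 0) {
          for (byte b : bytes) {
            h = 31 * h + b;
          }
          hash = h;  // cached; later lookups pay only an int read
        }
        return h;
      }

      @Override
      public boolean equals(Object o) {
        return o instanceof TermKey && Arrays.equals(bytes, ((TermKey) o).bytes);
      }
    }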

On Tue, Jul 21, 2009 at 7:41 PM, Grant Ingersoll <[email protected]> wrote:

I trained on a couple of categories (history and science) on quite a few docs, but now the model is so big, I can't load it, even with almost 3 GB of
memory.




--
Ted Dunning, CTO
DeepDyve

