The other thing is, I don't think Sigma_J is even used for Bayes, only for Complementary Bayes.
On Jul 22, 2009, at 7:16 AM, Grant Ingersoll wrote:
AFAICT, it is loading the Sum Feature Weights, stored in the Sigma_J directory under the model. For me, this file is 1.04 GB. The values in this file are loaded into a List of Doubles (which brings with it a whole lot of auto-boxing, too). It seems like that should fit in memory, especially since it is the first thing loaded, AFAICT. I have not looked yet into the structure of the file itself.

I guess I will have to dig deeper; this code has changed a lot from when I first wrote it as a very simple naive Bayes model to one that now appears to be weighted by TF-IDF, normalization, etc., and I need to understand it better.
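For illustration, here is a minimal sketch of why that List of Doubles hurts: every value gets boxed into a java.lang.Double object with its own header, and the list holds a reference to each one, versus a plain double[] that costs 8 bytes per value. The file format (one weight per line) and the class/method names below are assumptions for the sketch, not the actual Mahout loading code.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class SigmaJLoading {

      // Boxed version: every value becomes a java.lang.Double object, and the
      // ArrayList holds a reference to each, so N weights cost far more than
      // N * 8 bytes on the heap.
      static List<Double> loadBoxed(String file) throws IOException {
        List<Double> weights = new ArrayList<Double>();
        BufferedReader in = new BufferedReader(new FileReader(file));
        String line;
        while ((line = in.readLine()) != null) {
          weights.add(Double.parseDouble(line)); // auto-boxing happens here
        }
        in.close();
        return weights;
      }

      // Primitive version: 8 bytes per value, no per-element objects.
      static double[] loadPrimitive(String file, int termCount) throws IOException {
        double[] weights = new double[termCount];
        BufferedReader in = new BufferedReader(new FileReader(file));
        String line;
        int i = 0;
        while ((line = in.readLine()) != null && i < termCount) {
          weights[i++] = Double.parseDouble(line);
        }
        in.close();
        return weights;
      }
    }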
On Jul 22, 2009, at 12:26 AM, Ted Dunning wrote:
This is kind of surprising. It would seem that this model shouldn't have more than a few doubles per unique term, and there should be < half a million terms. Even with pretty evil data structures, this really shouldn't be more than a few hundred megs for the model alone.

Sparsity *is* a virtue with these models and I always try to eliminate terms that might as well have zero value, but that doesn't sound like the root problem here.
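For what it's worth, a back-of-envelope sketch of that estimate, using the numbers assumed in this thread (at most half a million terms, a few doubles each) rather than anything measured:

    public class ModelSizeEstimate {
      public static void main(String[] args) {
        long terms = 500000L;      // assumed upper bound on unique terms
        long doublesPerTerm = 3L;  // "a few doubles per unique term"
        long bytesPerDouble = 8L;

        long rawBytes = terms * doublesPerTerm * bytesPerDouble;  // ~12 MB
        // Even a pessimistic ~100 bytes of per-term overhead (boxed Doubles,
        // HashMap entries, String keys) only adds ~50 MB on top of that.
        long pessimisticBytes = rawBytes + terms * 100L;

        System.out.printf("raw: %d MB, pessimistic: %d MB%n",
            rawBytes >> 20, pessimisticBytes >> 20);
      }
    }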
Regarding strings or Writables, strings have the wonderful characteristic that they cache their hashed value. This means that hash maps are nearly as fast as arrays, because you wind up indexing to nearly the right place and then do one or a few integer compares to find the right value. Custom data types rarely do this and thus wind up slow.
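A minimal sketch of that hash-caching trick, in the same spirit as String's own implementation; the TermKey class below is purely illustrative and not part of Mahout:

    public final class TermKey {
      private final String term;
      private int hash; // 0 means "not computed yet", the same convention String uses

      public TermKey(String term) {
        this.term = term;
      }

      @Override
      public int hashCode() {
        int h = hash;
        if (h == 0) {
          h = term.hashCode(); // computed once, then cached for later lookups
          hash = h;
        }
        return h;
      }

      @Override
      public boolean equals(Object o) {
        return o instanceof TermKey && term.equals(((TermKey) o).term);
      }
    }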
On Tue, Jul 21, 2009 at 7:41 PM, Grant Ingersoll <[email protected]> wrote:
I trained on a couple of categories (history and science) on quite a few docs, but now the model is so big, I can't load it, even with almost 3 GB of memory.
--
Ted Dunning, CTO
DeepDyve