This is kind of surprising.  This model shouldn't need more than a few
doubles per unique term, and there should be fewer than half a million
terms.  Even with pretty evil data structures, that really shouldn't add up
to more than a few hundred megs for the model alone.
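
To put rough numbers on that, here is a back-of-envelope sketch; the term,
category, and overhead counts are assumptions, not measurements of your
actual model:

    // Back-of-envelope only; the counts below are assumptions, not measurements.
    public class ModelSizeEstimate {
      public static void main(String[] args) {
        long terms = 500_000L;       // fewer than half a million unique terms (assumed)
        long categories = 2;         // history and science
        long doublesPerTerm = 3;     // a few doubles per term per category (assumed)
        long rawBytes = terms * categories * doublesPerTerm * 8;

        // Allow ~100 bytes per entry for boxed keys/values, hash table slack,
        // and object headers -- a pretty evil data structure.
        long withOverhead = rawBytes + terms * categories * 100;

        System.out.printf("raw doubles:   ~%d MB%n", rawBytes / (1024 * 1024));
        System.out.printf("with overhead: ~%d MB%n", withOverhead / (1024 * 1024));
      }
    }

Even the pessimistic number comes out around a hundred megs or so, nowhere
near 3 GB.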

Sparsity *is* a virtue with these models and I always try to eliminate terms
that might as well have zero value, but that doesn't sound like the root
problem here.
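
By eliminating terms I just mean dropping weights that contribute
essentially nothing, something like this (the names and threshold are made
up for illustration, not Mahout API):

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of pruning near-zero term weights; illustrative only.
    public class PruneExample {
      static void prune(Map<String, Double> termWeights, double epsilon) {
        termWeights.values().removeIf(w -> Math.abs(w) < epsilon);
      }

      public static void main(String[] args) {
        Map<String, Double> termWeights = new HashMap<>();
        termWeights.put("pyramid", 2.7);
        termWeights.put("the", 1e-9);     // contributes essentially nothing
        prune(termWeights, 1e-6);
        System.out.println(termWeights);  // {pyramid=2.7}
      }
    }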

Regarding strings or Writables, strings have the wonderful characteristic
that they cache their hash value.  This means that hash maps are nearly as
fast as arrays because you wind up indexing to nearly the right place and
then do one (or a few) integer compares to find the right value.  Custom data
types rarely do this and thus wind up slow.
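
For what it's worth, here is a rough sketch of the same caching trick in a
custom key type (purely illustrative; this is not an existing Writable):

    // Cache the hash the same way java.lang.String does: compute it once,
    // reuse it on every subsequent lookup.
    public final class TermKey {
      private final byte[] bytes;
      private int hash;                 // 0 means "not computed yet", like String

      public TermKey(byte[] bytes) {
        this.bytes = bytes.clone();
      }

      @Override
      public int hashCode() {
        int h = hash;
        if (h == 0) {
          for (byte b : bytes) {
            h = 31 * h + b;
          }
          hash = h;                     // cache it; later calls skip the loop
        }
        return h;
      }

      @Override
      public boolean equals(Object other) {
        if (!(other instanceof TermKey)) {
          return false;
        }
        return java.util.Arrays.equals(bytes, ((TermKey) other).bytes);
      }
    }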

On Tue, Jul 21, 2009 at 7:41 PM, Grant Ingersoll <[email protected]> wrote:

>  I trained on a couple of categories (history and science) on quite a few
> docs, but now the model is so big, I can't load it, even with almost 3 GB of
> memory.

-- 
Ted Dunning, CTO
DeepDyve
