The other thing is, I don't think Sigma_J is even used for Bayes, only for Complementary Bayes.
On Jul 22, 2009, at 7:16 AM, Grant Ingersoll wrote:
AFAICT, it is loading the Sum Feature Weights, stored in the Sigma_J directory under the model. For me, this file is 1.04 GB. The values in this file are loaded into a List of Doubles (which brings with it a whole lot of auto-boxing, too). It seems like that should fit in memory, especially since it is the first thing loaded, AFAICT. I have not looked yet into the structure of the file itself.

I guess I will have to dig deeper; this code has changed a lot from when I first wrote it as a very simple naive Bayes model to one that now appears to be weighted by TF-IDF, normalization, etc., and I need to understand it better.
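For illustration, here is a minimal sketch of why that List of Doubles hurts: every value gets boxed into a java.lang.Double object with its own header, and the list holds a reference to each one, versus a plain double[] that costs 8 bytes per value. The file format (one weight per line) and the class/method names below are assumptions for the sketch, not the actual Mahout loading code.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class SigmaJLoading {

      // Boxed version: every value becomes a java.lang.Double object, and the
      // ArrayList holds a reference to each, so N weights cost far more than
      // N * 8 bytes on the heap.
      static List<Double> loadBoxed(String file) throws IOException {
        List<Double> weights = new ArrayList<Double>();
        BufferedReader in = new BufferedReader(new FileReader(file));
        String line;
        while ((line = in.readLine()) != null) {
          weights.add(Double.parseDouble(line)); // auto-boxing happens here
        }
        in.close();
        return weights;
      }

      // Primitive version: 8 bytes per value, no per-element objects.
      static double[] loadPrimitive(String file, int termCount) throws IOException {
        double[] weights = new double[termCount];
        BufferedReader in = new BufferedReader(new FileReader(file));
        String line;
        int i = 0;
        while ((line = in.readLine()) != null && i < termCount) {
          weights[i++] = Double.parseDouble(line);
        }
        in.close();
        return weights;
      }
    }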
On Jul 22, 2009, at 12:26 AM, Ted Dunning wrote:
This is kind of surprising. It would seem that this model shouldn't have more than a few doubles per unique term, and there should be < half a million terms. Even with pretty evil data structures, this really shouldn't be more than a few hundred megs for the model alone.

Sparsity *is* a virtue with these models and I always try to eliminate terms that might as well have zero value, but that doesn't sound like the root problem here.
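For what it's worth, a back-of-envelope sketch of that estimate, using the numbers assumed in this thread (at most half a million terms, a few doubles each) rather than anything measured:

    public class ModelSizeEstimate {
      public static void main(String[] args) {
        long terms = 500000L;      // assumed upper bound on unique terms
        long doublesPerTerm = 3L;  // "a few doubles per unique term"
        long bytesPerDouble = 8L;

        long rawBytes = terms * doublesPerTerm * bytesPerDouble;  // ~12 MB
        // Even a pessimistic ~100 bytes of per-term overhead (boxed Doubles,
        // HashMap entries, String keys) only adds ~50 MB on top of that.
        long pessimisticBytes = rawBytes + terms * 100L;

        System.out.printf("raw: %d MB, pessimistic: %d MB%n",
            rawBytes >> 20, pessimisticBytes >> 20);
      }
    }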
Regarding strings or Writables, strings have the wonderful characteristic that they cache their hashed value. This means that hash maps are nearly as fast as arrays, because you wind up indexing to nearly the right place and then do one or a few integer compares to find the right value. Custom data types rarely do this and thus wind up slow.
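A minimal sketch of that hash-caching trick, in the same spirit as String's own implementation; the TermKey class below is purely illustrative and not part of Mahout:

    public final class TermKey {
      private final String term;
      private int hash; // 0 means "not computed yet", the same convention String uses

      public TermKey(String term) {
        this.term = term;
      }

      @Override
      public int hashCode() {
        int h = hash;
        if (h == 0) {
          h = term.hashCode(); // computed once, then cached for later lookups
          hash = h;
        }
        return h;
      }

      @Override
      public boolean equals(Object o) {
        return o instanceof TermKey && term.equals(((TermKey) o).term);
      }
    }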
On Tue, Jul 21, 2009 at 7:41 PM, Grant Ingersoll <[email protected]> wrote:
I trained on a couple of categories (history and science) on quite a few docs, but now the model is so big, I can't load it, even with almost 3 GB of memory.
--
Ted Dunning, CTO
DeepDyve