It is unusual to have all that many training examples for document classifiers, although special cases exist where you have hundreds of thousands of them. With complementary naive Bayes, you effectively have a training set the size of your (negative) corpus (I think).
But, again, model size for Naive Bayesian models should be proportional to the number of terms modeled. Even with lots of data, that shouldn't be all *that* many. You should also be able to trim out hapax legomena (terms that occur only once) to moderate the model size (rough sketch in the P.S. below).

On Tue, Jul 21, 2009 at 7:41 PM, Grant Ingersoll <[email protected]> wrote:

> Also, what is generally the size of training sets that people use for
> something like Naive Bayes (or complementary)? Or, do I suck it up and just
> use more memory?

-- 
Ted Dunning, CTO
DeepDyve
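
P.S. A rough sketch of the hapax trimming I have in mind (plain Java, hypothetical class and method names, not Mahout's API): count term occurrences over the corpus, then drop anything seen fewer than some minimum number of times before the model is built.

import java.util.HashMap;
import java.util.Map;

public class HapaxTrimmer {

  // Drop terms that occur fewer than minCount times.
  // minCount = 2 removes hapax legomena; raise it to shrink the model further.
  public static Map<String, Long> trim(Map<String, Long> termCounts, long minCount) {
    Map<String, Long> kept = new HashMap<>();
    for (Map.Entry<String, Long> e : termCounts.entrySet()) {
      if (e.getValue() >= minCount) {
        kept.put(e.getKey(), e.getValue());
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    Map<String, Long> counts = new HashMap<>();
    counts.put("the", 1042L);
    counts.put("bayes", 57L);
    counts.put("zygomorphic", 1L);  // hapax; gets dropped
    System.out.println(trim(counts, 2).keySet());  // prints something like [bayes, the]
  }
}

How much this saves depends on the corpus, but hapax typically account for a large share of the distinct vocabulary while contributing almost nothing to the total counts, so the term table (and hence the model) shrinks considerably.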
