I have been doing some work on classification (of Wikipedia) and am having a hard time actually running the Test classifier. I trained on a couple of categories (history and science) over quite a few docs, but now the model is so big that I can't load it, even with almost 3 GB of memory. I'm wondering what people would recommend here.

One thought is that our code is heavily String/Text based. I also notice we start with default capacities for the maps used to load the model, which probably means we are resizing them a lot. Should we stick with Strings, or would it be better to have some custom Writables and keep track of the actual terms separately (much like the doc clustering code does), as well as tracking the size so we can pre-size the maps and avoid the resizing?
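This isn't the actual Mahout code, just a minimal sketch of what I'm picturing: key the model maps by a compact int term id backed by a separate id-to-term dictionary (written out once), and pre-size the maps when the feature count is known up front. All the names here (TermIdWritable, newWeightMap, expectedTerms) are made up for illustration.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical sketch: a compact key for the model maps -- an int id into a
// separately stored term dictionary -- instead of the full String/Text term.
public class TermIdWritable implements WritableComparable<TermIdWritable> {

  private int termId; // index into the id -> term dictionary, written out once

  public TermIdWritable() {
  }

  public TermIdWritable(int termId) {
    this.termId = termId;
  }

  public int get() {
    return termId;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(termId);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    termId = in.readInt();
  }

  @Override
  public int compareTo(TermIdWritable other) {
    return termId < other.termId ? -1 : (termId == other.termId ? 0 : 1);
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof TermIdWritable && ((TermIdWritable) o).termId == termId;
  }

  @Override
  public int hashCode() {
    return termId;
  }

  // If the model metadata recorded the feature count per label, the loader
  // could pre-size its maps instead of growing from the default capacity
  // and rehashing over and over while loading millions of terms.
  public static Map<Integer, Double> newWeightMap(int expectedTerms) {
    return new HashMap<Integer, Double>((int) (expectedTerms / 0.75f) + 1);
  }
}

The idea being that only the int ids and weights would live in memory at test time, with the id -> term dictionary kept in its own file and consulted only when we need to map back to actual terms.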

Also, what size of training set do people generally use for something like Naive Bayes (or Complementary Naive Bayes)? Or do I just suck it up and use more memory?

Thoughts?

-Grant
