I have been doing some work on classification (of Wikipedia) and am
having a hard time actually running the Test classifier. I trained on
a couple of categories (history and science) on quite a few docs, but
now the model is so big, I can't load it, even with almost 3 GB of
memory. I'm just wondering what people would recommend here. One
thought is that our code is really String/Text based. I also notice
we create the maps used to load the models with their default initial
capacities, which probably means we are resizing a lot while loading.
Should we stick with Strings, or would it be better to have some custom
Writables and keep track of the actual terms separately, the way the doc
clustering code does, and also track the size up front so we can avoid
resizing?
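To make that concrete, here is a rough sketch of the kind of thing I
have in mind: a separate term dictionary that maps each term String to
an int id (so the model itself only holds primitive ids), pre-sized
from a known vocabulary count so the map never rehashes while the model
loads. All the class and method names below are made up for
illustration; nothing like this exists in our code yet.

    import java.util.HashMap;
    import java.util.Map;

    public class TermDictionary {
      // term -> id lookup, kept separate from the model weights
      private final Map<String, Integer> termIds;
      // id -> term, so we can map results back to readable terms
      private final String[] terms;
      private int nextId = 0;

      public TermDictionary(int expectedTerms) {
        // pre-size so the map never rehashes while the model loads
        termIds = new HashMap<String, Integer>(expectedTerms * 4 / 3 + 1);
        terms = new String[expectedTerms];
      }

      public int idOf(String term) {
        Integer id = termIds.get(term);
        if (id == null) {
          id = nextId++;
          termIds.put(term, id);
          terms[id] = term;
        }
        return id;
      }

      public String termOf(int id) {
        return terms[id];
      }
    }

The model maps could then be keyed by int (or an IntWritable-style key)
instead of String, which should cut the per-entry overhead quite a bit.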
Also, what training set sizes do people generally use for something
like Naive Bayes (or Complementary Naive Bayes)? Or do I suck it up
and just use more memory?
Thoughts?
-Grant