The Wikipedia unigram dictionary is 381MB on disk. Bigram and trigram dictionaries will be far larger. So the Vectorizer could be a pass-through if each job reads vectors that were already generated in parallel, or it could convert on the fly if using the randomizer.
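To make the on-the-fly option concrete, here is a minimal sketch, assuming "the randomizer" means something like a hashing encoder that maps each n-gram straight to a vector index so no dictionary has to be stored. The function name `hashed_vector` and the parameters `num_features` and `max_ngram` are made up for illustration and are not from any existing API.

```python
import hashlib
from collections import defaultdict


def hashed_vector(tokens, num_features=2 ** 18, max_ngram=3):
    """Turn a token list into a sparse count vector without a dictionary.

    Hypothetical helper: each n-gram is hashed to an index in
    [0, num_features), so memory stays fixed however many distinct
    n-grams the corpus contains. Distinct n-grams may collide.
    """
    vec = defaultdict(float)
    for n in range(1, max_ngram + 1):
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            # md5 is used only as a stable hash, so every job maps the
            # same n-gram to the same index (unlike Python's built-in
            # hash(), which is salted per process).
            digest = hashlib.md5(ngram.encode("utf-8")).digest()
            index = int.from_bytes(digest[:8], "big") % num_features
            vec[index] += 1.0
    return dict(vec)


if __name__ == "__main__":
    print(hashed_vector("the quick brown fox".split(), num_features=1024))
```

The trade-off is that collisions merge some features, but the vector size is fixed up front and nothing like a 381MB dictionary needs to be shipped to each job.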
The reason I said models should be generic is that they can then be read across classifiers. For example, a classifier that does nearest-centroid matching, as in NB, could just as well use the output of K-Means as its model. Likewise, a margin trained using Pegasos can be used by any SVM classifier. That's all.
Robin
