The Wikipedia unigram dictionary is 381MB on disk, and the bigram and
trigram dictionaries will be far larger. So the Vectorizer could be a
pass-through if each job reads vectors that were already generated in
parallel, or it could convert on the fly if using the randomizer.
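
To make the two options concrete, here is a rough sketch in Python. It
assumes "the randomizer" means hashing each token straight to a vector index
so that no dictionary has to be loaded; the class names, the MD5-based hash,
and the dict-based sparse vector are all made up for illustration and are
not an existing API.

```python
import hashlib
from typing import Dict, Iterable

# Hypothetical sparse vector: feature index -> weight.
SparseVector = Dict[int, float]


def hashed_index(token: str, num_features: int) -> int:
    """Map a token to a stable index without consulting a dictionary."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_features


class PassThroughVectorizer:
    """Used when the vectors were already generated by a separate parallel job."""

    def vectorize(self, record: SparseVector) -> SparseVector:
        # Nothing to convert: the input is already a vector.
        return record


class HashingVectorizer:
    """Converts tokens on the fly; no unigram/bigram dictionary is needed."""

    def __init__(self, num_features: int = 1 << 20) -> None:
        self.num_features = num_features

    def vectorize(self, tokens: Iterable[str]) -> SparseVector:
        vector: SparseVector = {}
        for token in tokens:
            index = hashed_index(token, self.num_features)
            # Colliding tokens share an index; their counts simply add up.
            vector[index] = vector.get(index, 0.0) + 1.0
        return vector
```

Either class can sit behind the same vectorize() call, so a job does not
need to know which route produced its vectors. The hashed version trades
some collisions for not having to hold a multi-hundred-MB dictionary in
memory.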

The reason I said models should be generic is that they can then be read
across classifiers. For example, a classifier that does nearest-centroid
matching, as in NB or when using the output of K-Means, can use the same
model. Likewise, a margin trained using Pegasos can be used by any SVM
classifier. That's all.
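
A minimal sketch of what "generic" could look like, in Python. The model
classes and function names here are hypothetical, not an existing API: the
point is only that the classifier depends on the shape of the model (a set
of labelled centroids, or a weight vector) and not on which trainer wrote it.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class CentroidModel:
    """Hypothetical generic model: one centroid per label.

    Could be written by any trainer that produces labelled centroids,
    for example the output of K-Means.
    """
    centroids: Dict[str, List[float]]


@dataclass
class LinearModel:
    """Hypothetical generic model: a separating hyperplane.

    Could be written by Pegasos or any other linear SVM trainer.
    """
    weights: List[float]
    bias: float = 0.0


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify_nearest_centroid(model: CentroidModel, x: Sequence[float]) -> str:
    # Only the centroids matter here, not the algorithm that produced them.
    return min(model.centroids,
               key=lambda label: squared_distance(model.centroids[label], x))


def classify_linear(model: LinearModel, x: Sequence[float]) -> int:
    # Sign of the decision function w.x + b, returned as +1 or -1.
    score = sum(w * xi for w, xi in zip(model.weights, x)) + model.bias
    return 1 if score >= 0 else -1
```

With this split, swapping the trainer only changes who writes the model;
the classification code stays the same.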


Robin
