2012/7/18 Peter Prettenhofer <[email protected]>:
> 2012/7/18 Philipp Singer <[email protected]>:
>> Yes, I am currently experimenting with tf only, but the vocabulary is
>> still dependent on the corpus.
>
> I would fit the vectorizer on both datasets (such that the vocabulary
> covers the union) and then fit the IDF transformers on each dataset
> individually.
>
> Disclaimer: I hardly use sklearn's text utilities

You could determine the vocabulary first, then pass it to the
CountVectorizer or TfidfVectorizer constructor via the vocabulary
parameter.

Also, I have a PR for a hashing vectorizer that does not need a
vocabulary at https://github.com/scikit-learn/scikit-learn/pull/909.
It's not ready for merging yet (and I hardly have time to work on it),
but it does work.

-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
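
The two suggestions in this thread (a vocabulary that covers the union
of both datasets, and IDF weights fitted on each dataset individually)
can be sketched roughly as below. This is only an illustration in plain
Python, not sklearn's implementation: the whitespace tokenizer, the toy
corpora, the helper names (build_vocabulary, count_matrix, fit_idf,
tfidf) and the particular smoothed IDF formula are all assumptions made
for the example. With sklearn itself, the shared vocabulary would
instead be passed to the vectorizer's constructor as discussed above.

```python
import math
from collections import Counter


def tokenize(text):
    # Deliberately naive whitespace tokenizer (an assumption for this sketch).
    return text.lower().split()


def build_vocabulary(*corpora):
    # Union of the tokens of every corpus, mapped to fixed column indices.
    tokens = sorted({tok for corpus in corpora
                     for doc in corpus
                     for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def count_matrix(corpus, vocabulary):
    # Raw term counts; every row has len(vocabulary) columns,
    # regardless of which corpus the documents come from.
    rows = []
    for doc in corpus:
        row = [0] * len(vocabulary)
        for tok, n in Counter(tokenize(doc)).items():
            if tok in vocabulary:
                row[vocabulary[tok]] = n
        rows.append(row)
    return rows


def fit_idf(counts):
    # One common smoothed IDF variant, computed from a single corpus:
    # ln((1 + n_docs) / (1 + df)) + 1
    n_docs = len(counts)
    n_terms = len(counts[0]) if counts else 0
    idf = []
    for j in range(n_terms):
        df = sum(1 for row in counts if row[j] > 0)
        idf.append(math.log((1 + n_docs) / (1 + df)) + 1.0)
    return idf


def tfidf(counts, idf):
    # Scale each term count by the IDF weight of its column.
    return [[tf * w for tf, w in zip(row, idf)] for row in counts]


# Hypothetical toy datasets.
corpus_a = ["the cat sat", "the cat ran"]
corpus_b = ["the dog barked", "a dog ran"]

# Shared vocabulary, so both matrices have the same columns.
vocab = build_vocabulary(corpus_a, corpus_b)
counts_a = count_matrix(corpus_a, vocab)
counts_b = count_matrix(corpus_b, vocab)

# IDF weights fitted separately on each dataset.
idf_a = fit_idf(counts_a)
idf_b = fit_idf(counts_b)
X_a = tfidf(counts_a, idf_a)
X_b = tfidf(counts_b, idf_b)
```

Because the columns are fixed by the shared vocabulary, X_a and X_b
are directly comparable feature-wise, while a term such as "cat" still
gets a different weight in each dataset.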
