2013/10/2 Minkoo <[email protected]>:
> I have a question on using HashingVectorizer with TFxIDF. Currently, I'm
> trying to build a model to predict classes for a large set of documents.
>
> On the other hand, TfidfVectorizer does not support processing documents in
> batches. It needs to load the entire feature matrix into memory.

That's because tf-idf needs two passes over the dataset, while
HashingVectorizer is intended as a memoryless, single-pass method.
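
To make the difference concrete, here is a minimal sketch (both classes
live in sklearn.feature_extraction.text; the toy corpus is just a
placeholder):

from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

docs = ["the quick brown fox", "jumps over the lazy dog"]  # placeholder corpus

# HashingVectorizer is stateless: transform() needs no prior fit, so
# each batch of documents can be vectorized independently, in one pass.
X_hashed = HashingVectorizer().transform(docs)

# TfidfVectorizer must first scan the whole corpus to count document
# frequencies (the IDF part) before it can weight anything: two passes.
X_tfidf = TfidfVectorizer().fit_transform(docs)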

> But I couldn't find a good way to use the IDF table when creating feature
> vectors in HashingVectorizer. 'normalize' seems to be the point at which to
> extend HashingVectorizer to use the IDF table, but it's currently tied to a
> function named 'normalize'.

Normalization has little to do with tf-idf: it just means that the
document vectors are scaled to unit length so that cosine similarities
work and learners don't get overly extreme values as input (note that
cosine similarity and tf-idf are orthogonal concepts, even though IR
textbooks commonly treat them as a pair). The way to combine
HashingVectorizer and tf-idf is:

from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

hashing = HashingVectorizer(non_negative=True, norm=None)
tfidf = TfidfTransformer()
hashing_tfidf = Pipeline([("hashing", hashing), ("tfidf", tfidf)])
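
Note that TfidfTransformer still has to learn the IDF weights, so the
combined pipeline needs a fit pass over the data; what the hashing step
buys you is that no vocabulary dictionary has to be held in memory.
Usage is then the standard estimator API (docs is a placeholder for
your document collection):

X = hashing_tfidf.fit_transform(docs)  # fit learns IDF, transform applies it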
