2013/3/14 Ark <[email protected]>:
> For:
> vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1,2),
>              smooth_idf=True, sublinear_tf=True, max_df=0.5,
>              token_pattern=ur'\b(?!\d)\w\w+\b')
>
> On fit_transform, the shape of the output matrix
> - with version 0.13.1 is (12440, 1270712)
> - with version 0.14-git is (12440, 484762)
>
> I did not change any code and ran the same script on two different
> machines in parallel. Apart from the number of features, the size of
> the classifier goes from 8.4 GB to 26 GB, but I guess that is due to
> the number of features. Does this seem correct?

This is unexpected. Can you inspect the vocabulary_ attribute on both
vectorizers? It is a dict mapping terms to column indices, so you can
compare the term sets with set.intersection, set.difference and
set.symmetric_difference (all methods of the built-in set type).
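
Something along these lines (a minimal sketch; vec_old and vec_new are
placeholder names for the two fitted vectorizers, e.g. after pickling
the vocabularies and loading both on one machine):

    # vocabulary_ maps term -> column index; iterating a dict yields keys
    vocab_old = set(vec_old.vocabulary_)   # fitted with 0.13.1
    vocab_new = set(vec_new.vocabulary_)   # fitted with 0.14-git

    print len(vocab_old & vocab_new)   # terms present in both
    print len(vocab_old - vocab_new)   # terms only in 0.13.1
    print len(vocab_old ^ vocab_new)   # terms in exactly one of the two

    # a peek at the terms that disappeared often reveals a pattern
    print sorted(vocab_old - vocab_new)[:20]

If the vanished terms share an obvious property (all rare terms, all
bigrams, etc.), that should point at what changed between the versions.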

-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
