This is weird. Are you sure it is not the other way around?
The min_df default was changed from 2 to 1 afaik, which should give you
a larger vocabulary in the git version, not a smaller one.
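
To illustrate why a lower min_df should give more features, here is a toy sketch in plain Python. It is not scikit-learn's actual code: the helper `vocabulary_size` and the sample documents are made up for illustration, and it assumes min_df is an integer count meaning "keep terms that occur in at least that many documents".

```python
from collections import Counter


def vocabulary_size(docs, min_df=1):
    """Count distinct tokens that occur in at least `min_df` documents.

    Hypothetical helper for illustration only; it splits on whitespace and
    ignores n-grams, stop words and max_df.
    """
    doc_freq = Counter()
    for doc in docs:
        # set() so that each token is counted once per document
        doc_freq.update(set(doc.lower().split()))
    return sum(1 for count in doc_freq.values() if count >= min_df)


docs = [
    "the cat sat",
    "the dog sat",
    "a bird flew",
]

size_min_df_1 = vocabulary_size(docs, min_df=1)
size_min_df_2 = vocabulary_size(docs, min_df=2)
print(size_min_df_1, size_min_df_2)
```

With min_df=1 every token is kept, while min_df=2 drops every token seen in only one document, so the second number can never be larger than the first.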


On 03/14/2013 04:11 AM, Ark wrote:
> The vectorized input with the same training data set differs between versions
> 0.13.1 and 0.14-git.
>
> For:
> vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 2),
>               smooth_idf=True, sublinear_tf=True, max_df=0.5,
>               token_pattern=ur'\b(?!\d)\w\w+\b')
>
> On fit_transform the shape of the input data
> - with version 0.13.1 is (12440, 1270712)
> - with version 0.14-git is (12440, 484762)
>
> I do not change the code and run the same thing on two different machines in
> parallel. Apart from the number of features, the size of the classifier goes
> from 8.4 to 26G, but I guess that is due to the number of features. Does this
> seem correct?
>
>
>
> ------------------------------------------------------------------------------
> Everyone hates slow websites. So do we.
> Make your web apps faster with AppDynamics
> Download AppDynamics Lite for free today:
> http://p.sf.net/sfu/appdyn_d2d_mar
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


