The two configurations differ in min_df: it is min_df=1 in the first
(0.14-git) and min_df=2 in the second (0.13).
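
To show why that one setting could account for a large symmetric difference, here is a minimal pure-Python sketch of document-frequency filtering. It is not scikit-learn's implementation: build_vocabulary and the toy documents are made up for illustration, and it only handles lowercase unigrams (no stop words, no bigrams, no max_df). The token pattern is the one from the quoted repr.

```python
import re
from collections import Counter

# Same token pattern as in the quoted TfidfVectorizer repr.
TOKEN = re.compile(r"\b(?!\d)\w\w+\b")


def build_vocabulary(docs, min_df=1):
    """Hypothetical helper: lowercase unigrams seen in at least min_df documents."""
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document (document frequency, not term frequency).
        doc_freq.update(set(TOKEN.findall(doc.lower())))
    return {term for term, count in doc_freq.items() if count >= min_df}


# Toy corpus, invented for this example.
docs = [
    "the quick brown fox",
    "the lazy brown dog",
    "a quick red fox",
]

vocab_min1 = build_vocabulary(docs, min_df=1)
vocab_min2 = build_vocabulary(docs, min_df=2)

# Terms present in exactly one of the two vocabularies.
only_in_one = vocab_min1 ^ vocab_min2
print(sorted(only_in_one))  # ['dog', 'lazy', 'red']
```

In this sketch the symmetric difference is exactly the set of terms that occur in a single document, since min_df=2 drops them and min_df=1 keeps them. With ngram_range=(1, 2) on a real corpus, most bigrams tend to be that rare, so a difference in the tens of thousands would not be surprising.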

On Thu, Mar 14, 2013 at 7:19 PM, Ark <[email protected]> wrote:
>
>>
>> This is unexpected. Can you inspect the vocabulary_ on both
>> vectorizers? Try computing their set.intersection, set.difference,
>> set.symmetric_difference (all Python builtins).
>>
>
> In [17]: len(set.symmetric_difference(set(vect13.vocabulary_.keys()),
> set(vect14.vocabulary_.keys())))
> Out[17]: 42529
>
> I skimmed the list of keys, and the values look like what should be in the
> documents, so I am not sure why exactly I did not have them earlier; I will
> keep analyzing in case I see any discrepancy.
>
> For clarity I am adding the complete vectorizer:
>
> with scikit 0.14-git:
> Extracting features from dataset using TfidfVectorizer(analyzer=word,
> binary=False, charset=utf-8,
>         charset_error=strict, dtype=<type 'numpy.int64'>, input=content,
>         lowercase=True, max_df=0.5, max_features=None, min_df=1,
>         ngram_range=(1, 2), norm=l2, preprocessor=None, smooth_idf=True,
>         stop_words=english, strip_accents=None, sublinear_tf=True,
>         token_pattern=\b(?!\d)\w\w+\b, tokenizer=None, use_idf=False,
>         vocabulary=None).
>
> with scikit 0.13:
> Extracting features from dataset using TfidfVectorizer(analyzer=word,
> binary=False, charset=utf-8,
>         charset_error=strict, dtype=<type 'long'>, input=content,
>         lowercase=True, max_df=0.5, max_features=None, max_n=None,
>         min_df=2, min_n=None, ngram_range=(1, 2), norm=l2,
>         preprocessor=None, smooth_idf=True, stop_words=english,
>         strip_accents=None, sublinear_tf=True,
>         token_pattern=\b(?!\d)\w\w+\b, tokenizer=None, use_idf=False,
>         vocabulary=None)
>
>
> ------------------------------------------------------------------------------
> Everyone hates slow websites. So do we.
> Make your web apps faster with AppDynamics
> Download AppDynamics Lite for free today:
> http://p.sf.net/sfu/appdyn_d2d_mar
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

