> 
> This is unexpected. Can you inspect the vocabulary_ on both
> vectorizers? Try computing their set.intersection, set.difference,
> set.symmetric_difference (all Python builtins).
> 

In [17]: len(set.symmetric_difference(set(vect13.vocabulary_.keys()), 
set(vect14.vocabulary_.keys())))
Out[17]: 42529
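
The symmetric difference alone does not say which side the extra terms come from. A minimal sketch of a fuller comparison follows; `compare_vocabularies` and the toy dictionaries are hypothetical stand-ins for `vect13.vocabulary_` and `vect14.vocabulary_`, not part of scikit-learn:

```python
def compare_vocabularies(vocab_a, vocab_b):
    """Summarise how two vocabulary mappings (term -> column index) differ."""
    terms_a, terms_b = set(vocab_a), set(vocab_b)
    return {
        "common": terms_a & terms_b,      # terms present in both
        "only_in_a": terms_a - terms_b,   # terms that only the first has
        "only_in_b": terms_b - terms_a,   # terms that only the second has
    }


# Toy data for illustration only; in the session above these would be
# vect13.vocabulary_ and vect14.vocabulary_.
vocab_13 = {"apple": 0, "banana": 1}
vocab_14 = {"apple": 0, "banana": 1, "apple banana": 2, "cherry": 3}

report = compare_vocabularies(vocab_13, vocab_14)
for name, terms in sorted(report.items()):
    # Print the size of each group and a small sorted sample of its terms.
    print(name, len(terms), sorted(terms)[:10])
```

If `only_in_a` is empty, every differing term was added by the second vectorizer rather than dropped by it.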

I skimmed over the list of keys, and the entries seem to be what should be in
the documents, so I am not sure why exactly I did not have them earlier. I will
continue to analyze and report back if I see any discrepancy.

For clarity I am adding the complete vectorizer:

with scikit 0.14-git:
Extracting features from dataset using TfidfVectorizer(analyzer=word, 
binary=False, charset=utf-8,
        charset_error=strict, dtype=<type 'numpy.int64'>, input=content,
        lowercase=True, max_df=0.5, max_features=None, min_df=1,
        ngram_range=(1, 2), norm=l2, preprocessor=None, smooth_idf=True,
        stop_words=english, strip_accents=None, sublinear_tf=True,
        token_pattern=\b(?!\d)\w\w+\b, tokenizer=None, use_idf=False,
        vocabulary=None).

with scikit 0.13:
Extracting features from dataset using TfidfVectorizer(analyzer=word, 
binary=False, charset=utf-8,
        charset_error=strict, dtype=<type 'long'>, input=content,
        lowercase=True, max_df=0.5, max_features=None, max_n=None,
        min_df=2, min_n=None, ngram_range=(1, 2), norm=l2,
        preprocessor=None, smooth_idf=True, stop_words=english,
        strip_accents=None, sublinear_tf=True,
        token_pattern=\b(?!\d)\w\w+\b, tokenizer=None, use_idf=False,
        vocabulary=None)
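
The two dumps are easier to compare mechanically than by eye. The sketch below transcribes the printed parameters by hand into plain dictionaries (the dtype values are kept as strings) and lists the settings that differ; `diff_params` is a hypothetical helper, not a scikit-learn function:

```python
# Parameters copied from the two dumps above.
shared = dict(
    analyzer="word", binary=False, charset="utf-8", charset_error="strict",
    input="content", lowercase=True, max_df=0.5, max_features=None,
    ngram_range=(1, 2), norm="l2", preprocessor=None, smooth_idf=True,
    stop_words="english", strip_accents=None, sublinear_tf=True,
    token_pattern=r"\b(?!\d)\w\w+\b", tokenizer=None, use_idf=False,
    vocabulary=None,
)
params_014 = dict(shared, dtype="numpy.int64", min_df=1)
params_013 = dict(shared, dtype="long", min_df=2, max_n=None, min_n=None)


def diff_params(old, new, absent="<absent>"):
    """Return {name: (old_value, new_value)} for every setting that differs."""
    changed = {}
    for key in sorted(set(old) | set(new)):
        a, b = old.get(key, absent), new.get(key, absent)
        if a != b:
            changed[key] = (a, b)
    return changed


diff = diff_params(params_013, params_014)
for name, (old_value, new_value) in diff.items():
    print(f"{name}: 0.13={old_value!r}  0.14-git={new_value!r}")
```

On these dumps the differing settings are dtype, min_df, and the max_n/min_n arguments that appear only in the 0.13 output. Of these, min_df (2 in 0.13, 1 in 0.14-git) is the one that would be expected to change the vocabulary, since min_df=1 keeps terms that occur in only a single document.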


_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
