[Scikit-learn-general] Analyzer and tokenizer in (Count/TfIdf)Vectorizer

Philip Tully Mon, 30 Nov 2015 13:46:53 -0800

Hi all,

The CountVectorizer documentation
(http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer)
says that a callable tokenizer passed into (Count/TfIdf)Vectorizer
"Only applies if analyzer == 'word'", and I can confirm this in the code at
https://github.com/scikit-learn/scikit-learn/blob/c957249/sklearn/feature_extraction/text.py#L210
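
To make the behaviour concrete, here is a minimal sketch that records whether
a tokenizer is invoked under each analyzer setting. The recording_tokenizer
helper and the sample text are hypothetical, written only for illustration;
recent scikit-learn releases may also emit a UserWarning about the unused
tokenizer.

```python
from sklearn.feature_extraction.text import CountVectorizer

calls = []  # every document the tokenizer receives is appended here


def recording_tokenizer(doc):
    """Hypothetical tokenizer: split on whitespace and log the call."""
    calls.append(doc)
    return doc.split()


tokenizer_was_called = {}
for analyzer in ("word", "char", "char_wb"):
    calls.clear()
    vec = CountVectorizer(analyzer=analyzer, tokenizer=recording_tokenizer)
    vec.fit(["cats running"])
    # True only if the vectorizer actually invoked the tokenizer
    tokenizer_was_called[analyzer] = len(calls) > 0

print(tokenizer_was_called)
```

With the tokenizer ignored for 'char' and 'char_wb', only the 'word' entry
should come out True.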


But why is this so? Suppose I want to perform lemmatization or some other
custom tokenization inside a callable tokenizer and then pass 'char' or
'char_wb' as the analyzer, because I want character n-grams computed on the
result. Would the tokenizer not be called in that case? Is the best practice
to move this logic into the preprocessor= callable parameter instead? Or am I
misunderstanding the documentation?
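
For reference, the preprocessor= route could be sketched as follows. This is
only one possible approach, not a confirmed best practice: TOY_LEMMAS and
lemmatizing_preprocessor are hypothetical stand-ins for a real lemmatizer, and
because a custom preprocessor replaces the default preprocessing step, the
sketch lowercases the text itself.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical lookup table standing in for a real lemmatizer.
TOY_LEMMAS = {"cats": "cat", "running": "run"}


def lemmatizing_preprocessor(doc):
    """Lowercase, lemmatize word by word, and re-join into one string."""
    tokens = doc.lower().split()
    return " ".join(TOY_LEMMAS.get(tok, tok) for tok in tokens)


# The preprocessor runs before the character analyzer, so the character
# trigrams are built from the lemmatized string "cat run".
vec = CountVectorizer(
    analyzer="char",
    ngram_range=(3, 3),
    preprocessor=lemmatizing_preprocessor,
)
vec.fit(["Cats running"])

features = sorted(vec.vocabulary_)
print(features)
```

The resulting vocabulary contains trigrams such as "cat" and "run" but none
taken from the unlemmatized forms, such as "ats" or "nin".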

Thanks for your help,
Philip


