Hi.
I would say what you are doing with lemmatization is not tokenization but preprocessing. You are not creating tokens, right? The tokens are the char n-grams.
So what is the problem in using the preprocessing option?

I'm not super familiar with the NLP lingo, though, so I might be missing something.
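
To make the suggestion concrete, here is a minimal sketch of lemmatizing in the preprocessor= callable and then building character n-grams. The TOY_LEMMAS table and the lemmatize_text helper are made up for illustration; a real setup would call an actual lemmatizer there. Note that passing your own preprocessor replaces the built-in lowercasing and accent stripping, so the helper lowercases by itself.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy lemma table; a real pipeline would call a lemmatizer here.
TOY_LEMMAS = {"cats": "cat", "running": "run", "ran": "run"}


def lemmatize_text(doc):
    """Lowercase the document and map each word to its toy lemma.

    Returns a string, because the preprocessor must hand text (not tokens)
    to the character n-gram analyzer.
    """
    words = re.findall(r"\w+", doc.lower())
    return " ".join(TOY_LEMMAS.get(word, word) for word in words)


vectorizer = CountVectorizer(
    preprocessor=lemmatize_text,  # runs before the n-grams are extracted
    analyzer="char_wb",           # character n-grams inside word boundaries
    ngram_range=(3, 3),
)

# Both documents become "cat run" after preprocessing.
X = vectorizer.fit_transform(["Cats running", "cat ran"])
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
```

The n-grams are built from the lemmatized text, so surface forms such as "ran" never reach the vocabulary and the two documents get identical rows.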

Andy


On 11/30/2015 04:45 PM, Philip Tully wrote:
Hi all,

The documentation (http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) says that a callable tokenizer passed to (Count/Tfidf)Vectorizer "Only applies if analyzer=='word'", and I can confirm this in the code at https://github.com/scikit-learn/scikit-learn/blob/c957249/sklearn/feature_extraction/text.py#L210

But why is this so? Suppose, for example, that I want to perform lemmatization or some other custom tokenization inside a callable tokenizer, and then pass the 'char' or 'char_wb' option to the analyzer because I want character n-grams after that. Would the tokenizer not be called in that case? Is the best practice to move these steps into the preprocessor= callable param? Or am I misunderstanding the documentation?
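
A quick way to check the behaviour described above is to pass a tokenizer that records its calls. This is only a sketch: recording_tokenizer and the calls list are made up for the check, and the warning filter is there because newer scikit-learn releases may warn about parameters that end up unused.

```python
import warnings

from sklearn.feature_extraction.text import CountVectorizer

calls = []  # every document handed to the tokenizer is appended here


def recording_tokenizer(doc):
    """Hypothetical tokenizer that logs each call, then splits on whitespace."""
    calls.append(doc)
    return doc.split()


with warnings.catch_warnings():
    # Newer releases may warn about unused parameters; silence that here.
    warnings.simplefilter("ignore")

    # Character analyzer: the tokenizer is passed in but should never run.
    char_vec = CountVectorizer(
        tokenizer=recording_tokenizer, analyzer="char", ngram_range=(2, 2)
    )
    char_vec.fit(["the cats"])
    calls_with_char = len(calls)

    # Word analyzer: the same tokenizer is now used for every document.
    word_vec = CountVectorizer(tokenizer=recording_tokenizer, analyzer="word")
    word_vec.fit(["the cats"])
    calls_with_word = len(calls) - calls_with_char

print("calls with analyzer='char':", calls_with_char)
print("calls with analyzer='word':", calls_with_word)
```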

thanks for your help,
Philip




_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
