Hi.
I would say that what you are doing with lemmatization is not tokenization
but preprocessing. You are not creating tokens, right? The tokens are
the character n-grams.
So what is the problem with using the preprocessor option?
I'm not super familiar with the NLP lingo, though, so I might be missing
something.
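
To make the preprocessor suggestion concrete, here is a minimal sketch, not a definitive recipe. The LEMMAS table and the lemmatize_text helper are hypothetical stand-ins for a real lemmatizer; only CountVectorizer and its analyzer, ngram_range and preprocessor parameters come from scikit-learn. Note that passing a custom preprocessor replaces the built-in lowercasing, so the helper lowercases the text itself.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for a real lemmatizer: a tiny lookup table.
LEMMAS = {"cats": "cat", "running": "run"}


def lemmatize_text(doc):
    """Lowercase the document, lemmatize word by word, return one string."""
    # A custom preprocessor overrides the default lowercasing step,
    # so lowercasing is done explicitly here.
    return " ".join(LEMMAS.get(word, word) for word in doc.lower().split())


# Character 3-grams are built from the preprocessed (lemmatized) string.
vectorizer = CountVectorizer(
    analyzer="char",
    ngram_range=(3, 3),
    preprocessor=lemmatize_text,
)

analyze = vectorizer.build_analyzer()
grams = analyze("Cats running")
print(grams)
```

The preprocessor must return a string, since the character analyzer slides its n-gram window over whatever text it receives.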
Andy
On 11/30/2015 04:45 PM, Philip Tully wrote:
Hi all,
In the documentation
(http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer)
it is written that when a callable tokenizer is passed
into (Count/Tfidf)Vectorizer, this "Only applies if
analyzer=='word'", and I can confirm this in the code at
https://github.com/scikit-learn/scikit-learn/blob/c957249/sklearn/feature_extraction/text.py#L210
But why is this so? If I want to, for example, perform lemmatization
or some other custom tokenization inside a callable tokenizer, and then
pass the 'char' or 'char_wb' option to the analyzer because I want to
build character n-grams after that, would this tokenizer not be called?
Is it best practice to move these steps into the preprocessor=
callable param? Or am I misunderstanding the documentation?
Thanks for your help,
Philip
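
The behaviour asked about above can be checked with a small experiment. This is only a sketch: recording_tokenizer is a hypothetical helper that records each call, and the warnings filter is there because some scikit-learn releases may warn about parameters that end up unused.

```python
import warnings

from sklearn.feature_extraction.text import CountVectorizer

calls = []  # every document the tokenizer is asked to split


def recording_tokenizer(doc):
    """Hypothetical whitespace tokenizer that records when it is invoked."""
    calls.append(doc)
    return doc.split()


with warnings.catch_warnings():
    # Some releases warn when a parameter will not be used; silence that here.
    warnings.simplefilter("ignore")

    # analyzer='char': the tokenizer is passed in but should never run.
    char_analyzer = CountVectorizer(
        analyzer="char", ngram_range=(2, 2), tokenizer=recording_tokenizer
    ).build_analyzer()
    char_grams = char_analyzer("ab cd")
    calls_after_char = len(calls)

    # analyzer='word': the same tokenizer is used to produce the tokens.
    word_analyzer = CountVectorizer(
        analyzer="word", tokenizer=recording_tokenizer
    ).build_analyzer()
    word_tokens = word_analyzer("ab cd")

print(char_grams, calls_after_char)
print(word_tokens, len(calls))
```

If the tokenizer is skipped for the character analyzer, calls stays empty after the first analyzer runs and grows only once the word analyzer is used.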
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general