In the case of "char_wb", it does sound like a custom tokenizer should be called if one is given. That would require a different implementation than the current one, however. You might want to file an issue.
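To make the desired behavior concrete (tokenize first, then take character n-grams inside each token), here is a minimal sketch of a callable that could be passed as `analyzer`. It is not scikit-learn's own char_wb code. `toy_lemmatize` and `char_wb_after_tokenizing` are hypothetical names made up for illustration, and the padding with one space on each side only roughly imitates what char_wb does. The regex is the documented default `token_pattern` of the word analyzer.

```python
import re

# Documented default token_pattern of scikit-learn's word analyzer:
# tokens of two or more word characters, punctuation acts as a separator.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def toy_lemmatize(token):
    """Hypothetical stand-in for a real lemmatizer: strips a trailing 's'."""
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token


def char_wb_after_tokenizing(doc, n=3):
    """Lowercase, tokenize, normalize each token, then emit character
    n-grams that never cross a token boundary (an assumption-laden
    approximation of char_wb, not the library implementation)."""
    ngrams = []
    for token in TOKEN_PATTERN.findall(doc.lower()):
        padded = " " + toy_lemmatize(token) + " "
        if len(padded) < n:
            # Token too short for a full n-gram: keep it once, whole.
            ngrams.append(padded)
            continue
        for start in range(len(padded) - n + 1):
            ngrams.append(padded[start:start + n])
    return ngrams


print(char_wb_after_tokenizing("Cats sat"))
# -> [' ca', 'cat', 'at ', ' sa', 'sat', 'at ']
```

Since `analyzer` accepts a callable, something like `CountVectorizer(analyzer=char_wb_after_tokenizing)` should use this function in place of the built-in analyzers. In that case the vectorizer's own preprocessing, tokenization and n-gram options no longer apply, so everything has to happen inside the callable.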

Sebastian's suggestion works, but note that scikit-learn's default tokenization is not the same as `text.split()`, in particular in the context of non-alphanumeric characters and single-letter tokens. (However, char_wb does not use the same tokenization as the default word analyzer.)

For advanced applications, however, it's usually easiest to just write a custom end-to-end *analyzer*. Inside it, you can reuse the default preprocessor/tokenizer code if you want.

HTH,
Vlad

On Mon, Dec 7, 2015 at 5:00 PM, Andreas Mueller <t3k...@gmail.com> wrote:
> Hi.
> I would say what you are doing with lemmatization is not tokenization but preprocessing. You are not creating tokens, right? The tokens are the char n-grams.
> So what is the problem in using the preprocessing option?
>
> I'm not super familiar with the NLP lingo, though, so I might be missing something.
>
> Andy
>
> On 11/30/2015 04:45 PM, Philip Tully wrote:
>
> Hi all,
>
> In the documentation (http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) it is written that when a callable tokenizer is passed into (Count/TfIdf)Vectorizer, then this "Only applies if analyzer == 'word'" and I can confirm this in the code at https://github.com/scikit-learn/scikit-learn/blob/c957249/sklearn/feature_extraction/text.py#L210
>
> But, why is this so? If I want to, for example, perform lemmatization or some other custom tokenization inside a callable Tokenizer, then pass the 'char' or 'char_wb' option to the analyzer because I want to do character grams after that, would this Tokenizer not be called then? Is best practice to migrate these things into the preprocessor= callable param?
> Or am I misunderstanding the documentation?
>
> thanks for your help,
> Philip

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general