Hi,

I think for most implementations you need to tokenize your text into a list of 1-grams (single words) prior to stemming/lemmatization, but you can always call " ".join on the result if you'd like to get it back as a string.

>>> sentence = 'Stemming is funnier than a bummer says the sushi loving computer scientist'
>>> from nltk.stem.porter import PorterStemmer
>>> porter = PorterStemmer()
>>> porter.stem(sentence)
'Stemming is funnier than a bummer says the sushi loving computer scientist'

The example above doesn't work because the stemmer treats the whole sentence as a single word; you need to tokenize first:

>>> tokenizer_porter = lambda text: [porter.stem(word) for word in text.split()]
>>> tokenizer_porter(sentence)
['Stem', 'is', 'funnier', 'than', 'a', 'bummer', 'say', 'the', 'sushi', 'love', 'comput', 'scientist']

And you can stem/lemmatize a sentence and get a string back like so:

>>> sentence_porter = lambda text: ' '.join([porter.stem(word) for word in text.split()])
>>> sentence_porter(sentence)
'Stem is funnier than a bummer say the sushi love comput scientist'

To use the custom tokenizer in a pipeline:

>>> import re
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> # stop_words is assumed to be your own stop-word list, defined elsewhere
>>> CountVectorizer(binary=False,
...                 stop_words=stop_words,
...                 ngram_range=(1,1),
...                 preprocessor=lambda text: re.sub('[^a-zA-Z]', ' ', text.lower()),
...                 tokenizer=tokenizer_porter)

If I remember correctly, the "processing" order is

preprocessor -> tokenizer -> ngram_range

Thus, you could also pass "sentence_porter" to the preprocessor and let the tokenizer and ngram_range do the rest. But that would be computationally less efficient, since the text would be split into words twice.

Best,
Sebastian

> On Dec 7, 2015, at 5:00 PM, Andreas Mueller <t3k...@gmail.com> wrote:
>
> Hi.
> I would say what you are doing with lemmatization is not tokenization but
> preprocessing. You are not creating tokens, right? The tokens are the char
> n-grams.
> So what is the problem in using the preprocessing option?
>
> I'm not super familiar with the NLP lingo, though, so I might be missing
> something.
>
> Andy
>
>
> On 11/30/2015 04:45 PM, Philip Tully wrote:
>> Hi all,
>>
>> In the documentation
>> (http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer)
>> it is written that when a callable tokenizer is passed into
>> (Count/TfIdf)Vectorizer, then this "Only applies if analyzer == 'word'" and
>> I can confirm this in the code at
>> https://github.com/scikit-learn/scikit-learn/blob/c957249/sklearn/feature_extraction/text.py#L210
>>
>> But, why is this so? If I want to, for example, perform lemmatization or
>> some other custom tokenization inside a callable Tokenizer, then pass the
>> 'char' or 'char_wb' option to the analyzer because I want to do character
>> grams after that, would this Tokenizer not be called then? Is best practice
>> to migrate these things into the preprocessor= callable param? Or am I
>> misunderstanding the documentation
>>
>> thanks for your help,
>> Philip
>>
>>
>> ------------------------------------------------------------------------------
>> Go from Idea to Many App Stores Faster with Intel(R) XDK
>> Give your users amazing mobile app experiences with Intel(R) XDK.
>> Use one codebase in this all-in-one HTML5 development environment.
>> Design, debug & build mobile apps & 2D/3D high-impact games for multiple OSs.
>>
>> http://pubads.g.doubleclick.net/gampad/clk?id=254741911&iu=/4140
>>
>>
>> _______________________________________________
>> Scikit-learn-general mailing list
>>
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
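
As a footnote to the processing-order question in this thread (preprocessor -> tokenizer -> n-grams, with the tokenizer only used for analyzer == 'word'), here is a minimal pure-Python sketch of that order. It is a simplified model for illustration, not scikit-learn's actual code: `analyze`, `char_ngrams`, and `toy_stem` are hypothetical names, and `toy_stem` is only a crude stand-in for `porter.stem` so that the snippet has no dependencies.

```python
import re


def toy_stem(word):
    """Hypothetical stand-in for porter.stem: strips one common suffix."""
    for suffix in ("ing", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    # Lowercase and replace non-letters with spaces, like the
    # preprocessor lambda in the CountVectorizer example of this thread.
    return re.sub("[^a-zA-Z]", " ", text.lower())


def stemming_preprocess(text):
    # Stemming moved into the preprocessor: string in, string out.
    return " ".join(toy_stem(w) for w in preprocess(text).split())


def stemming_tokenizer(text):
    # Stemming done in the tokenizer: string in, list of tokens out.
    return [toy_stem(w) for w in text.split()]


def char_ngrams(text, n):
    # Collapse runs of whitespace, then slide a window of n characters.
    text = " ".join(text.split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def analyze(text, analyzer="word", preprocessor=preprocess,
            tokenizer=str.split, n=3):
    """Simplified model of the order: preprocessor -> tokenizer -> n-grams."""
    text = preprocessor(text)
    if analyzer == "char":
        # The tokenizer is never called on this path.
        return char_ngrams(text, n)
    return tokenizer(text)  # word unigrams only, for brevity


# Word analyzer: the custom tokenizer is applied.
words = analyze("Loving computers", tokenizer=stemming_tokenizer)

# Char analyzer: the tokenizer is ignored, so nothing gets stemmed.
unstemmed = analyze("Loving computers", analyzer="char",
                    tokenizer=stemming_tokenizer)

# Char analyzer with stemming moved into the preprocessor.
stemmed = analyze("Loving computers", analyzer="char",
                  preprocessor=stemming_preprocess)
```

In this model, `words` is ['lov', 'computer'], `unstemmed` still contains the trigram 'ing', and `stemmed` does not, which is the reason for putting stemming/lemmatization into the preprocessor when character n-grams are wanted.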