Hi,
I think for most implementations you need to tokenize your text into a list of
1-grams prior to stemming/lemmatization, but you can always call ' '.join(...)
or so if you'd like to get the results back as a single string.
>>> sentence = 'Stemming is funnier than a bummer says the sushi loving computer scientist'
>>> from nltk.stem.porter import PorterStemmer
>>> porter = PorterStemmer()
>>> porter.stem(sentence)
'Stemming is funnier than a bummer says the sushi loving computer scientist'
The example above doesn't work as intended, since `stem` expects a single word
rather than a whole sentence; you'd need to tokenize first:
>>> tokenizer_porter = lambda text: [porter.stem(word) for word in text.split()]
>>> tokenizer_porter(sentence)
['Stem', 'is', 'funnier', 'than', 'a', 'bummer', 'say', 'the', 'sushi', 'love',
'comput', 'scientist']
And you can stem/lemmatize a whole sentence and get a string back like so:
>>> sentence_porter = lambda text: ' '.join([porter.stem(word)
...                                          for word in text.split()])
>>> sentence_porter(sentence)
'Stem is funnier than a bummer say the sushi love comput scientist'
To use the custom tokenizer in a pipeline:
>>> CountVectorizer(binary=False,
...                 stop_words=stop_words,
...                 ngram_range=(1, 1),
...                 preprocessor=lambda text: re.sub('[^a-zA-Z]', ' ', text.lower()),
...                 tokenizer=tokenizer_porter)
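For reference, here is a self-contained sketch of that setup (assuming nltk and
scikit-learn are installed; `stop_words='english'` is used here as a stand-in
for the undefined `stop_words` list above):

```python
import re

from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

porter = PorterStemmer()

# split on whitespace and stem each token
tokenizer_porter = lambda text: [porter.stem(word) for word in text.split()]

# preprocessor cleans and lowercases; tokenizer splits and stems;
# stop word filtering then runs on the stemmed tokens
vect = CountVectorizer(binary=False,
                       stop_words='english',  # stand-in for the stop_words list
                       ngram_range=(1, 1),
                       preprocessor=lambda text: re.sub('[^a-zA-Z]', ' ', text.lower()),
                       tokenizer=tokenizer_porter)
X = vect.fit_transform(['Stemming is funnier than a bummer says the '
                        'sushi loving computer scientist'])
```

After fitting, `vect.vocabulary_` contains the stemmed forms (e.g. 'stem',
'comput') with the English stop words filtered out.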
If I remember correctly, the "processing" order is
preprocessor -> tokenizer -> ngram extraction
Thus, you could also pass `sentence_porter` as the preprocessor and let the
default tokenizer and ngram_range do the rest. But that would be
computationally less efficient.
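For completeness, a sketch of that alternative (assuming nltk and scikit-learn
are installed): the stemming happens in the preprocessor, which returns a
single string, and the default word tokenizer then splits it again.

```python
import re

from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

porter = PorterStemmer()

def sentence_porter(text):
    # lowercase, drop non-letters, stem each token, and join back to a string
    text = re.sub('[^a-zA-Z]', ' ', text.lower())
    return ' '.join(porter.stem(word) for word in text.split())

# stemming runs in the preprocessor; the default tokenizer and
# ngram_range then operate on the already-stemmed string
vect = CountVectorizer(preprocessor=sentence_porter, ngram_range=(1, 1))
X = vect.fit_transform(['Stemming is funnier than a bummer says the '
                        'sushi loving computer scientist'])
```

Note that each document gets stemmed, rejoined, and re-split, which is the
extra work mentioned above.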
Best,
Sebastian
> On Dec 7, 2015, at 5:00 PM, Andreas Mueller <[email protected]> wrote:
>
> Hi.
> I would say what you are doing with lemmatization is not tokenization but
> preprocessing. You are not creating tokens, right? The tokens are the char
> n-grams.
> So what is the problem in using the preprocessing option?
>
> I'm not super familiar with the NLP lingo, though, so I might be missing
> something.
>
> Andy
>
>
> On 11/30/2015 04:45 PM, Philip Tully wrote:
>> Hi all,
>>
>> In the documentation
>> (http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer)
>> it is written that when a callable tokenizer is passed into
>> (Count/TfIdf)Vectorizer, then this "Only applies if analyzer == 'word'" and
>> I can confirm this in the code at
>> https://github.com/scikit-learn/scikit-learn/blob/c957249/sklearn/feature_extraction/text.py#L210
>>
>> But, why is this so? If I want to, for example, perform lemmatization or
>> some other custom tokenization inside a callable Tokenizer, then pass the
>> 'char' or 'char_wb' option to the analyzer because I want to do character
>> grams after that, would this Tokenizer not be called then? Is best practice
>> to migrate these things into the preprocessor= callable param? Or am I
>> misunderstanding the documentation
>>
>> thanks for your help,
>> Philip
>>
>>
>> ------------------------------------------------------------------------------
>> Go from Idea to Many App Stores Faster with Intel(R) XDK
>> Give your users amazing mobile app experiences with Intel(R) XDK.
>> Use one codebase in this all-in-one HTML5 development environment.
>> Design, debug & build mobile apps & 2D/3D high-impact games for multiple OSs.
>>
>> http://pubads.g.doubleclick.net/gampad/clk?id=254741911&iu=/4140
>>
>>
>> _______________________________________________
>> Scikit-learn-general mailing list
>>
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>