Hi,

I think for most implementations you need to tokenize your text into a list of 
unigrams (1-grams) prior to stemming/lemmatization, but you can always call 
" ".join(...) afterwards if you'd like to get the result back as a single string.

>>> sentence = ('Stemming is funnier than a bummer says the sushi loving '
...             'computer scientist')
>>> from nltk.stem.porter import PorterStemmer
>>> porter = PorterStemmer()
>>> porter.stem(sentence)
'Stemming is funnier than a bummer says the sushi loving computer scientist'


The example above doesn't work as intended, because `stem` treats its whole 
input as a single word; you'd need to tokenize first:

>>> tokenizer_porter = lambda text: [porter.stem(word) for word in text.split()]
>>> tokenizer_porter(sentence)
['Stem', 'is', 'funnier', 'than', 'a', 'bummer', 'say', 'the', 'sushi', 'love', 
'comput', 'scientist']


And you can stem a whole sentence and join the results back into a string like so:

>>> sentence_porter = lambda text: ' '.join(
...     [porter.stem(word) for word in text.split()])
>>> sentence_porter(sentence)
'Stem is funnier than a bummer say the sushi love comput scientist'


To use the custom tokenizer in a pipeline:


>>> import re
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vect = CountVectorizer(
...     binary=False,
...     stop_words=stop_words,  # your list of stop words
...     ngram_range=(1, 1),
...     preprocessor=lambda text: re.sub('[^a-zA-Z]', ' ', text.lower()),
...     tokenizer=tokenizer_porter)


If I remember correctly, the "processing" order is

preprocessor -> tokenizer -> ngram_range

Thus, you could also pass "sentence_porter" to the preprocessor and let the 
`tokenizer & ngram_range` do the rest. But that would be computationally less 
efficient.
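To illustrate that ordering, here is a minimal pure-Python sketch of how the 
vectorizer chains these callables (preprocessor, then tokenizer, then n-gram 
extraction). Note `fake_stem` is a toy stand-in for `porter.stem` (it just 
lowercases and strips a trailing "ing"), so this is only a sketch of the 
pipeline, not of real stemming:

```python
import re

def fake_stem(word):
    # Toy stand-in for porter.stem: lowercase, drop a trailing "ing".
    # NOT a real stemmer -- just enough to show where stemming happens.
    word = word.lower()
    return word[:-3] if word.endswith("ing") else word

def preprocessor(text):
    # Same idea as the re.sub preprocessor above: keep letters only.
    return re.sub("[^a-zA-Z]", " ", text.lower())

def tokenizer_porter(text):
    # Tokenize on whitespace, then stem each token.
    return [fake_stem(w) for w in text.split()]

def ngrams(tokens, n):
    # All contiguous n-grams over the token list, joined by spaces.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def analyze(text, ngram_range=(1, 1)):
    # The order under discussion: preprocessor -> tokenizer -> ngram_range.
    tokens = tokenizer_porter(preprocessor(text))
    lo, hi = ngram_range
    return [g for n in range(lo, hi + 1) for g in ngrams(tokens, n)]

print(analyze("Stemming is fun!", ngram_range=(1, 2)))
# -> ['stemm', 'is', 'fun', 'stemm is', 'is fun']
```

Because the n-grams are built from the already-stemmed tokens, stemming inside 
the tokenizer runs once per token; stemming inside the preprocessor would force 
an extra split/join pass over every document first.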

Best,
Sebastian


> On Dec 7, 2015, at 5:00 PM, Andreas Mueller <t3k...@gmail.com> wrote:
> 
> Hi.
> I would say what you are doing with lemmatization is not tokenization but 
> preprocessing. You are not creating tokens, right? The tokens are the char 
> n-grams.
> So what is the problem in using the preprocessing option?
> 
> I'm not super familiar with the NLP lingo, though, so I might be missing 
> something.
> 
> Andy
> 
> 
> On 11/30/2015 04:45 PM, Philip Tully wrote:
>> Hi all,
>> 
>> In the documentation 
>> (http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer)
>>  it is written that when a callable tokenizer is passed into 
>> (Count/TfIdf)Vectorizer, then this "Only applies if analyzer == 'word'" and 
>> I can confirm this in the code at 
>> https://github.com/scikit-learn/scikit-learn/blob/c957249/sklearn/feature_extraction/text.py#L210
>> 
>> But, why is this so? If I want to, for example, perform lemmatization or 
>> some other custom tokenization inside a callable Tokenizer, then pass the 
>> 'char' or 'char_wb' option to the analyzer because I want to do character 
>> grams after that, would this Tokenizer not be called then? Is best practice 
>> to migrate these things into the preprocessor= callable param? Or am I 
>> misunderstanding the documentation
>> 
>> thanks for your help,
>> Philip
>> 
>> 
>> ------------------------------------------------------------------------------
>> Go from Idea to Many App Stores Faster with Intel(R) XDK
>> Give your users amazing mobile app experiences with Intel(R) XDK.
>> Use one codebase in this all-in-one HTML5 development environment.
>> Design, debug & build mobile apps & 2D/3D high-impact games for multiple OSs.
>> 
>> http://pubads.g.doubleclick.net/gampad/clk?id=254741911&iu=/4140
>> 
>> 
>> _______________________________________________
>> Scikit-learn-general mailing list
>> 
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> 

