Re: [Scikit-learn-general] Analyzer and tokenizer in (Count/TfIdf)Vectorizer

Vlad Niculae Mon, 07 Dec 2015 14:27:23 -0800

In the case of "char_wb" it sounds indeed like a custom tokenizer
should be called if given. That would require a different
implementation than the current one, however. You might want to file
an issue.


Sebastian's suggestion works, but note that scikit-learn's default
tokenization is not the same as `text.split()`, in particular in the
context of non-alphanumeric characters and single-letter tokens.
(However, char_wb does not use the same tokenization as the default
word analyzer.)
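
To make the difference concrete, here is a small sketch using only the
standard library. It assumes the default word `token_pattern` is
`(?u)\b\w\w+\b` (as documented for CountVectorizer) and that the default
preprocessing lowercases the text; the sample sentence is made up.

```python
import re

# Assumption: this mirrors the documented default `token_pattern`
# (tokens of two or more word characters, punctuation treated as a separator).
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

text = "I can't e-mail a U.S. address."

# Plain whitespace splitting keeps punctuation and single-letter tokens.
whitespace_tokens = text.split()

# Default-style tokenization: lowercase, then apply the regex.
default_style_tokens = DEFAULT_TOKEN_PATTERN.findall(text.lower())

print(whitespace_tokens)
print(default_style_tokens)
```

Single-letter pieces such as "I", "a", and the "t" of "can't" disappear
under the regex, and punctuation never ends up inside a token.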

For advanced applications, however, it's usually easiest to just write
a custom end-to-end *analyzer*. Inside it, you can reuse the default
preprocessor/tokenizer code if you want.
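
A minimal sketch of such an analyzer, for the lemmatize-then-character-n-grams
case from the original question. Everything here is illustrative: the
`lemmatize` function and its toy lookup table are hypothetical stand-ins for
a real lemmatizer, the regex is assumed to match the default word pattern,
and the padding only imitates the spirit of "char_wb" rather than
reproducing it exactly.

```python
import re

# Assumed to match the default word token pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Hypothetical toy lemmatizer; swap in a real one from an NLP library.
TOY_LEMMAS = {"running": "run", "ran": "run", "dogs": "dog"}


def lemmatize(token):
    return TOY_LEMMAS.get(token, token)


def lemma_char_wb_analyzer(doc, n=3):
    """Lowercase, tokenize, lemmatize, then emit character n-grams per token."""
    features = []
    for token in TOKEN_PATTERN.findall(doc.lower()):
        # Pad with spaces so n-grams never cross a word boundary.
        padded = " " + lemmatize(token) + " "
        # Short words yield a single (shorter) feature instead of nothing.
        for i in range(max(1, len(padded) - n + 1)):
            features.append(padded[i:i + n])
    return features


print(lemma_char_wb_analyzer("Dogs ran"))
```

A callable like this can be passed as `analyzer=lemma_char_wb_analyzer` to
CountVectorizer or TfidfVectorizer. If you would rather not hand-roll the
lowercasing and regex, `build_preprocessor()` and `build_tokenizer()` on a
vectorizer instance return the default steps as callables.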

HTH,
Vlad

On Mon, Dec 7, 2015 at 5:00 PM, Andreas Mueller <t3k...@gmail.com> wrote:
> Hi.
> I would say what you are doing with lemmatization is not tokenization but
> preprocessing. You are not creating tokens, right? The tokens are the char
> n-grams.
> So what is the problem in using the preprocessing option?
>
> I'm not super familiar with the NLP lingo, though, so I might be missing
> something.
>
> Andy
>
>
>
> On 11/30/2015 04:45 PM, Philip Tully wrote:
>
> Hi all,
>
> In the documentation
> (http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer)
> it is written that when a callable tokenizer is passed into
> (Count/TfIdf)Vectorizer, then this "Only applies if analyzer == 'word'" and
> I can confirm this in the code at
> https://github.com/scikit-learn/scikit-learn/blob/c957249/sklearn/feature_extraction/text.py#L210
>
> But why is this so? If I want to, for example, perform lemmatization or
> some other custom tokenization inside a callable tokenizer, and then pass
> the 'char' or 'char_wb' option as the analyzer because I want character
> n-grams after that, would the tokenizer not be called? Is the best practice
> to move these steps into the preprocessor= callable parameter? Or am I
> misunderstanding the documentation?
>
> thanks for your help,
> Philip
>
>
>
>
>
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>

