Oh, right (duh)...I wasn't thinking clearly about the padding for char_wb.
I'll do some tests with stopword removal for char_wb and submit a PR if it
looks worthwhile.

Cheers,
Fred.


On 19 July 2013 13:27, Olivier Grisel <[email protected]> wrote:

> 2013/7/19 Fred Mailhot <[email protected]>:
> > Hello list...
>
> Hi Fred,
>
> > I'm a huge fan of sklearn and use it daily at work. I was confused by the
> > results of some recent text classification experiments and started looking
> > more closely at the vectorization code.
> >
> > I'm wondering about the logic behind:
> >
> > 1) not doing stopword removal for the char_wb analyzer in CountVectorizer?
>
> I did not think about it, as stopwords are traditionally used with
> "real" words, but I have no objection to applying stopwords more
> consistently. Please feel free to submit a PR with the fix along with
> a new test case.
>
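To make the proposal concrete, here is a minimal pure-Python sketch of what stopword removal ahead of char_wb-style n-gram extraction could look like. It is not scikit-learn's code: the function name `char_wb_ngrams_with_stop_words`, its parameters, and the sample sentence are all made up for illustration, and the tokenization is a plain whitespace split.

```python
def char_wb_ngrams_with_stop_words(text, n=3, stop_words=frozenset()):
    """Hypothetical helper: drop stop words, then emit character n-grams
    from each remaining word padded with one space on each side."""
    grams = []
    for word in text.lower().split():
        if word in stop_words:
            continue  # skip the stop word before any n-gram is built from it
        padded = " " + word + " "
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


doc = "the cat sat on the mat"
unfiltered = char_wb_ngrams_with_stop_words(doc)
filtered = char_wb_ngrams_with_stop_words(doc, stop_words={"the", "on"})
```

Without filtering, n-grams such as "the" and " on" become features; with filtering, only n-grams from "cat", "sat" and "mat" remain.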
> > (I'm using FeatureUnion to combine vectorizers for word and char ngrams, and
> > the char analyzer is getting tripped up on stopword ngrams)
>
> I don't understand what you mean by that. Do you have an example?
>
> > and
> >
> > 2) padding tokens with a single space in the char_wb analyzer (I'm guessing
> > this is to disambiguate ngrams that occur at word boundaries from those that
> > don't,
>
> Yes.
>
> > but why not pad with (n-1) spaces?)
>
> Why would you do that? That would (re)create char ngram features that
> are already generated by lower n ngrams.
>
> For instance, with ngram_range=(3, 5), if you pad with more than one
> whitespace character you would generate 5-grams that duplicate the
> 4-grams already generated, only with a different feature name and thus a
> different column. That would add redundancy to the features without adding
> any new signal, if I am correct.
>
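The redundancy argument above can be checked with a small pure-Python sketch. This is a simplified stand-in for a char_wb-style analyzer, not scikit-learn's implementation: the helper name `padded_char_ngrams` and its `pad` parameter are assumptions made for illustration, and words shorter than n are simply skipped.

```python
def padded_char_ngrams(text, ngram_range=(3, 5), pad=1):
    """Hypothetical helper: pad each whitespace-separated word with `pad`
    spaces on each side and return all character n-grams in ngram_range."""
    min_n, max_n = ngram_range
    grams = []
    for word in text.split():
        padded = " " * pad + word + " " * pad
        for n in range(min_n, max_n + 1):
            for i in range(len(padded) - n + 1):
                grams.append(padded[i:i + n])
    return grams


single = set(padded_char_ngrams("hello", (3, 5), pad=1))
double = set(padded_char_ngrams("hello", (3, 5), pad=2))

# 5-grams that only appear once a second padding space is added
extra_5grams = sorted(g for g in double - single if len(g) == 5)

# Collapse the doubled edge space: what is left is a 4-gram
collapsed = [g.replace("  ", " ") for g in extra_5grams]
already_present = all(c in single for c in collapsed)
```

For the word "hello", the only new 5-grams are "  hel" and "llo  ". Once the doubled space is collapsed they are the 4-grams " hel" and "llo ", which single-space padding already produces, so the extra columns carry the same information under a different name.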
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
>
>
> ------------------------------------------------------------------------------
> See everything from the browser to the database with AppDynamics
> Get end-to-end visibility with application monitoring from AppDynamics
> Isolate bottlenecks and diagnose root cause in seconds.
> Start your free trial of AppDynamics Pro today!
> http://pubads.g.doubleclick.net/gampad/clk?id=48808831&iu=/4140/ostg.clktrk
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>