Hi there,
I looked into WordNGramAnalyzer in feature_extraction/text.py.
It occurred to me that, for n-grams with n > 1, 'handle token n-grams' happens
before 'handle stop words', as shown in the following snippet:
    # handle token n-grams
    if self.min_n != 1 or self.max_n != 1:
        original_tokens = tokens
        tokens = []
        n_original_tokens = len(original_tokens)
        for n in xrange(self.min_n,
                        min(self.max_n + 1, n_original_tokens + 1)):
            for i in xrange(n_original_tokens - n + 1):
                tokens.append(u" ".join(original_tokens[i: i + n]))

    # handle stop words
    if self.stop_words is not None:
        tokens = [w for w in tokens if w not in self.stop_words]
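To illustrate what I mean, here is a small self-contained sketch of the same
logic (Python 3, so range instead of xrange; the function name and inputs are
just illustrative, not scikit-learn's API). Because the stop word filter runs
after the n-grams are built, it only removes exact matches, i.e. the stop
words themselves as unigrams, while bigrams containing them survive:

```python
def analyze(tokens, min_n, max_n, stop_words):
    # handle token n-grams first, mirroring the snippet above
    if min_n != 1 or max_n != 1:
        original_tokens = tokens
        tokens = []
        n_original_tokens = len(original_tokens)
        for n in range(min_n, min(max_n + 1, n_original_tokens + 1)):
            for i in range(n_original_tokens - n + 1):
                tokens.append(" ".join(original_tokens[i: i + n]))
    # stop words are filtered afterwards, so only exact matches
    # (the stop-word unigrams themselves) are dropped
    if stop_words is not None:
        tokens = [w for w in tokens if w not in stop_words]
    return tokens

print(analyze(["the", "quick", "fox"], 1, 2, {"the"}))
# → ['quick', 'fox', 'the quick', 'quick fox']
```

Note that the unigram "the" is removed, but the bigram "the quick" still
contains the stop word.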
At least it seems strange to me: especially when I define my own stop words,
I would expect those stop words not to appear inside the n-grams either.
Is there any special consideration behind this implementation? Thanks.
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general