Hi there,
I looked into WordNGramAnalyzer in feature_extraction/text.py.
It occurred to me that, for n-grams with n > 1, 'handle token n-grams' happens
before 'handle stop words', as shown in the following snippet:
    # handle token n-grams
    if self.min_n != 1 or self.max_n != 1:
        original_tokens = tokens
        tokens = []
        n_original_tokens = len(original_tokens)
        for n in xrange(self.min_n,
                        min(self.max_n + 1, n_original_tokens + 1)):
            for i in xrange(n_original_tokens - n + 1):
                tokens.append(u" ".join(original_tokens[i: i + n]))

    # handle stop words
    if self.stop_words is not None:
        tokens = [w for w in tokens if w not in self.stop_words]
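To illustrate what I mean, here is a small self-contained sketch of the same
logic (Python 3, so range instead of xrange; the function name and inputs are
just illustrative, not scikit-learn's API). Because the stop word filter runs
after the n-grams are built, it only removes exact matches, i.e. the stop
words themselves as unigrams, while bigrams containing them survive:

```python
def analyze(tokens, min_n, max_n, stop_words):
    # handle token n-grams first, mirroring the snippet above
    if min_n != 1 or max_n != 1:
        original_tokens = tokens
        tokens = []
        n_original_tokens = len(original_tokens)
        for n in range(min_n, min(max_n + 1, n_original_tokens + 1)):
            for i in range(n_original_tokens - n + 1):
                tokens.append(" ".join(original_tokens[i: i + n]))
    # stop words are filtered afterwards, so only exact matches
    # (the stop-word unigrams themselves) are dropped
    if stop_words is not None:
        tokens = [w for w in tokens if w not in stop_words]
    return tokens

print(analyze(["the", "quick", "fox"], 1, 2, {"the"}))
# → ['quick', 'fox', 'the quick', 'quick fox']
```

Note that the unigram "the" is removed, but the bigram "the quick" still
contains the stop word.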
At least it seems strange to me: especially when I define my own stop words,
I would expect those stop words not to appear inside the n-grams either.
Is there any special consideration behind this implementation? Thanks.
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general