I've identified a bug/inconsistency in sklearn.feature_extraction.text.
TfidfVectorizer returns a matrix of type scipy.sparse.csr.csr_matrix, whereas
CountVectorizer returns a scipy.sparse.coo.coo_matrix, which doesn't support
multiple (array) indexing.
Below is a short (silly) example that demonstrates the problem. It took me a
while to figure out why I was getting this error in a larger program. I am
using sklearn.cross_validation.StratifiedKFold, which returns an index array
for each fold, and the program broke when I started using CountVectorizer.
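
A possible workaround, sketched here under my own assumptions rather than as
the library's intended usage: convert the matrix to CSR with tocsr() before
indexing it with a fold's index array. The small dense array below is a
made-up stand-in for the vectorizer output, so the snippet depends only on
numpy and scipy.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for a vectorizer's output: a small count matrix
# stored in COO format (4 documents, 3 terms).
dense = np.array([[1, 0, 2],
                  [0, 3, 0],
                  [4, 0, 0],
                  [0, 0, 5]])
X_coo = sparse.coo_matrix(dense)

# Convert to CSR, which supports selecting rows with an integer array.
X_csr = X_coo.tocsr()

# Indices of the rows to keep, e.g. one cross-validation fold.
sample = np.arange(3)
X_subset = X_csr[sample]

print(type(X_subset))
print(X_subset.toarray())
```

Calling tocsr() on a matrix that is already CSR is harmless, so the same
conversion can be applied to the output of either vectorizer.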
Regards,
-Tom
############################################################################
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
CORPUS = """
Experienced writers use a variety of sentences to make their writing
interesting and lively. Too many simple sentences, for example, will sound
choppy and immature. Too many long sentences will be difficult to read
and hard to understand. This page contains definitions of simple, compound,
and complex sentences. It has many simple examples. The purpose of these
examples is to help the ESL/EFL learner to identify sentence basics including
identification of sentences in the short quizzes that follow. After that, it
will be possible to analyze more complex sentences varieties.
""".split('. ')
CORPUS = np.asarray(CORPUS)
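# Flag the sentences that contain "of" (dummy target labels).
targets = ["of" in sentence for sentence in CORPUS]
print "Targets:", targets
TFIDF_vec = TfidfVectorizer()
Count_vec = CountVectorizer()
XT = TFIDF_vec.fit_transform(CORPUS)
XC = Count_vec.fit_transform(CORPUS)
print type(XT)  # scipy.sparse.csr.csr_matrix
print type(XC)  # scipy.sparse.coo.coo_matrix
sample = np.arange(3)
print XT[sample]  # works: csr_matrix supports array indexing
print XC[sample]  # fails: coo_matrix does not support indexing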
############################################################################
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general