I’ve identified a bug/inconsistency in sklearn.feature_extraction.text.
TfidfVectorizer returns a matrix of type scipy.sparse.csr.csr_matrix, whereas
CountVectorizer returns a scipy.sparse.coo.coo_matrix, which doesn’t support
indexing with an index array.

Below is a short (silly) example that demonstrates the problem.  It took me a
while to figure out why I was getting this error in a larger program.  I am
using sklearn.cross_validation.StratifiedKFold, which returns an index array for
each fold, and the program broke when I switched to CountVectorizer.

Regards,
-Tom

############################################################################
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

CORPUS = """
Experienced writers use a variety of sentences to make their writing
interesting and lively. Too many simple sentences, for example, will sound
choppy and immature.  Too many long sentences will be difficult to read
and hard to understand.  This page contains definitions of simple, compound,
and complex sentences.  It has many simple examples.  The purpose of these
examples is to help the ESL/EFL learner to identify sentence basics including
identification of sentences in the short quizzes that follow.  After that, it
will be possible to analyze more complex sentences varieties.
""".split('. ')

CORPUS = np.asarray(CORPUS)
targets = map(lambda sentence: "of" in sentence, CORPUS)
print "Targets:", targets

TFIDF_vec = TfidfVectorizer()
Count_vec = CountVectorizer()

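# Both vectorizers are fit on the same corpus; only the return type differs.
XT = TFIDF_vec.fit_transform(CORPUS)
XC = Count_vec.fit_transform(CORPUS)
print type(XT)    # scipy.sparse.csr.csr_matrix
print type(XC)    # scipy.sparse.coo.coo_matrix
sample = np.arange(3)
print XT[sample]  # works: CSR supports indexing with an index array
print XC[sample]  # fails: COO does not support indexing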
############################################################################
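
A possible workaround, until the return types are made consistent, is to
convert the CountVectorizer output to CSR before indexing it with a fold's
index array.  The sketch below is only an illustration: it uses a small
hand-built coo_matrix as a stand-in for the CountVectorizer output (the values
are made up), and relies on scipy's tocsr() conversion.

```python
from __future__ import print_function

import numpy as np
from scipy import sparse

# Hypothetical stand-in for the CountVectorizer output:
# a small COO matrix with 4 "documents" and 3 "terms".
rows = np.array([0, 1, 2, 3, 3])
cols = np.array([0, 2, 1, 0, 2])
vals = np.array([1, 2, 1, 3, 1])
XC = sparse.coo_matrix((vals, (rows, cols)), shape=(4, 3))

# Convert to CSR, which supports row selection with an integer index array.
XC_csr = XC.tocsr()

sample = np.arange(3)     # same index array as in the example above
subset = XC_csr[sample]   # selects rows 0, 1 and 2
print(type(subset))
print(subset.toarray())
```

In the script above, the equivalent change would be to index
Count_vec.fit_transform(CORPUS).tocsr() instead of XC directly.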


_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
