TfidfVectorizer is giving me an error on some texts that I am importing.
I am importing them like this:
for location in humanRatedText:
    if location[-3:].lower() == 'txt':
        f = open(dir+location, "r")
        t = f.read()
        texts.append(t)
        f.close()
    if location[-3:].lower() == 'rtf':
        f = open(dir+location, "r")
        doc = Rtf15Reader.read(f)
        t = PlaintextWriter.write(doc).getvalue()
        texts.append(t)
        f.close()

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
X = vectorizer.fit_transform(texts)
And it's encountering this error:
Traceback (most recent call last):
  File "11.08.py", line 71, in <module>
    X = vectorizer.fit_transform(texts)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 716, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 398, in fit_transform
    term_count_current = Counter(analyze(doc))
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 313, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 224, in decode
    doc = doc.decode(self.charset, self.charset_error)
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 1253: invalid start byte
Could this be a problem with the encoding of the files that I am reading
in? I can perform other operations on the strings just fine, like
t.lower(). I have 60 files, all of them very long, and only some of them
seem to trigger the error. I'm not sure which ones exactly, so is there
a way to solve this programmatically?
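
To make the question concrete, here is a minimal sketch of the kind of
programmatic check I have in mind. It rests on an assumption I have not
verified: that the offending files are Windows-1252, where byte 0x92 is the
right single quotation mark. The helper names read_text and load_texts are
hypothetical and not part of scikit-learn, and the sketch only covers the
plain .txt branch, not the RTF one.

```python
import os


def read_text(path, fallback="cp1252"):
    """Read a file as bytes and decode it to a unicode string.

    UTF-8 is tried first. If that fails, the bytes are decoded with the
    fallback code page (assumed here to be Windows-1252).
    Returns (text, used_fallback).
    """
    with open(path, "rb") as f:
        raw = f.read()
    try:
        return raw.decode("utf-8"), False
    except UnicodeDecodeError:
        # Bytes undefined in the fallback code page become U+FFFD
        # instead of raising a second error.
        return raw.decode(fallback, "replace"), True


def load_texts(directory, names):
    """Decode every named file and report which ones were not valid UTF-8."""
    texts, suspects = [], []
    for name in names:
        text, used_fallback = read_text(os.path.join(directory, name))
        texts.append(text)
        if used_fallback:
            suspects.append(name)
    return texts, suspects
```

If this is on the right track, passing the already-decoded strings to
fit_transform should sidestep the vectorizer's own UTF-8 decode step, and
the suspects list would tell me which of the 60 files are not valid UTF-8.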
Zach