TfidfVectorizer is giving me an error on some texts that I am importing. 
I am importing them like this:

for location in humanRatedText:
     if location[-3:].lower() == 'txt':
         f = open(dir+location, "r")
         t = f.read()
         texts.append(t)
         f.close()
     if location[-3:].lower() == 'rtf':
         f = open(dir+location, "r")
         doc = Rtf15Reader.read(f)
         t = PlaintextWriter.write(doc).getvalue()
         texts.append(t)
         f.close()

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
X = vectorizer.fit_transform(texts)


And it's encountering this error:

Traceback (most recent call last):
   File "11.08.py", line 71, in <module>
     X = vectorizer.fit_transform(texts)
   File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 716, in fit_transform
     X = super(TfidfVectorizer, self).fit_transform(raw_documents)
   File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 398, in fit_transform
     term_count_current = Counter(analyze(doc))
   File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 313, in <lambda>
     tokenize(preprocess(self.decode(doc))), stop_words)
   File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 224, in decode
     doc = doc.decode(self.charset, self.charset_error)
   File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
     return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 1253: invalid start byte

Could this be a problem with the encoding of the files that I am reading
in? I can perform other operations on the strings just fine, like
t.lower(). I have 60 files, all of them very long, and only some of them
seem to trigger the error. I'm not sure exactly which ones, so is there
a way to solve this programmatically?
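
For reference, here is a minimal sketch of one way the problem files might be found and decoded up front. It rests on an assumption that is not confirmed for these files: that the ones that fail are Windows-1252 text, where byte 0x92 is the right single quotation mark. The helper names decode_with_fallback and read_text are hypothetical, not part of scikit-learn.

```python
import io


def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Decode a byte string, trying each encoding in order.

    Returns (text, label), where label names the encoding that worked.
    The candidate list is an assumption; adjust it to the real data.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first encoding and substitute U+FFFD
    # for every byte sequence that cannot be decoded.
    return raw.decode(encodings[0], "replace"), encodings[0] + " (with replacement)"


def read_text(path):
    """Read a file as bytes and return (unicode_text, label)."""
    with io.open(path, "rb") as f:
        return decode_with_fallback(f.read())
```

In the loop above, the 'txt' branch could call read_text(dir+location), append the returned text to texts, and print the location whenever the label is not "utf-8", which would show which of the 60 files are affected. Since the traceback shows the vectorizer only calling doc.decode(...) on the raw document, handing it already-decoded unicode strings should avoid that step, though I have not checked this against the installed version.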

Zach

_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
