On 16 April 2012 13:57, Fahd S. Alotaibi
<[email protected]> wrote:
> I usually have got the following error message and I don't know exactly
> where is the problem? Would scikit-learn work fine for Arabic letters (using
> Unicode)? if not how to do so?
The problem is that Vectorizer by default uses a class called
RomanPreprocessor that strips out any characters that it cannot
translate to ASCII -- meaning all your Arabic letters are simply
ignored and you end up with an empty vocabulary. I admit that the
error message could have been a bit friendlier...
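To see why the vocabulary ends up empty, here is a minimal sketch of
ASCII folding using only the standard library. It is an assumption about
the general technique, not the actual RomanPreprocessor code, and the
name to_ascii is made up for illustration.

```python
import unicodedata


def to_ascii(text):
    """Decompose accented characters, then drop everything that is not ASCII.

    Hypothetical helper: it only approximates what an ASCII-only
    preprocessor does to its input.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


latin = "café"
arabic = "مرحبا بالعالم"

# Latin text keeps its base letters once the accents are stripped.
latin_folded = to_ascii(latin)

# Arabic letters have no ASCII equivalent, so only whitespace survives.
arabic_folded = to_ascii(arabic)

print(repr(latin_folded))
print(repr(arabic_folded))
```

With nothing but whitespace left, the tokenizer has no words to count.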
The solution is to implement your own processor class, e.g.

    class NullPreprocessor(object):
        @staticmethod
        def preprocess(text):
            return text

then pass an instance of it to the Vectorizer as

    from sklearn.feature_extraction.text import Vectorizer, WordNGramAnalyzer
    vectorizer = Vectorizer(
        analyzer=WordNGramAnalyzer(preprocessor=NullPreprocessor()))

and then fit.
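
If you want to check the effect without scikit-learn, the following is a
rough stand-alone sketch. Everything except NullPreprocessor is
hypothetical: AsciiFoldingPreprocessor and build_vocabulary are stand-ins
I made up to mimic the preprocess-then-tokenize flow, and they are not
part of the scikit-learn API.

```python
import re
import unicodedata


class AsciiFoldingPreprocessor(object):
    """Hypothetical stand-in for a preprocessor that keeps only ASCII."""

    @staticmethod
    def preprocess(text):
        decomposed = unicodedata.normalize("NFKD", text)
        return decomposed.encode("ascii", "ignore").decode("ascii").lower()


class NullPreprocessor(object):
    """Returns the text unchanged, so non-Latin scripts are preserved."""

    @staticmethod
    def preprocess(text):
        return text


def build_vocabulary(documents, preprocessor):
    """Hypothetical helper: collect word tokens of two or more characters."""
    token_pattern = re.compile(r"\w\w+", re.UNICODE)
    vocabulary = set()
    for doc in documents:
        cleaned = preprocessor.preprocess(doc)
        vocabulary.update(token_pattern.findall(cleaned))
    return vocabulary


docs = ["مرحبا بالعالم", "العالم كبير"]

# ASCII folding removes every Arabic letter, so no tokens remain.
empty_vocab = build_vocabulary(docs, AsciiFoldingPreprocessor())

# The null preprocessor leaves the text intact, so the words are found.
full_vocab = build_vocabulary(docs, NullPreprocessor())

print(len(empty_vocab), len(full_vocab))
```

The only difference between the two runs is the preprocessor object,
which is the same knob the Vectorizer call above turns.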
HTH,
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general