On 16 April 2012 05:28, Lars Buitinck <[email protected]> wrote:
> On 16 April 2012 13:57, Fahd S. Alotaibi <[email protected]> wrote:
>> I usually have got the following error message and I don't know exactly
>> where is the problem? Would scikit-learn work fine for Arabic letters (using
>> Unicode)? if not how to do so?
>
> The problem is that Vectorizer by default uses a class called
> RomanPreprocessor that strips out any characters that it cannot
> translate to ASCII -- meaning all your Arabic letters are simply
> ignored and you end up with an empty vocabulary. I admit that the
> error message could have been a bit friendlier...
>
> The solution is to implement your own processor class, e.g.
>
>     class NullPreprocessor(object):
>         @staticmethod
>         def preprocess(text):
>             return text
>
> then pass one of those to the Vectorizer as
>
>     from sklearn.feature_extraction.text import Vectorizer, WordNGramAnalyzer
>     vectorizer = Vectorizer(
>         analyzer=WordNGramAnalyzer(preprocessor=NullPreprocessor()))
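
To make the failure mode described above concrete, here is a minimal standard-library sketch of what ASCII folding does to Arabic text. It is not scikit-learn's actual code: the names ascii_fold, identity and tokenize are hypothetical, and the token pattern (two or more word characters) is an assumption chosen for illustration.

```python
import re
import unicodedata


def ascii_fold(text):
    """Decompose accented characters, then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def identity(text):
    """Leave the text untouched, in the spirit of the NullPreprocessor quoted above."""
    return text


def tokenize(text, preprocess):
    """Preprocess, lowercase, then extract tokens of two or more word characters."""
    # For str patterns in Python 3, \w is Unicode-aware, so Arabic letters match.
    return re.findall(r"\w\w+", preprocess(text).lower())


latin = "Café au lait"
arabic = "مرحبا بالعالم"

folded_latin = tokenize(latin, ascii_fold)    # accents are folded, words survive
folded_arabic = tokenize(arabic, ascii_fold)  # every letter is dropped: no tokens
kept_arabic = tokenize(arabic, identity)      # both words survive

print(folded_latin, folded_arabic, kept_arabic)
```

With folding, the Latin sample keeps its words while the Arabic sample yields no tokens at all, which is how an empty vocabulary arises. With the identity preprocessor both Arabic words are kept.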

Alternatively, you can use the master branch of scikit-learn, which has a
much simpler API, documented here:

http://scikit-learn.org/dev/modules/feature_extraction.html#text-feature-extraction

Don't forget to pass the charset to the vectorizer if your data is not
encoded in UTF-8.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
