On 16 April 2012 05:28, Lars Buitinck <[email protected]> wrote:
> On 16 April 2012 13:57, Fahd S. Alotaibi
> <[email protected]> wrote:
>> I keep getting the following error message and I don't know exactly
>> where the problem is. Would scikit-learn work fine for Arabic letters (using
>> Unicode)? If not, how can I make it work?
>
> The problem is that Vectorizer by default uses a class called
> RomanPreprocessor that strips out any characters that it cannot
> translate to ASCII -- meaning all your Arabic letters are simply
> ignored and you end up with an empty vocabulary. I admit that the
> error message could have been a bit friendlier...
>
> The solution is to implement your own processor class, e.g.
>
> class NullPreprocessor(object):
>    @staticmethod
>    def preprocess(text):
>        return text
>
> then pass one of those to the Vectorizer as
>
> from sklearn.feature_extraction.text import Vectorizer, WordNGramAnalyzer
> vectorizer = Vectorizer(
>     analyzer=WordNGramAnalyzer(preprocessor=NullPreprocessor()))
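
To make the quoted explanation concrete, here is a minimal stdlib-only
sketch of what ASCII-only preprocessing does to Arabic text. The function
strip_to_ascii is a hypothetical stand-in written for this illustration,
not the actual RomanPreprocessor code; it assumes the preprocessor folds
accents and then drops every character that has no ASCII equivalent.

```python
import unicodedata


def strip_to_ascii(text):
    """Hypothetical stand-in for an ASCII-only preprocessor.

    Decomposes accented characters (NFKD), then silently drops every
    code point that cannot be encoded as ASCII.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


latin = "café crème"
arabic = "مرحبا بالعالم"

# Accented Latin letters are folded to their base letters and survive.
latin_ascii = strip_to_ascii(latin)

# Arabic letters have no ASCII counterpart, so only the space survives.
arabic_ascii = strip_to_ascii(arabic)

print(repr(latin_ascii))
print(repr(arabic_ascii))

# Splitting on whitespace yields no tokens at all, which is why the
# vectorizer would end up with an empty vocabulary.
print(arabic_ascii.split())
```

A pass-through preprocessor such as the NullPreprocessor quoted above
avoids this by returning the text unchanged.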

Alternatively, you can use the master branch of scikit-learn, which has a
much simpler API documented here:

  
http://scikit-learn.org/dev/modules/feature_extraction.html#text-feature-extraction

Don't forget to pass the charset to the vectorizer if your data is not
encoded in UTF-8.
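
As a rough illustration of why the charset matters, the stdlib-only sketch
below encodes an Arabic string as Windows-1256 (cp1256 is an assumption
chosen for the example; your data may use a different codec) and shows
that decoding those bytes as UTF-8 fails, while decoding with the right
codec recovers the text. The charset given to the vectorizer is assumed to
play the same role as the codec name passed to bytes.decode here.

```python
text = "مرحبا بالعالم"

# Bytes as they might sit in a file that was not saved as UTF-8.
# cp1256 (Windows Arabic) is only an example codec.
raw = text.encode("cp1256")

# Decoding with the wrong charset: these bytes are not valid UTF-8.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the charset the data was actually written in.
decoded = raw.decode("cp1256")

print("valid UTF-8:", utf8_ok)
print("round trip matches:", decoded == text)
```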

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
