Re: [Scikit-learn-general] Document classifier with Arabic textual data

Lars Buitinck Mon, 16 Apr 2012 05:28:47 -0700

On 16 April 2012 13:57, Fahd S. Alotaibi
<[email protected]> wrote:
> I keep getting the following error message and I don't know exactly
> where the problem is. Does scikit-learn work with Arabic letters (using
> Unicode)? If not, how can I make it work?


The problem is that Vectorizer by default uses a class called
RomanPreprocessor that strips out any characters that it cannot
translate to ASCII -- meaning all your Arabic letters are simply
ignored and you end up with an empty vocabulary. I admit that the
error message could have been a bit friendlier...
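
To see why the vocabulary ends up empty, here is a minimal standalone
sketch of ASCII folding. It is not the RomanPreprocessor source; the
ascii_fold helper is hypothetical and only mimics the behaviour
described above, assuming Python 3 where str is Unicode.

```python
import unicodedata

def ascii_fold(text):
    """Keep only what survives conversion to ASCII (illustrative helper)."""
    # NFKD splits accented Latin letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Anything that is not ASCII after decomposition is silently dropped.
    return decomposed.encode("ascii", "ignore").decode("ascii")

latin = ascii_fold("café résumé")     # accents are removed, letters remain
arabic = ascii_fold("مرحبا بالعالم")  # no ASCII equivalents: letters vanish

print(repr(latin))
print(repr(arabic))
print(arabic.split())  # no tokens left, hence nothing to build a vocabulary from
```

Accented Latin text keeps its base letters, while Arabic text is
reduced to whitespace, so a tokenizer running afterwards finds no words.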

The solution is to implement your own preprocessor class, e.g.

class NullPreprocessor(object):
    @staticmethod
    def preprocess(text):
        return text

then pass one of those to the Vectorizer as

from sklearn.feature_extraction.text import Vectorizer, WordNGramAnalyzer
vectorizer = Vectorizer(
    analyzer=WordNGramAnalyzer(preprocessor=NullPreprocessor()))

and then fit.
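
As a rough illustration of what the identity preprocessor buys you, the
sketch below runs it through a toy vocabulary builder. TOKEN and
build_vocabulary are hypothetical stand-ins written for this example,
not scikit-learn code, and the snippet assumes Python 3.

```python
import re

class NullPreprocessor(object):
    """Identity preprocessor: hands the text to the tokenizer unchanged."""
    @staticmethod
    def preprocess(text):
        return text

# Hypothetical stand-in for a word analyzer: runs of 2+ word characters.
# In Python 3, \w matches Unicode letters, Arabic included.
TOKEN = re.compile(r"\w\w+")

def build_vocabulary(docs, preprocessor):
    """Map each distinct token to a column index, in sorted order."""
    tokens = set()
    for doc in docs:
        tokens.update(TOKEN.findall(preprocessor.preprocess(doc)))
    return {tok: i for i, tok in enumerate(sorted(tokens))}

docs = ["مرحبا بالعالم", "مرحبا بكم"]
vocab = build_vocabulary(docs, NullPreprocessor())
print(len(vocab))  # number of distinct Arabic tokens found
```

Because nothing is stripped before tokenization, the Arabic words
survive and the vocabulary is no longer empty.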

HTH,

-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam

