Hi everyone,

I'm trying to find out how well scikit-learn works for classifying 
Arabic documents.

I got the English tutorial (20 newsgroups dataset) working, but once I 
modified the code (using load_files()) to import Arabic text instead, 
I got errors (see below) :(

The difference between my dataset and the 20 newsgroups dataset is that my 
training and test data are organized in files with one word per line, 
not as sentences (see sample below).
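
To make the format concrete, here is a minimal sketch of how a one-word-per-line 
file could be collapsed into a single document string before vectorizing. The 
helper name words_to_document is hypothetical and is not part of scikit-learn; 
the sketch assumes the text has already been decoded to Unicode.

```python
# -*- coding: utf-8 -*-

def words_to_document(text):
    """Join a one-word-per-line text into one space-separated string.

    Hypothetical helper for illustration only; blank lines are skipped.
    """
    words = [line.strip() for line in text.splitlines() if line.strip()]
    return u" ".join(words)


# A few words from the sample, one per line with blank lines in between.
sample = u"بررت\n\nساندي\n\nاتهاماتها\n"
document = words_to_document(sample)
```

This only normalizes whitespace, so the words themselves are left unchanged.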

I keep getting the following error message and I don't know exactly where 
the problem is. Does scikit-learn work with Arabic letters (using 
Unicode)? If not, how can I make it work?

---------------------------------------------
The error message:
---------------------------------------------
Extracting features from the training dataset using a sparse vectorizer
Traceback (most recent call last):
  File "document_classification_20newsgroups.py", line 103, in <module>
    X_train = vectorizer.fit_transform(data_train.data)
  File "/usr/local/lib/python2.7/dist-packages/scikit_learn-0.10-py2.7-linux-x86_64.egg/sklearn/feature_extraction/text.py", line 564, in fit_transform
    X = self.tc.fit_transform(raw_documents)
  File "/usr/local/lib/python2.7/dist-packages/scikit_learn-0.10-py2.7-linux-x86_64.egg/sklearn/feature_extraction/text.py", line 378, in fit_transform
    return self._term_count_dicts_to_matrix(term_counts_per_doc)
  File "/usr/local/lib/python2.7/dist-packages/scikit_learn-0.10-py2.7-linux-x86_64.egg/sklearn/feature_extraction/text.py", line 291, in _term_count_dicts_to_matrix
    shape = (len(term_count_dicts), max(vocabulary.itervalues()) + 1)
ValueError: max() arg is an empty sequence
-----------------------
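
The ValueError says the vocabulary came out empty, i.e. no tokens were extracted 
at all. One possible cause (an assumption, not something the traceback confirms) 
is that the text is tokenized as raw bytes rather than as decoded Unicode. The 
sketch below uses only Python's re module, not scikit-learn's own tokenizer, and 
the word pattern is an assumption chosen for illustration rather than taken from 
the scikit-learn 0.10 source.

```python
# -*- coding: utf-8 -*-
import re


def tokenize_text(text):
    # Unicode-aware matching: \w covers Arabic letters.
    return re.findall(r"(?u)\b\w\w+\b", text)


def tokenize_bytes(raw):
    # Byte-level matching: \w covers ASCII letters, digits and "_" only.
    return re.findall(br"\b\w\w+\b", raw)


sample = u"بررت ساندي اتهاماتها"

decoded_tokens = tokenize_text(sample)               # three Arabic words
raw_tokens = tokenize_bytes(sample.encode("utf-8"))  # empty list
```

If the real cause is similar, decoding the files with the right charset before 
vectorizing would be the thing to check.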

Could you please point me toward a solution to this issue?



==============
Sample of my data:
بررت

ساندي

اتهاماتها

لقرارها

بسحب

الأغنية

الاستوديو

الخاص

استوديو

اخر

اثار

غضب
===============

Regards
~F
                                          
_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
