Hi everyone,
I'm trying to investigate how well scikit-learn works for classifying
Arabic documents.
I worked through the English tutorial (20 newsgroups dataset) successfully,
but once I modified the code (using load_files()) to import Arabic text
instead, I got errors (see below) :(
The difference between my dataset and the 20 newsgroups dataset is that my
training and test data are organized in files with one word per line,
not as sentences (see sample below).
I usually get the following error message and I don't know exactly where
the problem is. Does scikit-learn work with Arabic letters (using
Unicode)? If not, how can I make it work?
---------------------------------------------
The error message:
---------------------------------------------
Extracting features from the training dataset using a sparse vectorizer
Traceback (most recent call last):
  File "document_classification_20newsgroups.py", line 103, in <module>
    X_train = vectorizer.fit_transform(data_train.data)
  File "/usr/local/lib/python2.7/dist-packages/scikit_learn-0.10-py2.7-linux-x86_64.egg/sklearn/feature_extraction/text.py", line 564, in fit_transform
    X = self.tc.fit_transform(raw_documents)
  File "/usr/local/lib/python2.7/dist-packages/scikit_learn-0.10-py2.7-linux-x86_64.egg/sklearn/feature_extraction/text.py", line 378, in fit_transform
    return self._term_count_dicts_to_matrix(term_counts_per_doc)
  File "/usr/local/lib/python2.7/dist-packages/scikit_learn-0.10-py2.7-linux-x86_64.egg/sklearn/feature_extraction/text.py", line 291, in _term_count_dicts_to_matrix
    shape = (len(term_count_dicts), max(vocabulary.itervalues()) + 1)
ValueError: max() arg is an empty sequence
-----------------------
Could you please point me in the right direction to solve this issue?
==============
Sample of my data:
بررت
ساندي
اتهاماتها
لقرارها
بسحب
الأغنية
الاستوديو
الخاص
استوديو
اخر
اثار
غضب
===============
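
The traceback shows max() being called on an empty vocabulary, which would happen if no tokens were extracted from the documents. The sketch below is a standalone sanity check, not scikit-learn's actual code. It assumes the files are UTF-8 encoded and that the tokenizer uses a word pattern like `\b\w\w+\b`; the helper `build_vocabulary` is hypothetical and only mimics the idea of mapping tokens to column indices. It uses Python 3 syntax and compares a Unicode-aware pattern with an ASCII-only one on the sample words above.

```python
import re

# Assumption: a word pattern of two or more word characters, similar in
# spirit to what text vectorizers commonly use. Not taken from sklearn 0.10.
UNICODE_WORDS = re.compile(r"\b\w\w+\b", re.UNICODE)
ASCII_WORDS = re.compile(r"\b\w\w+\b", re.ASCII)


def build_vocabulary(documents, pattern):
    """Hypothetical helper: map each distinct token to a column index."""
    vocabulary = {}
    for doc in documents:
        for token in pattern.findall(doc):
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


# The sample data from this post, one word per line.
sample = "\n".join([
    "بررت", "ساندي", "اتهاماتها", "لقرارها", "بسحب", "الأغنية",
    "الاستوديو", "الخاص", "استوديو", "اخر", "اثار", "غضب",
])

# Round-trip through UTF-8 bytes to mimic reading the file from disk.
documents = [sample.encode("utf-8").decode("utf-8")]

unicode_vocab = build_vocabulary(documents, UNICODE_WORDS)
ascii_vocab = build_vocabulary(documents, ASCII_WORDS)

print(len(unicode_vocab))  # number of tokens found by the Unicode-aware pattern
print(len(ascii_vocab))    # number of tokens found by the ASCII-only pattern

# With an empty vocabulary, max() raises ValueError, as in the traceback.
try:
    max(ascii_vocab.values())
    empty_vocab_error = False
except ValueError:
    empty_vocab_error = True
print(empty_vocab_error)
```

If the Unicode-aware pattern finds the words and the ASCII-only one finds none, the one-word-per-line layout is probably not the cause. The more likely suspect would then be how the text is decoded or tokenized before the vocabulary is built.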
Regards
~F