Hello everyone,
I'm starting to use scikit-learn for some NLP. I've already used some
classifiers from NLTK, and now I want to try the ones implemented in
scikit-learn.
My data is basically sentences, and I extract features from certain words in
those sentences to do a classification task. Most of my features are
nominal: the part-of-speech (POS) tag of the word, the word to the left and
its POS tag, the word to the right and its POS tag, the syntactic-relation
path from one word to another, etc.
When I ran some experiments with the NLTK classifiers (decision tree,
naive Bayes), the feature set was just a dictionary mapping each feature
name to its nominal value, e.g. [ {"postag":"noun", "wleft":"house",
"path":"VPNPNP",...}, ... ]. I just had to pass this to the classifiers
and they did their job.
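For reference, a minimal sketch of the NLTK setup (the feature values and
labels here are made up for illustration, not my real data):

```python
import nltk

# Toy labeled featuresets in the format NLTK classifiers accept:
# a list of (feature_dict, label) pairs with nominal string values.
train = [
    ({"postag": "noun", "wleft": "house", "path": "VPNPNP"}, "ARG0"),
    ({"postag": "verb", "wleft": "the", "path": "NPVP"}, "ARG1"),
]

# NLTK consumes the dicts directly; no numeric encoding is needed.
classifier = nltk.NaiveBayesClassifier.train(train)
label = classifier.classify({"postag": "noun", "wleft": "house", "path": "VPNPNP"})
```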
Now I want to try out the classifiers in the scikit-learn package. As I
understand it, this type of feature set is not acceptable to the algorithms
implemented in sklearn, since all feature values must be numeric and they
have to be in an array or matrix. Therefore, I transformed the "original"
feature sets using the DictVectorizer class. However, when I pass the
transformed vectors, I get the following errors:
# With DecisionTreeClassifier
Traceback (most recent call last):
.....
self.classifier.fit(train_argcands_feats,new_train_argcands_target)
File "/usr/local/lib/python2.7/dist-packages/sklearn/tree/tree.py", line
458, in fit
X = np.asarray(X, dtype=DTYPE, order='F')
File "/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.py", line
235, in asarray
return array(a, dtype, copy=False, order=order)
TypeError: float() argument must be a string or a number
# With GaussianNB
Traceback (most recent call last):
....
self.classifier.fit(train_argcands_feats,new_train_argcands_target)
File "/usr/local/lib/python2.7/dist-packages/sklearn/naive_bayes.py",
line 156, in fit
n_samples, n_features = X.shape
ValueError: need more than 0 values to unpack
I get these errors when I use DictVectorizer() with its default settings.
However, if I use DictVectorizer(sparse=False), I get an error even before
the code reaches the training part:
Traceback (most recent call last):
train_argcands_feats = self.feat_vectorizer.fit_transform(train_argcands_feats)
File
"/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/dict_vectorizer.py",
line 123, in fit_transform
return self.transform(X)
File
"/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/dict_vectorizer.py",
line 212, in transform
Xa = np.zeros((len(X), len(vocab)), dtype=dtype)
ValueError: array is too big.
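To make the setup concrete, here is a minimal, self-contained sketch of the
vectorization step with toy feature dicts (my real data is of course much
larger, and the names are simplified):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Toy feature dicts standing in for my real nominal features.
train_feats = [
    {"postag": "noun", "wleft": "house", "path": "VPNPNP"},
    {"postag": "verb", "wleft": "the", "path": "NPVP"},
]
train_target = [0, 1]

# DictVectorizer creates one binary column per feature=value pair.
vec = DictVectorizer(sparse=False)  # dense numpy output
X = vec.fit_transform(train_feats)  # shape (2, 6): 6 distinct pairs here

clf = DecisionTreeClassifier()
clf.fit(X, train_target)
```

On this tiny example everything runs fine for me; the errors above only show
up with the full data set.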
What could be causing this? What am I doing wrong? Thanks in advance for
any help you can give me.
Cheers,
Fernando
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general