Hello everyone,
I'm starting to use scikit-learn for some NLP. I've already used some
classifiers from NLTK, and now I want to try the ones implemented in
scikit-learn.
My data is basically sentences, and I extract features from certain words in
those sentences to do a classification task. Most of my features are
nominal: the part-of-speech (POS) tag of the word, the word to the left and
its POS tag, the word to the right and its POS tag, the syntactic-relation
path from one word to another, etc.
When I ran some experiments with the NLTK classifiers (decision tree,
naive Bayes), the feature set was just a dictionary mapping each feature
name to its nominal value, e.g. [ {"postag":"noun", "wleft":"house",
"path":"VPNPNP",...}, ... ]. I just had to pass this to the classifiers
and they did their job.
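For reference, a minimal sketch of the NLTK setup (the feature values and
labels here are made up for illustration, not my real data):

```python
import nltk

# Toy labeled featuresets in the format NLTK classifiers accept:
# a list of (feature_dict, label) pairs with nominal string values.
train = [
    ({"postag": "noun", "wleft": "house", "path": "VPNPNP"}, "ARG0"),
    ({"postag": "verb", "wleft": "the", "path": "NPVP"}, "ARG1"),
]

# NLTK consumes the dicts directly; no numeric encoding is needed.
classifier = nltk.NaiveBayesClassifier.train(train)
label = classifier.classify({"postag": "noun", "wleft": "house", "path": "VPNPNP"})
```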
Now I want to try out the classifiers in the scikit-learn package. As I
understand it, this type of feature set is not acceptable to the algorithms
implemented in sklearn, since all feature values must be numeric and they
have to be in an array or matrix. Therefore, I transformed the "original"
feature sets using the DictVectorizer class. However, when I pass the
transformed vectors, I get the following errors:
# With DecisionTreeClassifier
Traceback (most recent call last):
.....
self.classifier.fit(train_argcands_feats,new_train_argcands_target)
File "/usr/local/lib/python2.7/dist-packages/sklearn/tree/tree.py", line
458, in fit
X = np.asarray(X, dtype=DTYPE, order='F')
File "/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.py", line
235, in asarray
return array(a, dtype, copy=False, order=order)
TypeError: float() argument must be a string or a number
# With GaussianNB
Traceback (most recent call last):
....
self.classifier.fit(train_argcands_feats,new_train_argcands_target)
File "/usr/local/lib/python2.7/dist-packages/sklearn/naive_bayes.py",
line 156, in fit
n_samples, n_features = X.shape
ValueError: need more than 0 values to unpack
I get these errors when I use DictVectorizer() with its default settings.
However, if I use DictVectorizer(sparse=False), I get an error even before
the code reaches the training part:
Traceback (most recent call last):
train_argcands_feats = self.feat_vectorizer.fit_transform(train_argcands_feats)
File
"/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/dict_vectorizer.py",
line 123, in fit_transform
return self.transform(X)
File
"/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/dict_vectorizer.py",
line 212, in transform
Xa = np.zeros((len(X), len(vocab)), dtype=dtype)
ValueError: array is too big.
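To make the setup concrete, here is a minimal, self-contained sketch of the
vectorization step with toy feature dicts (my real data is of course much
larger, and the names are simplified):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Toy feature dicts standing in for my real nominal features.
train_feats = [
    {"postag": "noun", "wleft": "house", "path": "VPNPNP"},
    {"postag": "verb", "wleft": "the", "path": "NPVP"},
]
train_target = [0, 1]

# DictVectorizer creates one binary column per feature=value pair.
vec = DictVectorizer(sparse=False)  # dense numpy output
X = vec.fit_transform(train_feats)  # shape (2, 6): 6 distinct pairs here

clf = DecisionTreeClassifier()
clf.fit(X, train_target)
```

On this tiny example everything runs fine for me; the errors above only show
up with the full data set.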
What could be causing this? What am I doing wrong? Thanks in advance for
any help you can give me.
Cheers,
Fernando
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general