2012/7/20 Andreas Müller <[email protected]>:
> Hi Sicco.
Indeed, hi, and nice to see you've picked scikit-learn :)
> This is desired behavior.
Then again, we could introduce a min_classes parameter to set the
minimum number of labels to return. That is commonly what you want
when predicting multiple tags (think Stack Overflow questions, where
at least one tag is required).
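That could already be hacked up outside the estimator, by the way; a
rough sketch (predict_at_least and its min_classes argument are made
up here, not part of the scikit-learn API; clf is a fitted
OneVsRestClassifier as in the code further down):

import numpy as np

def predict_at_least(clf, X, min_classes=1):
    # Hypothetical helper: keep the classes whose decision value is
    # positive, and if fewer than min_classes survive, fall back to
    # the top-scoring classes instead.
    classes = clf.label_binarizer_.classes_
    scores = np.column_stack([np.ravel(e.decision_function(X))
                              for e in clf.estimators_])
    predictions = []
    for row in scores:
        picked = classes[row > 0]
        if len(picked) < min_classes:
            picked = classes[np.argsort(row)[::-1][:min_classes]]
        predictions.append(list(picked))
    return predictions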
> If you want to always get a label, you could have a look at the
> decision_function
> and just predict the label with the highest score if no label was predicted.
In some more detail, you can find out which class gets the highest
score for a sample vector x using
clf.label_binarizer_.classes_[numpy.argmax(
    [e.decision_function(x) for e in clf.estimators_])]
This is arguably a hack; the OvR estimator is a bit rough around the
edges. It doesn't play well with the Pipeline either, since you have
to vectorize the document yourself. Without a Pipeline, the training
procedure would be
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# TfidfVectorizer (just Vectorizer in older versions) combines
# CountVectorizer and TfidfTransformer
vect = TfidfVectorizer()
clf = OneVsRestClassifier(LinearSVC())
X = vect.fit_transform(train_txt)
clf.fit(X, train_labels)
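Here train_txt is just a list of raw documents, and for the multilabel
case train_labels is a sequence of label sequences (a list or tuple of
labels per document), e.g. (made-up examples)

train_txt = ["how do I parse XML in Python?", "fast sparse matrix products"]
train_labels = [("python", "xml"), ("numpy", "performance")]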
And prediction would become (showing the procedure for one document at
a time now)
x = vect.transform([one_document])
[labels] = clf.predict(x)
if len(labels) == 0:
    # apply the trick I described above
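    # (a sketch: fall back to the single best-scoring class,
    # with numpy imported as usual)
    scores = [e.decision_function(x) for e in clf.estimators_]
    labels = [clf.label_binarizer_.classes_[numpy.argmax(scores)]]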
Good luck,
Lars
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam