Re: [Scikit-learn-general] Regarding content classification using HashingVectorizer

Olivier Grisel Thu, 24 Jul 2014 15:32:30 -0700

2014-07-24 16:43 GMT+02:00 Kartik Kumar Perisetla <kartik.p...@gmail.com>:
> I actually used part of text of one wikipedia article which was used in
> training. I was expecting it to detect the category for which it was used as
> training instance. But it predicted as some other category and thus I
> thought it did not give accurate prediction.
>
> Please correct my understanding if it's wrong here.


Models can underfit, that is, fail to give perfect predictions even on
the training set.

For text classification, as for other tasks, underfitting can be caused
by problems at the feature extraction level, inadequate model parameter
settings (e.g. the strength of the model regularization), an inadequate
model class, or label noise (bad quality of the class labels themselves).
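
To make the regularization point concrete, here is a minimal sketch. The
toy corpus, the labels and the two values of C are made up for
illustration, and LogisticRegression is only one possible linear model,
not necessarily the one used in the original question. It fits the same
hashed features twice and scores each model on its own training set:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical, deliberately unbalanced toy corpus (6 sports, 2 science).
docs = [
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the goalkeeper",
    "fans cheered at the stadium",
    "the league season starts in august",
    "the referee stopped the game",
    "the telescope observed a distant galaxy",
    "physicists measured the particle mass",
]
labels = ["sports"] * 6 + ["science"] * 2

# Stateless hashing of tokens into a fixed-size sparse feature space.
X = HashingVectorizer(n_features=2 ** 10).transform(docs)

# In LogisticRegression a small C means strong regularization.
scores = {}
for C in (1e-6, 1e4):
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, labels)
    scores[C] = clf.score(X, labels)  # accuracy on the training set itself
    print("C=%g  training accuracy=%.2f" % (C, scores[C]))
```

With very strong regularization the weights are shrunk towards zero and
the model tends to fall back on the majority class, so it can misclassify
documents it was trained on. That is the kind of behaviour described in
the question.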

A good way to understand model underfitting and overfitting (in
relation to the training set size) is to plot learning curves, both
for the score on the training set and on the validation set, see for
instance:

http://scikit-learn.org/stable/auto_examples/plot_learning_curve.html
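
A rough sketch of how the numbers behind such a plot can be computed,
assuming a recent scikit-learn where learning_curve lives in
sklearn.model_selection (older releases exposed it from a different
module). The synthetic dataset and the choice of estimator are
placeholders, not taken from the original question:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for a real feature matrix and label vector.
X, y = make_classification(n_samples=600, n_features=20,
                           n_informative=5, random_state=0)

# Fit on growing subsets of the training folds; score each fitted model
# on the subset it was trained on and on the held-out validation fold.
train_sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(C=1.0, max_iter=1000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

# Average over the cross-validation folds for each training set size.
mean_train = train_scores.mean(axis=1)
mean_valid = valid_scores.mean(axis=1)
for n, tr, va in zip(train_sizes, mean_train, mean_valid):
    print("%4d samples: train=%.3f  validation=%.3f" % (int(n), tr, va))
```

If both curves plateau at a low score, the model is underfitting. A large
gap between a high training score and a lower validation score points to
overfitting instead.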

-- 
Olivier

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
