Re: [Scikit-learn-general] Regarding content classification using HashingVectorizer

Eustache DIEMERT Thu, 24 Jul 2014 01:52:18 -0700

> But when I test the prediction for a new sentence or text, it gives wrong
> prediction.


How do you measure that?

Having a few badly classified instances does not necessarily mean that
learning has failed.

Good accuracy for text classification is typically > 80% on held-out data.
What is yours?
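To make "how do you measure that" concrete, here is a minimal sketch of scoring a classifier on held-out instances. The toy corpus, the 25% split and the choice of SGDClassifier are my assumptions for illustration, not your setup, and it assumes a recent scikit-learn release where train_test_split lives in sklearn.model_selection.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus standing in for the labelled Wikipedia articles.
texts = [
    "managers plan budgets and lead project teams",
    "the board approved the new management strategy",
    "leadership training improves team management",
    "the company hired a new operations manager",
    "the striker scored two goals in the final",
    "the team won the football championship",
    "the goalkeeper saved a penalty kick",
    "fans cheered as the match went to extra time",
]
labels = ["management"] * 4 + ["sports"] * 4

# Hold out part of the labelled data; it is never shown to the classifier.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# HashingVectorizer is stateless, so transform() works without fit().
vectorizer = HashingVectorizer(n_features=2**18)
X_train = vectorizer.transform(train_texts)
X_test = vectorizer.transform(test_texts)

clf = SGDClassifier(random_state=0)  # assumed classifier, any other would do
clf.fit(X_train, y_train)

# Accuracy = fraction of held-out instances whose label was predicted correctly.
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("held-out accuracy: %.2f" % accuracy)
```

With only a handful of documents the number itself means little. The point is that the score comes from many held-out instances rather than from one or two hand-picked sentences.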

Also, IMHO, HashingVectorizer is not really what drives classification
accuracy here.
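On the question of whether the whole training instance gets a single hash: it does not, each token is hashed to its own column. A small sketch to check this yourself; the two sentences are made up, and alternate_sign=False and norm=None are set only to keep the output easy to read (parameter names as in recent scikit-learn releases).

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Plain counts, no sign flipping and no normalisation, for readability only.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False, norm=None)

# Two made-up documents that share the token "team".
doc_a = "management of a project team"
doc_b = "the team plays football"
X = vectorizer.transform([doc_a, doc_b])

# One row per document, one column per hash bucket.
print(X.shape)

# Each document fills several columns, one per hashed token,
# rather than a single column for the whole text.
print("non-zero columns in doc_a:", X[0].nnz)
print("non-zero columns in doc_b:", X[1].nnz)

# A token shared by both documents lands in the same column in both rows.
shared = set(X[0].indices) & set(X[1].indices)
print("columns shared by both documents:", len(shared))
```

So a full article and a single sentence are represented in the same token-level feature space. Splitting articles into sentences is not required for the vectorizer to work.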

The main factor is probably how close your new examples are to the
training set. E.g. in the out-of-core example we keep the first 1000
instances for testing, so the test data comes from the same source as the
training data. If you just ask for predictions on texts taken from other
sources, the classification will probably be worse...

HTH

Eustache


2014-07-24 4:35 GMT+02:00 Kartik Kumar Perisetla <kartik.p...@gmail.com>:

> Hello,
>
> I am creating a content classifier using scikit-learn through
> HashingVectorizer (using this as reference:
> http://scikit-learn.org/dev/auto_examples/applications/plot_out_of_core_classification.html
> ).
>
> The training dataset I am using is Wikipedia. For example, for the
> "management" category I am training it with a few articles related to
> management, i.e. an entire article is one training instance.
>
> I did training with 50 categories and total of ~4000 training instances.
> But when I test the prediction for a new sentence or text, it gives wrong
> prediction.
>
> So my question is: do I need to break each Wikipedia article into
> sentences and label each sentence with the category name to make it work
> correctly? Since I am using HashingVectorizer, my intuition is that it
> creates a hash for the entire training instance and not for the tokens in
> it. Is that true?
>
> Also, could someone please shed some light on how HashingVectorizer works?
>
> Thanks,
> Kartik
>
> --
> Regards,
>
> Kartik Perisetla
>
>


_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
