Hello,
I am building a content classifier with scikit-learn using
HashingVectorizer, with this example as my reference:
http://scikit-learn.org/dev/auto_examples/applications/plot_out_of_core_classification.html
My training data comes from Wikipedia. For example, for the "management"
category I train on a few articles related to management, where each
entire article is one training instance. I trained 50 categories with a
total of ~4,000 training instances.
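For reference, here is a minimal sketch of what I am doing. The articles,
labels, and category list below are placeholders for my real data loader,
and the SGDClassifier/partial_fit setup follows the linked example:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Placeholder data: one full Wikipedia article per training instance,
# one category label per article (my real loader pulls ~4,000 of these).
articles = [
    "Management is the administration of an organization ...",
    "A computer network is a set of computers sharing resources ...",
]
labels = ["management", "networking"]
all_categories = ["management", "networking"]  # 50 categories in my real setup

vectorizer = HashingVectorizer()    # stateless, so no fit step is needed
X = vectorizer.transform(articles)  # one sparse row per article

clf = SGDClassifier()
clf.partial_fit(X, labels, classes=all_categories)

# Prediction on a short new sentence:
print(clf.predict(vectorizer.transform(["How do managers plan budgets?"])))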
But when I test the prediction on a new sentence or short piece of text,
it gives wrong predictions.
So my question is: do I need to break each Wikipedia article into
sentences and label every sentence with the category name to make this
work correctly? Since I am using HashingVectorizer, my intuition is that
it creates a single hash for the entire training instance rather than for
the tokens in it. Is that true?
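To test that intuition, I ran a quick check along these lines (the tiny
n_features value is just to make the output easy to inspect):

from sklearn.feature_extraction.text import HashingVectorizer

vec = HashingVectorizer(n_features=2**10)  # small feature space, inspection only
X = vec.transform(["management of projects", "management of networks"])

# If tokens are hashed individually, the two rows should share nonzero
# columns for the words they have in common ("management", "of").
shared = set(X[0].nonzero()[1]) & set(X[1].nonzero()[1])
print(shared)

The two rows do seem to share nonzero feature indices for the common
words, which suggests individual tokens are being hashed, but I would
appreciate confirmation.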
Also, could someone please shed some light on how HashingVectorizer works?
Thanks,
Kartik
--
Regards,
Kartik Perisetla