Hello,
I am building a content classifier with scikit-learn using
HashingVectorizer, with this example as my reference:
http://scikit-learn.org/dev/auto_examples/applications/plot_out_of_core_classification.html
My training data comes from Wikipedia. For example, for the "management"
category I train on a few articles related to management, where each
entire article is one training instance. I trained 50 categories with a
total of ~4,000 training instances.
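For reference, here is a minimal sketch of what I am doing. The articles,
labels, and category list below are placeholders for my real data loader,
and the SGDClassifier/partial_fit setup follows the linked example:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Placeholder data: one full Wikipedia article per training instance,
# one category label per article (my real loader pulls ~4,000 of these).
articles = [
    "Management is the administration of an organization ...",
    "A computer network is a set of computers sharing resources ...",
]
labels = ["management", "networking"]
all_categories = ["management", "networking"]  # 50 categories in my real setup

vectorizer = HashingVectorizer()    # stateless, so no fit step is needed
X = vectorizer.transform(articles)  # one sparse row per article

clf = SGDClassifier()
clf.partial_fit(X, labels, classes=all_categories)

# Prediction on a short new sentence:
print(clf.predict(vectorizer.transform(["How do managers plan budgets?"])))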
But when I test the prediction on a new sentence or short piece of text,
it gives wrong predictions.
So my question is: do I need to break each Wikipedia article into
sentences and label every sentence with the category name to make this
work correctly? Since I am using HashingVectorizer, my intuition is that
it creates a single hash for the entire training instance rather than for
the tokens in it. Is that true?
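To test that intuition, I ran a quick check along these lines (the tiny
n_features value is just to make the output easy to inspect):

from sklearn.feature_extraction.text import HashingVectorizer

vec = HashingVectorizer(n_features=2**10)  # small feature space, inspection only
X = vec.transform(["management of projects", "management of networks"])

# If tokens are hashed individually, the two rows should share nonzero
# columns for the words they have in common ("management", "of").
shared = set(X[0].nonzero()[1]) & set(X[1].nonzero()[1])
print(shared)

The two rows do seem to share nonzero feature indices for the common
words, which suggests individual tokens are being hashed, but I would
appreciate confirmation.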
Also, could someone please shed some light on how HashingVectorizer works?
Thanks,
Kartik
--
Regards,
Kartik Perisetla