Re: [Scikit-learn-general] Regarding content classification using HashingVectorizer

Kartik Kumar Perisetla Thu, 24 Jul 2014 07:45:08 -0700

I actually used part of the text of one Wikipedia article that was used in
training. I was expecting it to predict the category for which that article
served as a training instance, but it predicted some other category, so I
concluded that the prediction was not accurate.
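
On the question further down the thread of whether a hash is computed for the
entire training instance: the hashing trick is applied per token, not per
document. Below is a minimal sketch of that idea using only the standard
library. It is not scikit-learn's implementation: zlib.crc32 stands in for the
MurmurHash3 function that HashingVectorizer uses, the tokenizer is simplified,
and the names tokenize, hash_features, article and excerpt are made up for
illustration.

```python
import re
import zlib

N_FEATURES = 2 ** 20  # size of the hashed feature space (illustrative choice)


def tokenize(text):
    """Lowercase the text and keep word tokens of two or more characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def hash_features(text, n_features=N_FEATURES):
    """Map a document to {column index: token count} via the hashing trick."""
    features = {}
    for token in tokenize(text):
        # Each token is hashed on its own; crc32 is only a stand-in hash.
        column = zlib.crc32(token.encode("utf-8")) % n_features
        features[column] = features.get(column, 0) + 1
    return features


# Hypothetical stand-ins for a training article and an excerpt taken from it.
article = "Management is the coordination of people and resources to reach shared goals."
excerpt = "coordination of people and resources"

article_features = hash_features(article)
excerpt_features = hash_features(excerpt)

# Every column set by the excerpt is also set by the article it came from.
print(set(excerpt_features) <= set(article_features))
```

Because the columns come from individual tokens, an excerpt shares features
with the article it was cut from, but its vector still differs from the full
article's (fewer tokens, different counts), so the predicted category can
differ as well.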


Please correct my understanding if it's wrong here.

Thanks,
Kartik


On Thu, Jul 24, 2014 at 2:21 PM, Eustache DIEMERT <eusta...@diemert.fr>
wrote:

> > But when I test the prediction for a new sentence or text, it gives
> wrong prediction.
>
> How do you measure that?
>
> Having a few badly classified instances does not necessarily mean the
> learning has failed.
>
> A good classification accuracy for text classification is typically > 80%,
> what is yours ?
>
> Also, HashingVectorizer is not really involved in classification accuracy
> here - IMHO.
>
> The main factor would probably be how close your new examples are to the
> training set. E.g. in the out-of-core example we keep the first 1000
> instances for testing. If you just ask for predictions on texts taken from
> other sources, the classification would probably be worse...
>
> HTH
>
> Eustache
>
>
> 2014-07-24 4:35 GMT+02:00 Kartik Kumar Perisetla <kartik.p...@gmail.com>:
>
>> Hello,
>>
>> I am creating a content classifier with scikit-learn using
>> HashingVectorizer (using this as a reference:
>> http://scikit-learn.org/dev/auto_examples/applications/plot_out_of_core_classification.html
>> ).
>>
>> The training dataset I am using is Wikipedia. For example, for the
>> "management" category I am training it with a few articles related to
>> management, i.e. an entire article related to management is one training
>> instance.
>>
>> I trained it with 50 categories and a total of ~4000 training instances.
>> But when I test the prediction for a new sentence or text, it gives the
>> wrong prediction.
>>
>> So my question is: do I need to break each Wikipedia article into sentences
>> and label each sentence with the category name to make it work correctly?
>> Since I am using HashingVectorizer, my intuition is that it creates a hash
>> for the entire training instance and not for the tokens in it. Is that true?
>>
>> Also, could someone please shed some light on how HashingVectorizer
>> works?
>>
>> Thanks,
>> Kartik
>>
>> --
>> Regards,
>>
>> Kartik Perisetla
>>
>>
>> ------------------------------------------------------------------------------
>> Want fast and easy access to all the code in your enterprise? Index and
>> search up to 200,000 lines of code with a free copy of Black Duck
>> Code Sight - the same software that powers the world's largest code
>> search on Ohloh, the Black Duck Open Hub! Try it now.
>> http://p.sf.net/sfu/bds
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>>
>
>
>
>
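
On the question in the quoted reply of how the predictions are measured: a
single misclassified text says little, so one option is to hold out labeled
instances and compute accuracy over all of them. Below is a minimal sketch
under stated assumptions: the texts and labels are a made-up toy corpus, not
the Wikipedia data, and SGDClassifier is just one linear model that supports
partial_fit; the thread does not say which classifier was actually used.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

# Hypothetical toy corpus; the real data would be the ~4000 labeled articles.
train_texts = [
    "managers plan budgets and lead teams",
    "leadership and strategy guide company management",
    "the team scored a goal in the football match",
    "players train hard before the football season",
]
train_labels = ["management", "management", "sports", "sports"]

# Held-out instances that the classifier never sees during training.
test_texts = [
    "a manager leads the company strategy",
    "the football players scored",
]
test_labels = ["management", "sports"]

# HashingVectorizer is stateless, so transform() can be called without fit().
vectorizer = HashingVectorizer()

# Classifier choice is an assumption; partial_fit needs the full class list.
clf = SGDClassifier(random_state=0)
clf.partial_fit(
    vectorizer.transform(train_texts),
    train_labels,
    classes=["management", "sports"],
)

predicted = clf.predict(vectorizer.transform(test_texts))
accuracy = accuracy_score(test_labels, predicted)
print("held-out accuracy: %.2f" % accuracy)
```

With real data, the held-out set should come from the same source as the
training set and be large enough (e.g. the 1000 instances kept aside in the
out-of-core example) for the accuracy figure to be meaningful.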


-- 
Regards,

Kartik Perisetla

