OK, so I assume you are doing sentiment classification?

For millions of examples I definitely recommend using either naive
Bayes or SGDClassifier. I'd start with a Bernoulli NB as a
baseline.
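
A minimal sketch of such a baseline, assuming a current scikit-learn
where BernoulliNB, CountVectorizer and make_pipeline are available. The
four toy tweets and their labels are made up for illustration and only
stand in for the real corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data (1 = positive, 0 = negative), not the real tweets.
tweets = [
    "i love this phone",
    "what a great day",
    "i hate this phone",
    "what a terrible day",
]
labels = [1, 1, 0, 0]

# binary=True records only whether a word occurs, which is what the
# Bernoulli event model expects.
baseline = make_pipeline(CountVectorizer(binary=True), BernoulliNB())
baseline.fit(tweets, labels)

predictions = baseline.predict(["love this great day", "hate this terrible day"])
```

Any later model (SGDClassifier, bigrams, ...) can then be judged against
the accuracy this pipeline reaches on held-out data.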

Personally, I hardly ever use IDF weighting for sentiment classification;
words with low document frequency are usually proper nouns, which are
not that indicative of sentiment. Furthermore, typos have low document
frequency too, so IDF ends up boosting them as well.
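
To make the IDF point concrete, here is a small sketch using
TfidfVectorizer with and without use_idf. The three documents are
invented; norm=None is set only so the raw weights are easy to compare.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents: "good" occurs everywhere, "spielberg" only once.
docs = ["good movie", "good plot", "good acting by spielberg"]

with_idf = TfidfVectorizer(use_idf=True, norm=None)
without_idf = TfidfVectorizer(use_idf=False, norm=None)

X_idf = with_idf.fit_transform(docs)
X_tf = without_idf.fit_transform(docs)

# Weights of the two words in the third document.
idf_good = X_idf[2, with_idf.vocabulary_["good"]]
idf_rare = X_idf[2, with_idf.vocabulary_["spielberg"]]
tf_good = X_tf[2, without_idf.vocabulary_["good"]]
tf_rare = X_tf[2, without_idf.vocabulary_["spielberg"]]
```

With IDF enabled the rare proper noun gets a larger weight than the
sentiment word "good"; with use_idf=False both are weighted by their
plain term frequency.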

I strongly recommend further token normalization (contractions,
negations, smileys, repeated chars), which allows you to tackle the
problem of data sparseness (how many features do you have?).
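
A rough sketch of what such a normalization step could look like in
plain Python. The lookup tables, the placeholder tokens (smile_pos,
smile_neg) and the "not_" prefix are illustrative assumptions; a real
system would need much larger tables and more careful negation handling.

```python
import re

# Deliberately tiny, hypothetical lookup tables.
CONTRACTIONS = {"can't": "can not", "won't": "will not",
                "don't": "do not", "isn't": "is not"}
SMILEYS = {":-)": "smile_pos", ":)": "smile_pos",
           ":-(": "smile_neg", ":(": "smile_neg"}

def normalize(text):
    text = text.lower()
    # Map smileys to placeholder tokens so they survive tokenization.
    for smiley, token in SMILEYS.items():
        text = text.replace(smiley, " " + token + " ")
    # Expand contractions so that negations become explicit.
    for short, expanded in CONTRACTIONS.items():
        text = text.replace(short, expanded)
    # Collapse letters repeated three or more times down to two.
    text = re.sub(r"([a-z])\1{2,}", r"\1\1", text)
    # Glue "not" to the following word to form a single negated feature.
    text = re.sub(r"\bnot (\w+)", r"not_\1", text)
    # Squeeze the whitespace introduced above.
    return " ".join(text.split())

example = normalize("I can't believe it, sooooo goooood :)")
```

Variants such as "goooood" and "gooood" now map to the same feature,
which is exactly what reduces the sparseness.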

For sentiment classification, don't be too aggressive on punctuation
(e.g. repeated ! and ? are valuable indicators).
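
One possible way to keep that signal is a tokenizer that turns runs of
! and ? into tokens of their own instead of dropping them. The regular
expression and the bucket names below (excl, quest, mixed, multi_...)
are assumptions made up for this sketch.

```python
import re

# A token is either a word (letters, digits, underscore, apostrophe)
# or an uninterrupted run of ! and ? characters.
TOKEN_RE = re.compile(r"[a-z0-9_']+|[!?]+")

def tokenize(text):
    tokens = []
    for tok in TOKEN_RE.findall(text.lower()):
        if tok[0] in "!?":
            # Bucket punctuation runs so "!!!" and "!!!!" share a feature.
            if len(set(tok)) > 1:
                kind = "mixed"
            elif tok[0] == "!":
                kind = "excl"
            else:
                kind = "quest"
            tok = ("multi_" if len(tok) > 1 else "") + kind
        tokens.append(tok)
    return tokens

tokens = tokenize("What?! This is great!!!")
```

A function like this can be passed to CountVectorizer through its
tokenizer argument, which replaces the default token pattern that
discards punctuation.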

To further improve performance you can try SGDClassifier and tune
alpha via grid search (usually a coarse search will do). You don't
need more than a handful of epochs for a dataset of this size.
Personally, I prefer the modified Huber loss over the hinge loss
(the default), but that's more of a subjective choice.
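
A sketch of that setup, assuming a current scikit-learn where the number
of epochs is set through max_iter and grid search lives in
sklearn.model_selection. The toy tweets, the alpha grid and cv=3 are
placeholders chosen only so the example runs quickly.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the tweets (1 = positive).
positive = ["i love this", "so happy today", "great game tonight",
            "love the new phone", "what a great day", "happy with the service"]
negative = ["i hate this", "so sad today", "terrible game tonight",
            "hate the new phone", "what a terrible day", "sad about the service"]
tweets = positive + negative
labels = [1] * len(positive) + [0] * len(negative)

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    # A handful of epochs; tol=None makes SGD run exactly max_iter passes.
    ("clf", SGDClassifier(loss="modified_huber", max_iter=5, tol=None,
                          random_state=0)),
])

# Coarse grid: one value per order of magnitude.
alphas = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2]
search = GridSearchCV(pipeline, {"clf__alpha": alphas}, cv=3)
search.fit(tweets, labels)

best_alpha = search.best_params_["clf__alpha"]
```

On the real data it is probably enough to run the search on a subsample
and refit the best setting on everything, since only one parameter is
being tuned.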

As Olivier suggested, bigrams may help but they make the data
sparseness problem even worse - so try to counter with more aggressive
regularization.
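
To illustrate the trade-off, the sketch below compares the vocabulary
size with and without bigrams using CountVectorizer's ngram_range. The
three example tweets are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical tweets; note that "not good" and "not bad" differ only
# as bigrams.
tweets = ["not good at all", "really good stuff", "not bad at all"]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit(tweets)
with_bigrams = CountVectorizer(ngram_range=(1, 2)).fit(tweets)

n_unigram_features = len(unigrams.vocabulary_)
n_bigram_features = len(with_bigrams.vocabulary_)
has_negated_bigram = "not good" in with_bigrams.vocabulary_
```

The bigram vocabulary captures "not good" as a feature of its own, but
it is considerably larger. With SGDClassifier, more aggressive
regularization means a larger alpha, so it is worth extending the grid
upwards when bigrams are switched on.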

Hope this helps,

Peter

2012/2/2 adnan rajper <[email protected]>:
> Hi Peter,
>
> number of samples: 1 million tweets
> number of features: I use the bag-of-words model; in fact, I have followed
> this example:
>  http://scikit-learn.github.com/scikit-learn-tutorial/working_with_text_data.html.
> It uses TF-IDF normalization.
> class distribution: equal number of positive and negative tweets
> features: I removed the stop words, punctuation, URLs and user names.
>
> Adnan
> ________________________________
> From: Peter Prettenhofer <[email protected]>
> To: [email protected]
> Sent: Thursday, February 2, 2012 2:20 PM
> Subject: Re: [Scikit-learn-general] Improving the accuracy of classifier
>
> Hi Adnan,
>
> can you give us some more specific information about your learning
> task / dataset including:
>
> - number of samples
>
> - number of features
>
> - class distribution
>
> - features (normalization, preprocessing)
>
> best,
> Peter
>
> 2012/2/2 adnan rajper <[email protected]>:
>> hi everybody,
>>
>> I am using the multinomial and LinearSVC classifiers with default
>> parameters to classify Twitter messages into two classes (positive or
>> negative). I followed the tutorial at
>> http://scikit-learn.github.com/scikit-learn-tutorial/working_with_text_data.html.
>> I tried "parameter tuning using grid search", but it gets too slow. Both
>> classifiers (multinomial and LinearSVC) give 75% accuracy. My problem is
>> that I want to improve the accuracy; for instance, I want to make it more
>> than 80%. Is there any way to do it through scikit?
>>
>>
>> thanks
>> Adnan
>>
>>
>> ------------------------------------------------------------------------------
>> Keep Your Developer Skills Current with LearnDevNow!
>> The most comprehensive online learning library for Microsoft developers
>> is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
>> Metro Style Apps, more. Free future releases when you subscribe now!
>> http://p.sf.net/sfu/learndevnow-d2d
>> _______________________________________________
>> Scikit-learn-general mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>
>
>
> --
> Peter Prettenhofer



-- 
Peter Prettenhofer

