[Scikit-learn-general] Imbalanced dataset

ChungHung Liu Sat, 14 Sep 2013 03:49:50 -0700

I have an imbalanced dataset: the minority class has around 0.3k samples and 
the majority class around 15k. I have read that down-sampling or over-sampling 
can be applied to this kind of problem. In my tests, with down-sampling the 
dataset has to be reduced to around 700 rows before the confusion matrix 
looks OK. Although the result looks OK, the remaining dataset is too small.
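
For reference, the down-sampling step described above can be sketched roughly 
as follows. This is only an illustration on synthetic data, not the code used 
for the results in this message: the helper name `downsample_majority`, the 
0/1 labels, the feature count and the split of the 700 rows into 300 minority 
plus 400 majority rows are all assumptions.

```python
import numpy as np

def downsample_majority(X, y, majority_label, n_keep, seed=0):
    """Keep n_keep randomly chosen majority rows plus all other rows."""
    rng = np.random.default_rng(seed)
    maj_idx = np.flatnonzero(y == majority_label)
    other_idx = np.flatnonzero(y != majority_label)
    # Sample majority rows without replacement, then shuffle the result.
    kept = rng.choice(maj_idx, size=n_keep, replace=False)
    idx = rng.permutation(np.concatenate([kept, other_idx]))
    return X[idx], y[idx]

# Synthetic stand-in data with roughly the class sizes from the message
# (label 1 = minority, label 0 = majority; both labels are assumptions).
rng = np.random.default_rng(42)
X = rng.normal(size=(15300, 4))
y = np.array([0] * 15000 + [1] * 300)

X_small, y_small = downsample_majority(X, y, majority_label=0, n_keep=400)
print(X_small.shape)
```

Sampling without replacement keeps every minority row and discards only 
majority rows, which is why the total size drops so sharply.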


confusion matrix:
preds     A    B
actual
A         8   79
B        73   15
 
However, with over-sampling (replicating the minority class), no matter how 
far the minority class is over-sampled, e.g. to 11%, 30% or 50% (where 
percentage = # minority rows / total rows in the dataset), the confusion 
matrix does not look good (many samples are misclassified).

confusion matrix:
preds      A     B
actual
A         50  2707
B       2549    44
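
To make the percentages above concrete: if the majority class has M rows and 
the minority class should make up a fraction p of all rows, the minority class 
needs about p * M / (1 - p) rows in total. The sketch below applies that to 
over-sampling by replication. The helper names, the 0/1 labels and the 
synthetic data are assumptions for illustration, not the code behind the 
results above.

```python
import numpy as np

def minority_rows_needed(n_majority, target_fraction):
    """Minority rows so that minority / (minority + majority) is about
    target_fraction."""
    return int(round(target_fraction * n_majority / (1.0 - target_fraction)))

def oversample_minority(X, y, minority_label, target_fraction, seed=0):
    """Replicate randomly chosen minority rows up to the target fraction."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    n_target = minority_rows_needed(len(maj_idx), target_fraction)
    n_extra = max(n_target - len(min_idx), 0)
    # Replication means sampling existing minority rows with replacement.
    extra = rng.choice(min_idx, size=n_extra, replace=True)
    idx = rng.permutation(np.concatenate([maj_idx, min_idx, extra]))
    return X[idx], y[idx]

# Synthetic stand-in data (label 1 = minority, label 0 = majority).
rng = np.random.default_rng(42)
X = rng.normal(size=(15300, 4))
y = np.array([0] * 15000 + [1] * 300)

for p in (0.11, 0.30, 0.50):
    print(p, minority_rows_needed(15000, p))

X_over, y_over = oversample_minority(X, y, minority_label=1, target_fraction=0.5)
```

With about 15k majority rows, reaching 50% means replicating the roughly 300 
minority rows up to about 15,000, so each original row appears around 50 
times on average.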

Then I found that the code posted at 
http://comments.gmane.org/gmane.comp.python.scikit-learn/5278 can be used to 
perform sampling with SMOTE. But the result is similar to over-sampling (many 
samples are misclassified). The sampling is done as follows:
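    # simplified steps; SMOTE is the function from the post linked above
    import numpy as np
    from sklearn.utils import shuffle
    # keep a random 70% of the majority class
    n_keep = int(len(majority_samples) * 70 / 100)
    down_sampled_majority_samples = shuffle(majority_samples)[:n_keep]
    # percentages of 100, 2*100 and 5*100 were also tested; results are similar
    synthetic_minority = SMOTE(minority_samples, 12*100, 5)
    train_data = np.vstack([synthetic_minority, minority_samples,
                            down_sampled_majority_samples])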
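
For readers without the linked code at hand, the idea behind a call such as 
SMOTE(minority_samples, 12*100, 5) can be sketched as below. This is not the 
implementation from the linked post: `smote_sketch` is a hypothetical 
stand-in that assumes the same argument order (samples, percentage in 
multiples of 100, number of neighbours) and uses a brute-force neighbour 
search in NumPy. For larger data, sklearn.neighbors.NearestNeighbors would be 
the more natural choice.

```python
import numpy as np

def smote_sketch(T, N, k, seed=0):
    """Return (N / 100) * len(T) synthetic rows, each interpolated between a
    minority row and one of its k nearest minority neighbours."""
    if N < 100 or N % 100 != 0:
        raise ValueError("this sketch assumes N is a positive multiple of 100")
    rng = np.random.default_rng(seed)
    T = np.asarray(T, dtype=float)
    per_row = N // 100

    # Brute-force squared distances; the diagonal is set to infinity so a
    # row is never chosen as its own neighbour.
    dist = ((T[:, None, :] - T[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    synthetic = np.empty((len(T) * per_row, T.shape[1]))
    row = 0
    for i in range(len(T)):
        for _ in range(per_row):
            j = rng.choice(neighbours[i])   # pick one of the k neighbours
            gap = rng.random()              # position along the segment i -> j
            synthetic[row] = T[i] + gap * (T[j] - T[i])
            row += 1
    return synthetic

# Synthetic stand-in for the minority class (300 rows, 4 features assumed).
rng = np.random.default_rng(42)
minority = rng.normal(size=(300, 4))
synthetic = smote_sketch(minority, 12 * 100, 5)
print(synthetic.shape)
```

With roughly 300 minority rows, a percentage of 12*100 yields about 3,600 
synthetic rows. Together with the original minority rows and 70% of a 15k 
majority class (10,500 rows), the minority share of the training data comes 
to about 27%.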

In general, what procedure should be followed, and what should one pay 
attention to, when working with an imbalanced dataset?

Thanks for any advice.

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
