I have an imbalanced dataset problem, with the minority class at around 0.3k samples and the majority class at around 15k. I have read that down sampling or over sampling can be applied to such problems. In my tests with down sampling, the dataset had to be reduced to around 700 samples before the confusion matrix looked OK. Although the result looks OK, the resulting dataset is too small.
confusion matrix:

              preds
              A      B
  actual A    8     79
         B   73     15

However, with over sampling (replicating the minority class), no matter how much the minority class is over sampled, e.g. to 11%, 30% or 50% (where percentage = # minority rows / total rows in the dataset), the confusion matrix does not look good (many samples are misclassified).

confusion matrix:

              preds
              A      B
  actual A   50   2707
         B 2549     44

Then I found that the code at http://comments.gmane.org/gmane.comp.python.scikit-learn/5278 can be used to perform sampling with SMOTE. But the result is similar to over sampling (many samples are misclassified). The sampling is done as follows:

    # simplified steps
    # keep 70% of the shuffled majority samples
    shuffled_majority = shuffle(majority_samples)
    n_keep = len(shuffled_majority) * 70 // 100
    down_sampled_majority_samples = shuffled_majority[:n_keep]
    # percentages tested include 100, 2*100, 5*100, but the result is similar
    # SMOTE() is the function from the link above
    synthetic_minority = SMOTE(minority_samples, 12*100, 5)
    train_data = synthetic_minority + minority_samples + down_sampled_majority_samples

In general, what procedure should be followed, or what should be paid attention to, when working with an imbalanced dataset?

Thanks for any advice.
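
The sampling steps described in the question above (down sample the majority class, create synthetic minority samples, then combine everything) could be sketched roughly as follows. This is only an illustration under assumptions: the helper names `down_sample` and `smote_like` are hypothetical, the toy data is random, and `smote_like` is a simplified SMOTE-style interpolation written with the standard library only. It is not the implementation from the linked thread and not the code that produced the confusion matrices above.

```python
import random


def down_sample(rows, percent, rng):
    """Keep a random `percent` (0-100) of `rows`, without replacement."""
    n_keep = len(rows) * percent // 100
    return rng.sample(rows, n_keep)


def smote_like(minority, n_percent, k, rng):
    """Create synthetic minority rows by interpolating towards neighbours.

    n_percent: amount of over sampling in percent, in multiples of 100
               (200 means 2 synthetic rows per original minority row).
    k:         number of nearest minority neighbours to choose from.
    """
    per_row = n_percent // 100
    synthetic = []
    for i, row in enumerate(minority):
        # Rank the other minority rows by squared Euclidean distance.
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(row, m)))
        neighbours = others[:k]
        for _ in range(per_row):
            neighbour = rng.choice(neighbours)
            gap = rng.random()  # position on the segment row -> neighbour
            synthetic.append(
                [a + gap * (b - a) for a, b in zip(row, neighbour)]
            )
    return synthetic


# Toy two-feature data (assumption: real features would come from the dataset).
rng = random.Random(0)
majority = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(1500)]
minority = [[rng.gauss(2.0, 1.0), rng.gauss(2.0, 1.0)] for _ in range(30)]

majority_kept = down_sample(majority, 70, rng)   # keep 70% of the majority
synthetic = smote_like(minority, 200, 5, rng)    # 2 synthetic rows per row

train_x = majority_kept + minority + synthetic
train_y = [0] * len(majority_kept) + [1] * (len(minority) + len(synthetic))

# percentage = # minority rows / total rows, as defined in the question
minority_fraction = train_y.count(1) / len(train_y)
print(len(majority_kept), len(minority), len(synthetic))  # 1050 30 60
```

In this sketch the synthetic rows always lie on a line segment between two existing minority rows, so they add no information beyond what the minority class already contains. The resulting `minority_fraction` is the same "percentage" quantity quoted for the over sampling experiments.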