My problem is basically solved now. The main issue was noisy data introduced when the original dataset was converted to numeric values. The model performs better when the categorical values are grouped first, rather than simply running e.g. pd.factorize(), which may create a large list of unique values.
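To illustrate the difference, here is a minimal sketch on made-up data. The helper name group_rare_categories, the min_count cutoff, and the "OTHER" label are assumptions for this example and are not taken from the actual dataset; grouping by frequency is only one possible way to group categories.

```python
import pandas as pd


def group_rare_categories(series, min_count=2, other_label="OTHER"):
    """Replace categories seen fewer than min_count times with one label.

    Hypothetical helper for illustration; the cutoff and label are assumptions.
    """
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep frequent values as they are, collapse the rare ones.
    return series.where(~series.isin(rare), other_label)


# Toy column: two frequent values and three values that occur only once.
colors = pd.Series(["red", "red", "blue", "blue", "teal", "plum", "gold"])

# Factorizing the raw column gives one integer code per distinct value.
codes_raw, uniques_raw = pd.factorize(colors)

# Grouping first keeps the list of distinct values short.
grouped = group_rare_categories(colors, min_count=2)
codes_grouped, uniques_grouped = pd.factorize(grouped)

print(len(uniques_raw), len(uniques_grouped))
```

On a real dataset the cutoff would need tuning, for example by cross-validation.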
Thanks for all your help.

Sincerely

----- Original Message -----
From: Olivier Grisel <olivier.gri...@ensta.org>
To: ChungHung Liu <chliu52...@yahoo.co.uk>; scikit-learn-general <scikit-learn-general@lists.sourceforge.net>
Cc:
Sent: Wednesday, 18 September 2013, 4:43
Subject: Re: [Scikit-learn-general] Imbalanced dataset

You might want to try cascading a high-precision linear classifier (tuning the intercept_ attribute based on the PR curve) to trim most of the majority class, followed by a second-stage classifier, as described in this paper by Google:

http://research.google.com/pubs/pub37195.html

I have never tried it myself, but it sounds interesting and should be doable using sklearn models as building blocks.

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
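A minimal sketch of the cascade idea from the quoted message, on synthetic data. This is one possible reading of the suggestion, not the method from the linked paper: the dataset, the 0.99 recall target, and the choice of LogisticRegression and RandomForestClassifier are all assumptions. The first stage is tuned so that the samples it discards are almost all majority class; thresholding decision_function this way has the same effect as shifting intercept_ by the same amount.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption: roughly 5% minority class, label 1).
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Stage 1: linear model scored with decision_function.
stage1 = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = stage1.decision_function(X_train)

# Pick the highest threshold that still keeps the target minority recall.
# recall[:-1] lines up with thresholds; 0.99 is an assumed target.
precision, recall, thresholds = precision_recall_curve(y_train, scores)
target_recall = 0.99
threshold = thresholds[recall[:-1] >= target_recall].max()

# Samples below the threshold are trimmed as majority class.
keep_train = scores >= threshold

# Stage 2: any classifier, trained only on what stage 1 lets through.
stage2 = RandomForestClassifier(n_estimators=100, random_state=0)
stage2.fit(X_train[keep_train], y_train[keep_train])


def cascade_predict(X_new):
    """Predict 0 for trimmed samples, defer the rest to stage 2."""
    pred = np.zeros(len(X_new), dtype=int)
    keep = stage1.decision_function(X_new) >= threshold
    if keep.any():
        pred[keep] = stage2.predict(X_new[keep])
    return pred


y_pred = cascade_predict(X_test)
print("kept after stage 1:", int(keep_train.sum()), "of", len(y_train))
```

In practice the threshold should be chosen on held-out data rather than on the training set, since the recall estimated on the training scores is optimistic.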