I have a few questions but can't find good answers after searching on the internet.

1. To what degree is a dataset usually considered imbalanced? It is clear that an imbalanced dataset consists of majority and minority classes, but it is not clear what percentage the minority classes usually make up. After googling, I found that some people describe a dataset with 10% minority and 90% majority as imbalanced [1]. What about 20% or 30% minority (80% or 70% majority)?

2. What kind of data distribution can be considered good for an imbalanced dataset? Or what technique (Principal Component Analysis?) can be applied to examine the data distribution?

I use a balanced random forest to deal with imbalanced datasets and tested it against the dataset found at [2]. The results look OK and stable (see A). However, when I switch to another dataset (say dataset X) instead of the one in [2], the confusion matrix changes drastically and is unstable across repeated runs (see B). The differences I can think of are the way I encode the values in dataset X, and the fact that the dataset found at [2] is a dense matrix (nearly every column has a value in every row).

Dataset X has the following characteristics:

1. It has around a hundred columns.
2. Many columns contain no values for most rows (sparse matrix).
3. A few columns (comparable to a database primary/unique key) contain almost as many distinct values as there are rows. For example, the dataset has 500 rows and column a contains 489 different values, so len(numpy.unique(a_values_array, return_inverse=True)[0]) = 489.
4. Many columns are categorical, so most values are encoded using numpy.unique(array, return_inverse=True)[1].

The way I encode/factorize the values is:

- Remove columns that do not contain many values, i.e. columns whose value count falls below a threshold, e.g. 50. So a column with 500 rows that contains only 49 values is removed.
- Encode columns with many distinct values (like those described in 3 above) as 0/1; i.e. a cell that has a value is encoded as 1, otherwise 0.
- Factorize categorical columns with e.g. numpy.unique(column_value_array, return_inverse=True)[1].
- Leave numeric columns as they are.

I appreciate any suggestions. Many thanks.

In the confusion matrices below, each row sums to the record count of that label, so rows are the actual labels and columns the predicted labels.

A.) UCI dataset

Test set: label b has 50 records, label a has 470 records.

         a     b
    a  377    93
    b   11    39

Test set: label b has 41 records, label a has 437 records.

         a     b
    a  370    67
    b    8    33

B.) Other dataset

b has 24 records, a has 152 records, percent of b: 13.64%

         a     b
    a   95    57
    b    8    16

b has 20 records, a has 137 records, percent of b: 12.74%

         a     b
    a  118    19
    b   15     5

b has 20 records, a has 169 records, percent of b: 10.58%

         a     b
    a  148    21
    b   17     3

[1] http://stats.stackexchange.com/questions/60564/should-my-test-set-be-balanced-or-imbalanced
[2] http://www.cs.gsu.edu/~zding/research/benchmark-data.php

_______________________________________________
Scikit-learn-general mailing list
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
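
For concreteness, the four encoding steps listed above could be sketched roughly as follows. This is only an illustration, not the exact code used. The function name encode_column, the key_ratio cutoff, the use of None for missing cells, and the -1 code for missing categorical cells are all assumptions made for the sketch. Only the numpy.unique(..., return_inverse=True) call and the threshold idea come from the description above.

```python
import numbers
import numpy as np

def encode_column(values, min_count=50, key_ratio=0.9):
    """Encode one column; return None if the column should be dropped.

    Assumptions for this sketch: missing cells are None, and a
    non-numeric column is "key-like" when its number of distinct values
    exceeds key_ratio times the number of present values.
    """
    values = np.asarray(values, dtype=object)
    present = np.array([v is not None for v in values])
    n_present = int(present.sum())

    # Step 1: drop columns with fewer than min_count values.
    if n_present < min_count:
        return None

    filled = values[present]

    # Step 4: numeric columns stay as they are (missing cells become NaN).
    if all(isinstance(v, numbers.Real) for v in filled):
        out = np.full(len(values), np.nan)
        out[present] = filled.astype(float)
        return out

    # Compare as strings so numpy.unique can sort mixed values.
    labels = np.array([str(v) for v in filled])
    uniques, codes = np.unique(labels, return_inverse=True)

    # Step 2: key-like columns become a 0/1 "has a value" indicator.
    if len(uniques) > key_ratio * n_present:
        return present.astype(int)

    # Step 3: factorize categorical columns (assumed: missing cells -> -1).
    out = np.full(len(values), -1)
    out[present] = codes
    return out

# Tiny demo columns; min_count is lowered because they have only 10 rows.
key_col = ["id%d" % i for i in range(10)]
cat_col = ["x", "y", None, "x", "y", "x", None, "y", "x", "y"]
sparse_col = [None] * 9 + ["z"]

encoded_key = encode_column(key_col, min_count=5)
encoded_cat = encode_column(cat_col, min_count=5)
encoded_sparse = encode_column(sparse_col, min_count=5)
```

One thing the sketch makes visible: factorized codes impose an arbitrary integer order on categories, and a tree-based model will split on that order, which may be one source of the differences between runs.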

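To compare the runs in A and B on a common scale, the recall and precision of the minority label b can be computed from each confusion matrix. The sketch below assumes that rows are actual labels and columns are predicted labels, which is consistent with the record counts reported for each run. The helper name minority_scores and the run names are made up for illustration.

```python
import numpy as np

def minority_scores(cm, minority=1):
    """Return (recall, precision) for the class at index `minority`.

    Assumes cm[i][j] counts records of actual class i predicted as j.
    """
    cm = np.asarray(cm, dtype=float)
    true_pos = cm[minority, minority]
    recall = true_pos / cm[minority, :].sum()      # share of actual b found
    precision = true_pos / cm[:, minority].sum()   # share of predicted b correct
    return recall, precision

# Confusion matrices copied from the results in A and B (order: a, b).
runs = {
    "A run 1": [[377, 93], [11, 39]],
    "A run 2": [[370, 67], [8, 33]],
    "B run 1": [[95, 57], [8, 16]],
    "B run 2": [[118, 19], [15, 5]],
    "B run 3": [[148, 21], [17, 3]],
}

for name, cm in runs.items():
    recall, precision = minority_scores(cm)
    print(f"{name}: recall(b)={recall:.2f} precision(b)={precision:.2f}")
```

With only 20 to 24 minority records per test set in B, a handful of changed predictions moves these ratios a lot, so part of the instability may simply come from the small minority sample.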