Some learning algorithms deal with this better than others. The problem is particularly bad in information retrieval (negative examples include almost the entire corpus, while positives are a tiny fraction) and in fraud detection (fraudulent transactions typically make up less than 1% of the training data).
Down-sampling the over-represented class is the simplest answer where you have lots of data. It doesn't help much to have more than 3x more data for one class than for the other anyway (at least in binary decisions).

Another aspect of this is the cost of different errors. For instance, in fraud, verifying a transaction with a customer has a low (but non-zero) cost, while not detecting a fraud in progress can be very, very bad. False negatives are thus more of a problem than false positives, and the models are tuned accordingly.

On Wed, Jul 22, 2009 at 4:03 PM, Miles Osborne <[email protected]> wrote:
> this is the class imbalance problem (ie you have many more instances for
> one class than another one).
>
> in this case, you could ensure that the training set was balanced (50:50);
> more interestingly, you can have a prior which corrects for this. or, you
> could over-sample or even under-sample the training set, etc etc.

--
Ted Dunning, CTO
DeepDyve
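The down-sampling idea above (keep all minority examples, cap the majority class at roughly 3x the minority count) can be sketched in a few lines of Python; the function and parameter names here are illustrative, not from the thread:

```python
import random

def downsample(examples, labels, max_ratio=3.0, seed=42):
    """Down-sample the majority class so it is at most max_ratio
    times the size of the minority class (binary labels 0/1)."""
    rng = random.Random(seed)
    pos = [(x, y) for x, y in zip(examples, labels) if y == 1]
    neg = [(x, y) for x, y in zip(examples, labels) if y == 0]
    # Identify which class is under-represented.
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Keep every minority example; sample the majority down to max_ratio x.
    keep = min(len(majority), int(max_ratio * len(minority)))
    sampled = rng.sample(majority, keep)
    combined = minority + sampled
    rng.shuffle(combined)
    xs, ys = zip(*combined)
    return list(xs), list(ys)
```

With 10 positives and 1000 negatives, this keeps all 10 positives and a random sample of 30 negatives, giving the 3:1 ratio mentioned above. The cost-asymmetry point is separate: after training, the decision threshold would still be tuned so that expensive false negatives are traded against cheap false positives.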
