Did you try CBayes? It's supposed to negate the class imbalance effect to some extent.
On Thu, Jul 23, 2009 at 5:02 AM, Ted Dunning <[email protected]> wrote:

> Some learning algorithms deal with this better than others. The problem is
> particularly bad in information retrieval (negative examples include almost
> the entire corpus, positives are a tiny fraction) and fraud (less than 1% of
> the training data is typically fraud).
>
> Down-sampling the over-represented case is the simplest answer where you
> have lots of data. It doesn't help much to have more than 3x more data for
> one case as another anyway (at least in binary decisions).
>
> Another aspect of this is the cost of different errors. For instance, in
> fraud, verifying a transaction with a customer has low cost (but not
> non-zero) while not detecting a fraud in progress can be very, very bad.
> False negatives are thus more of a problem than false positives and the
> models are tuned accordingly.
>
> On Wed, Jul 22, 2009 at 4:03 PM, Miles Osborne <[email protected]> wrote:
>
>> this is the class imbalance problem (ie you have many more instances for
>> one class than another one).
>>
>> in this case, you could ensure that the training set was balanced (50:50);
>> more interestingly, you can have a prior which corrects for this. or, you
>> could over-sample or even under-sample the training set, etc etc.
>>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
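For what it's worth, here is a minimal sketch of the down-sampling idea Ted describes: keep all minority-class examples and randomly drop majority-class examples until the ratio is at most roughly 3:1. The function name and parameters are just illustrative, not part of any Mahout API.

```python
import random

def downsample_majority(examples, labels, majority_label, max_ratio=3.0, seed=42):
    """Down-sample the over-represented class so it holds at most
    `max_ratio` times as many examples as the rest of the data.
    (Hypothetical helper for illustration only.)"""
    rng = random.Random(seed)
    majority = [(x, y) for x, y in zip(examples, labels) if y == majority_label]
    minority = [(x, y) for x, y in zip(examples, labels) if y != majority_label]

    # Keep at most max_ratio majority examples per minority example.
    keep = min(len(majority), int(max_ratio * len(minority)))
    sampled = rng.sample(majority, keep)

    combined = minority + sampled
    rng.shuffle(combined)
    xs, ys = zip(*combined)
    return list(xs), list(ys)
```

Over-sampling the minority class or adjusting the class prior (as Miles suggests) are alternatives when you can't afford to throw training data away.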
