Andy,

Thanks for your email.

I understand that, by default, the sampsize argument uses the response (the
behavior variable we are classifying) as the strata variable.

So I set sampsize=c(no=89, yes=11), but that gave a 99% classification error
rate on the yes class. When I oversample the yes class with sampsize=c(50,50),
the classification error rate is roughly equal, about 50%, for both the yes
and no classes.
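
For reference, here is a minimal sketch of the two settings on simulated data.
The data, the variable names and the helper stratified_draw() are made up for
illustration; strata and sampsize are the randomForest() arguments discussed
in this thread. The helper only mimics, in base R, what one stratified
per-tree draw looks like, and the randomForest calls run only if the package
is installed.

```r
set.seed(42)

# Simulated stand-in for the real data (all made up): 14 binary
# predictors and a yes/no response with "yes" in the minority.
n <- 2000
x <- as.data.frame(matrix(rbinom(n * 14, 1, 0.3), ncol = 14))
p_yes <- plogis(-3 + 1.2 * x$V1 + 0.8 * x$V2)
y <- factor(ifelse(runif(n) < p_yes, "yes", "no"), levels = c("no", "yes"))

# Hypothetical helper: draw one stratified bootstrap sample, taking
# sampsize[[level]] cases (with replacement) from each class.
stratified_draw <- function(y, sampsize) {
  unlist(lapply(names(sampsize), function(lv) {
    pool <- which(y == lv)
    pool[sample.int(length(pool), sampsize[[lv]], replace = TRUE)]
  }), use.names = FALSE)
}

draw_orig <- stratified_draw(y, c(no = 89, yes = 11))  # original proportions
draw_bal  <- stratified_draw(y, c(no = 50, yes = 50))  # balanced classes
print(table(y[draw_orig]))
print(table(y[draw_bal]))

# The same two settings passed to randomForest(), if it is available.
if (requireNamespace("randomForest", quietly = TRUE)) {
  fit_orig <- randomForest::randomForest(x, y, strata = y,
                                         sampsize = c(no = 89, yes = 11))
  fit_bal  <- randomForest::randomForest(x, y, strata = y,
                                         sampsize = c(no = 50, yes = 50))
  # Out-of-bag error rate per class for each setting.
  print(fit_orig$confusion[, "class.error"])
  print(fit_bal$confusion[, "class.error"])
}
```

The names on sampsize are matched to the factor levels of the response, so
writing them out avoids mixing up which count belongs to which class.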

Is there a principled way to decide how much to oversample the minority
class?
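
One empirical way to look at this, sketched under assumptions: hold the
per-tree sample at 100 cases, vary the share of yes cases, and compare the
out-of-bag error per class. sampsize_grid() is a hypothetical helper, the
candidate shares are arbitrary, and the simulated data are made up; the scan
runs only if randomForest is installed.

```r
# Hypothetical helper: one named sampsize vector per candidate yes share,
# each adding up to `total` cases per tree.
sampsize_grid <- function(total, yes_share) {
  lapply(yes_share, function(s) {
    n_yes <- round(total * s)
    c(no = total - n_yes, yes = n_yes)
  })
}

grid <- sampsize_grid(100, c(0.11, 0.20, 0.30, 0.40, 0.50))

if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(42)
  # Simulated stand-in for the real data (all made up).
  n <- 2000
  x <- as.data.frame(matrix(rbinom(n * 14, 1, 0.3), ncol = 14))
  p_yes <- plogis(-3 + 1.2 * x$V1 + 0.8 * x$V2)
  y <- factor(ifelse(runif(n) < p_yes, "yes", "no"), levels = c("no", "yes"))

  # Fit one forest per sampsize setting and collect the OOB class errors.
  oob <- t(sapply(grid, function(ss) {
    fit <- randomForest::randomForest(x, y, strata = y, sampsize = ss)
    c(ss, fit$confusion[, "class.error"])
  }))
  colnames(oob) <- c("n_no", "n_yes", "err_no", "err_yes")
  print(oob)
}
```

The table makes the trade-off visible: which row is acceptable depends on how
costly a missed yes is relative to a false yes, which the data alone do not
decide.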

Thanks for your reply and all your help.

Raghu


On 12/4/08, Liaw, Andy <[EMAIL PROTECTED]> wrote:
>
> If I understand your situation correctly, you may be able to make use of
> the "strata" and "sampsize" arguments in randomForest() to get bootstrap
> samples that resemble the original data distribution.  They allow you to
> specify stratified samples using the "strata" variable.
>
> Best,
> Andy
>
> From: Raghu Naik
> >
> > Folks,
> >
> > I have a query around weighting in Random Forest (RF). I know that several
> > earlier emails in this group have raised this issue, but I did not find an
> > answer to my query.
> >
> > I am working on a dataset (dataset1) that consists of 4 million records that
> > can be reduced to a dataset (dataset2) of approximately 1500 unique records
> > with frequency counts that add up to the 4 million records number as above.
> > Because of size issues, I cannot work with dataset1 in R and therefore, I am
> > working with dataset2.
> >
> > Each record consists of whether or not a patient chose a particular drug
> > based on 14 comorbidity (Yes / No) variables; I am using RF to understand
> > the comorbidity drivers of drug adoption (yes/no) classification.
> >
> > At full dataset level (dataset1), the drug adoption incidence is ~11%. At
> > the reduced dataset dataset2 level, the drug adoption incidence increases to
> > ~38%.
> >
> > My question is: if I am using the reduced dataset (dataset2), how should
> > I inform RF that the adoption incidence at the full dataset level was 11%?
> > Should that be used as a classwt prior with classwt=c(Yes=.11, No=.89)? My
> > understanding is that RF does not allow case weighting.
> > Or can this be handled with the sampsize argument through oversampling?
> > What proportions should one use for this (e.g., sampsize=c(Yes=100,
> > No=100))?
> >
> >
> >
> > I would appreciate any feedback or pointers to any earlier thread that I may
> > have overlooked.
> >
> > Regards,
> >
> > Raghu

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
