On Thu, Feb 20, 2003 at 06:37:48PM -0500, Rado Bonk wrote:
> Dear R-users,
>
> I have two outliers related questions.
>
> I.
> I have a vector consisting of 69 values.
>
> mean = 0.00086
> SD = 0.02152
>
> The shape of EDA graphics (boxplots, density plots) is heavily distorted
> due to outliers. How to define the interval for outliers exception? Is
> <2SD - mean + 2SD> interval a correct approach?

Yikes. There's been a lot of discussion of this over the years; these
discussions usually generate more heat than light.

<personal bias>
Throwing away outliers without further investigation is often considered
a bad idea. The argument is that you get into a situation where you are
rejecting data because it doesn't fit the model, which is a strange
approach. The most famous case of this was satellite data on ozone
thickness over Antarctica: the ozone hole was missed for years because
of an automatic outlier-rejection routine in the data analysis. If those
outliers hadn't been rejected, corrective steps could have been taken
sooner, avoiding a lot of damage.

My own work is in industrial process control. If I ignored outliers, I'd
make an awful lot of very bad mistakes, and wouldn't have a job for
long. Outliers aren't necessarily wrong; sometimes the data is trying to
tell you something.
</personal bias>

Robust summaries are another way. Check out the help pages for mad(),
IQR(), fivenum().

Having said that, if you want to compare outlier-free data with your raw
data to help enlighten you about where those outliers might be coming
from, something like this might help...

    ss <- mad(myvec)       # robust estimate of spread
    mm <- median(myvec)    # robust estimate of centre
    # keep values within 3 MADs of the median
    ind <- (myvec > mm - 3*ss & myvec < mm + 3*ss)
    # or keep values strictly between the 2.5% and 97.5% quantiles
    ind2 <- (myvec > quantile(myvec, 0.025) & myvec < quantile(myvec, 0.975))

    boxplot(myvec[ind])
    boxplot(myvec[ind2])

Cheers

Jason
-- 
Indigo Industrial Controls Ltd.
64-21-343-545
[EMAIL PROTECTED]

______________________________________________
[EMAIL PROTECTED] mailing list
http://www.stat.math.ethz.ch/mailman/listinfo/r-help
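
A rough sketch of the comparison suggested in the reply, for anyone who wants to try it without their own data. The vector below is invented for illustration (67 evenly spaced values plus two gross outliers, 69 in total) and is not the original poster's data; the names keep_mad, flagged and box_out are likewise just placeholders. It flags suspicious points for inspection rather than deleting them, and also shows boxplot.stats(), which applies the usual 1.5 * IQR whisker rule without drawing anything.

```r
# Hypothetical data: 67 evenly spaced values plus two gross outliers,
# giving 69 values like the vector in the question. Not the poster's data.
myvec <- c(seq(-0.02, 0.02, length.out = 67), 0.15, -0.12)

# Classical summaries are affected by the two extreme values ...
classical <- c(centre = mean(myvec), spread = sd(myvec))
# ... while the median and MAD are far less sensitive to them.
robust <- c(centre = median(myvec), spread = mad(myvec))
print(rbind(classical, robust))

# Flag (rather than silently drop) points outside median +/- 3 * MAD.
keep_mad <- abs(myvec - robust[["centre"]]) < 3 * robust[["spread"]]
flagged <- which(!keep_mad)
print(myvec[flagged])   # inspect these before deciding anything

# boxplot.stats() applies the 1.5 * IQR whisker rule without plotting.
box_out <- boxplot.stats(myvec)$out
print(box_out)

# How much the spread estimate changes once the flagged points are set aside.
print(c(sd_all = sd(myvec), sd_kept = sd(myvec[keep_mad])))
```

With real data the interesting step is the one after flagging: look at when and how the flagged values were recorded before deciding whether they are errors or a signal.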
