On Oct 22, 2011, at 6:57 AM, aajit75 wrote:
Dear All,
I have got the limits for removing extreme values for each variable
using the following function.
f=function(x){quantile(x, c(0.25, 0.75),na.rm = TRUE) - matrix(IQR(x,na.rm = TRUE) * c(1.5), nrow = 1) %*% c(-1, 1)}
I think you need to clarify what your expectations are for that
function. First you calculate the quartiles, and then you add 1.5
times the interquartile range to the first quartile and subtract it
from the third. Exactly how does that identify extreme values? It
appears you would be removing substantial amounts of your data.
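To make the point concrete, here is a rough sketch comparing the limits returned by the posted function with the conventional boxplot ("Tukey") fences, Q1 - 1.5*IQR and Q3 + 1.5*IQR. The helper tukey_fences() and the simulated normal data are my own assumptions for illustration, not part of the original post, and the conventional fences are shown only as one possible definition.

```r
# The function as posted (only the line wrapping is changed).
f <- function(x) {
  quantile(x, c(0.25, 0.75), na.rm = TRUE) -
    matrix(IQR(x, na.rm = TRUE) * c(1.5), nrow = 1) %*% c(-1, 1)
}

# Hypothetical helper (not from the post): conventional fences
# Q1 - k*IQR and Q3 + k*IQR.
tukey_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

set.seed(1)
x <- rnorm(1000)        # assumed test data

lim <- f(x)             # 1 x 2 matrix: Q1 + 1.5*IQR, Q3 - 1.5*IQR
fen <- tukey_fences(x)

# Fraction of observations kept by each rule, using the same
# comparison as in the subset() calls of the original post.
kept_posted <- mean(x <= lim[, 1] & x >= lim[, 2])
kept_tukey  <- mean(x >= fen["lower"] & x <= fen["upper"])

print(lim)
print(fen)
print(c(posted = kept_posted, tukey = kept_tukey))
```

Because Q1 + 1.5*IQR equals Q3 + 0.5*IQR, the posted limits describe a band only half an IQR beyond each quartile, so they can never keep more observations than the conventional fences do.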
#Example:
n <- 100
x1 <- runif(n)
x2 <- runif(n)
x3 <- x1 + x2 + runif(n)/10
x4 <- x1 + x2 + x3 + runif(n)/10
x5 <- factor(sample(c('a','b','c'),n,replace=TRUE))
x6 <- 1*(x5=='a' | x5=='c')
data1 <- cbind(x1,x2,x3,x4,x5,x6)
data2 <- data.frame(data1)
xyz <- lapply(data1, f)
Have you looked at the output of that operation? I get a list of 600
elements:
> str(xyz)
List of 600
$ : num [1, 1:2] 0.315 0.315
$ : num [1, 1:2] 0.0132 0.0132
$ : num [1, 1:2] 0.519 0.519
$ : num [1, 1:2] 0.0917 0.0917
snipped
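The 600 comes from data1 being a 100 x 6 matrix: lapply() treats a matrix as a plain vector and calls the function once per cell, so each "IQR" is taken over a single number and is zero. A small sketch of the difference, reusing the posted example and function (the names per_cell and per_col are mine):

```r
# The function as posted (only the line wrapping is changed).
f <- function(x) {
  quantile(x, c(0.25, 0.75), na.rm = TRUE) -
    matrix(IQR(x, na.rm = TRUE) * c(1.5), nrow = 1) %*% c(-1, 1)
}

set.seed(42)
n  <- 100
x1 <- runif(n)
x2 <- runif(n)
x3 <- x1 + x2 + runif(n)/10
x4 <- x1 + x2 + x3 + runif(n)/10
x5 <- factor(sample(c('a','b','c'), n, replace = TRUE))
x6 <- 1*(x5 == 'a' | x5 == 'c')

data1 <- cbind(x1, x2, x3, x4, x5, x6)  # numeric matrix; x5 becomes integer codes
data2 <- data.frame(data1)              # data frame with six numeric columns

per_cell <- lapply(data1, f)  # one call per matrix cell
per_col  <- lapply(data2, f)  # one call per column

length(per_cell)   # 100 rows * 6 columns = 600
length(per_col)    # 6
names(per_col)     # the column names, so per_col$x1 is defined
```

Only the data frame version yields a list that can be indexed as xyz$x1, xyz$x2, and so on. Note also that cbind() silently turned the factor x5 into its integer codes, for which quartile-based limits are not meaningful.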
#Now, I can eliminate those rows (observations) from the data which
contain extreme values for each of the variables one by one as below.
And now you propose to overwrite data2 not once but twice?
data2 <- subset(data2, x1 <= xyz$x1[,1] & x1 >= xyz$x1[,2])
data2 <- subset(data2, x2 <= xyz$x2[,1] & x2 >= xyz$x2[,2])
.
.
and so on..
But my data has a larger number of variables (more than 120). Can
anybody suggest an efficient way of eliminating rows containing
extreme values?
The first step would be arriving at a sensible definition for "extreme
value". And you should also consider that these are data and removing
"extreme values" is a serious distortion of the data. There needs to
be some justification for cutting out the extremes.
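If a definition has been settled on and justified, the mechanics for many columns need not involve one subset() call per variable. The following is only a sketch under stated assumptions: it uses the conventional 1.5*IQR fences as a stand-in definition, treats missing values as not extreme, and the helper within_fences() and the toy data frame dat are hypothetical names of my own.

```r
# Hypothetical helper: TRUE where a value lies inside
# [Q1 - k*IQR, Q3 + k*IQR]; NA values are treated as not extreme.
within_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  is.na(x) | (x >= q[1] - k * iqr & x <= q[2] + k * iqr)
}

set.seed(123)
dat <- data.frame(
  a = c(rnorm(99), 50),     # the last value is planted far outside
  b = rnorm(100),
  g = factor(sample(c("a", "b", "c"), 100, replace = TRUE))
)

# Apply the rule to the numeric columns only.
num_cols <- vapply(dat, is.numeric, logical(1))

# n x p logical matrix: one column of flags per numeric variable.
ok <- vapply(dat[num_cols], within_fences, logical(nrow(dat)))

# Keep the rows in which no numeric variable is flagged.
keep    <- rowSums(!ok) == 0
trimmed <- dat[keep, , drop = FALSE]

nrow(dat)
nrow(trimmed)
```

All limits here are computed once from the original data and applied in a single step, so the result does not depend on the order in which variables are processed, unlike repeatedly overwriting data2. With more than 120 variables, expect a sizeable share of rows to be dropped even when each variable flags only a few values.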
--
David Winsemius, MD
West Hartford, CT