Dear R-experts,

Searching the help archives I found a recommendation to do multivariate outlier identification by Mahalanobis distances based on a robustly estimated covariance matrix, and to compare the resulting distances to a chi^2 distribution with p (the number of variables) degrees of freedom. I understand that, compared to Euclidean distances, this has the advantage of being scale-invariant. However, it seems that such Mahalanobis distances are not invariant to redundancies: adding a highly collinear variable changes the Mahalanobis distances (see code below). Doesn't the comparison to chi^2 also assume that all variables are independent?

Can anyone recommend a procedure to calculate distances and identify multivariate outliers which is invariant to the degree of collinearity?

Thanks for any advice

Jens Oehlschlägel


# Example code

library(MASS)

# generate bivariate normal test data
n <- 500
x <- matrix(rnorm(n*2), ncol=2)
# scale, otherwise euclidean fails
x <- scale(x)
cr <- cov.rob(x, method="mcd")
center <- cr$center

# calculate squared euclidean and mahalanobis distances
d <- rowSums(t(t(x)-center)^2)
m <- as.vector(mahalanobis(x, center, cr$cov))

# euclidean and mahalanobis basically coincide,
# mahalanobis slightly biased by robust covariance underestimation
eqscplot(x=d, y=m); abline(0,1)

# Now I add a highly redundant column in the hope that
# the distances between cases will not change
x2 <- cbind(x, x[,1]+rnorm(n, sd=0.01))
# scale, otherwise euclidean fails
x2 <- scale(x2)
cr2 <- cov.rob(x2, method="mcd")
center2 <- cr2$center
d2 <- rowSums(t(t(x2)-center2)^2)
m2 <- as.vector(mahalanobis(x2, center2, cr2$cov))

# though equally scaled, euclidean and mahalanobis diverge
eqscplot(x=d2, y=m2); abline(0,1)

# mahalanobis distances are obviously not redundancy invariant
eqscplot(x=m, y=m2); abline(0,1)
# especially if rank order of distances is considered
eqscplot(x=rank(m), y=rank(m2)); abline(0,1)
cor(m, m2)
cor(m, m2, method="spearman")

# euclidean distances look better but are also not redundancy invariant
eqscplot(x=d, y=d2); abline(0,1)
eqscplot(x=rank(d), y=rank(d2)); abline(0,1)
cor(d, d2)
cor(d, d2, method="spearman")
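
For concreteness, the flagging step of the recommendation I am asking about (which the example code above stops short of) might be sketched roughly as follows. This is only my reading of it: the fixed seed and the 0.975 quantile used as cutoff are my own assumptions, not something stated in the archived advice.

```r
library(MASS)

set.seed(1)   # assumption: fixed seed, only so that the sketch is repeatable

# bivariate normal test data, as in the example above
n <- 500
p <- 2
x <- matrix(rnorm(n * p), ncol = p)

# robust location and scatter estimate (MCD)
cr <- cov.rob(x, method = "mcd")

# mahalanobis() returns SQUARED distances, which is what gets
# compared to a chi^2 distribution with p degrees of freedom
m <- as.vector(mahalanobis(x, cr$center, cr$cov))

# assumption: the 97.5% quantile as cutoff; the level is a free choice
cutoff <- qchisq(0.975, df = p)

# cases whose squared robust distance exceeds the cutoff are flagged
outlier <- m > cutoff
table(outlier)
```

With clean normal data like this one would expect only a few percent of the cases to be flagged; the exact count depends on the seed and on the MCD fit.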
