In continuation of the discussion on `Winsorisation' that has taken place on r-sig-finance today, I thought I'd present all of you with an interesting dataset and a question.
This data is the daily stock returns of the large Indian software firm `Infosys'. (This is the symbol `INFY' on NASDAQ). It is a large number of observations of daily returns (i.e. percentage changes of the adjusted stock price).

Load the data in --

print(load(url("http://www.mayin.org/ajayshah/tmp/infosys_mm.rda")))
str(x)
summary(x)
sd(x)

The name `rj' is used for returns on Infosys, and `rM' is used for returns on the stock market index (Nifty).

There are three really weird observations in this.

weird.rj <- c(1896,2395)
weird.rM <- 2672
x[weird.rj,]
x[weird.rM,]

As you can see, these observations are quite remarkable given the small standard deviations that we saw above. There is absolutely no measurement error here. These things actually happened.

Now consider a typical application: using this to estimate a market model. The goal here is to estimate the coefficient of a regression of rj on rM.

# A regression with all obs
summary(lm(rj ~ rM, data=x))

# Drop the weird rj --
summary(lm(rj ~ rM, data=x[-weird.rj,]))

# Drop the weird rM --
summary(lm(rj ~ rM, data=x[-weird.rM,]))

# Drop both kinds of weird observations --
summary(lm(rj ~ rM, data=x[-c(weird.rM,weird.rj),]))

# Robust regressions
library(MASS)
summary(rlm(rj ~ rM, data=x))
summary(rlm(rj ~ rM, method="MM", data=x))

library(robust)
summary(lmRob(rj ~ rM, data=x))

library(quantreg)
summary(rq(rj ~ rM, tau=0.5, data=x))

So you see, we have a variety of different estimates for the slope (which is termed `beta' in finance). What value would you trust the most?

And, would winsorisation using either my code (https://stat.ethz.ch/pipermail/r-sig-finance/2008q3/002921.html) or Patrick Burns' code (https://stat.ethz.ch/pipermail/r-sig-finance/2008q3/002923.html) be a good idea here?

I'm instinctively unhappy with any scheme based on discarding observations that I'm absolutely sure have no measurement error. We have to model the weirdness of this data generating process, not ignore it.
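For anyone who wants to experiment with winsorisation without opening the linked posts, here is a minimal sketch in R. It is not the code from either linked message. The function name `winsorise', the 1% cutoff, and the simulated data frame `sim' are all assumptions made for illustration, and the numbers below are not the Infosys returns; they only mimic the situation of two extreme rj values and one extreme rM value.

```r
# Simulated stand-in for the real data (hypothetical values, true slope 0.9).
set.seed(1)
n  <- 3000
rM <- rnorm(n, sd = 1.5)              # hypothetical market returns (percent)
rj <- 0.9 * rM + rnorm(n, sd = 2)     # hypothetical stock returns (percent)
rj[c(1896, 2395)] <- c(-60, 40)       # two extreme stock returns
rM[2672] <- -12                       # one extreme market return
sim <- data.frame(rj = rj, rM = rM)

# Winsorise: clip values outside the [p, 1 - p] sample quantiles to those
# quantiles. Observations are kept, not dropped.
winsorise <- function(v, p = 0.01) {
  q <- quantile(v, probs = c(p, 1 - p), names = FALSE)
  pmin(pmax(v, q[1]), q[2])
}

# Clip each variable independently (one possible convention among several).
sim.w <- data.frame(rj = winsorise(sim$rj), rM = winsorise(sim$rM))

# Compare the OLS slope before and after winsorisation.
beta.raw <- coef(lm(rj ~ rM, data = sim))[["rM"]]
beta.win <- coef(lm(rj ~ rM, data = sim.w))[["rM"]]
c(raw = beta.raw, winsorised = beta.win)
```

Replacing `sim' with the real `x' runs the same comparison on the Infosys data. Note that this clips extreme values rather than deleting rows, so it is a different operation from the x[-weird.rj,] style of dropping observations used above.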
-- 
Ajay Shah                                      http://www.mayin.org/ajayshah
http://ajayshahblog.blogspot.com
<*(:-? - wizard who doesn't know the answer.