Thanks Peter, Ted! Best, Anirban
On Tue, Jul 12, 2011 at 4:54 AM, Ted Harding <ted.hard...@wlandres.net> wrote: > On 11-Jul-11 07:55:44, Anirban Mukherjee wrote: >> Hi all, >> >> I wanted to mark the estimation sample: mark what rows (observations) >> are deleted by lm due to missingness. For eg, from the original >> example in help, I have changed one of the values in trt to be NA >> (missing). >> >># code below >># ---- >># original example >>> ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14) >>> trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69) >> >># change 18th observation of trt >>> trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,NA,4.32,4.69) >>> group <- gl(2,10,20, labels=c("Ctl","Trt")) >>> weight <- c(ctl, trt) >>> lm.D9 <- lm(weight ~ group) >>> summary(lm.D9) >> >> Call: >> lm(formula = weight ~ group) >> >> Residuals: >> ____ Min______ 1Q__ Median______ 3Q_____ Max >> -1.04556 -0.48378_ 0.05444_ 0.23622_ 1.39444 >> >> Coefficients: >> ___________ Estimate Std. Error t value Pr(>|t|) >> (Intercept)__ 5.0320____ 0.2258_ 22.281 5.09e-14 *** >> groupTrt____ -0.3964____ 0.3281_ -1.208___ 0.244 >> --- >> Signif. codes:_ 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 >> >> Residual standard error: 0.7142 on 17 degrees of freedom >> _ (1 observation deleted due to missingness) >> Multiple R-squared: 0.07907,___ Adjusted R-squared: 0.0249 >> F-statistic:_ 1.46 on 1 and 17 DF,_ p-value: 0.2435 >> >># ------ >># end snippet >> >> I want to generate an indicator variable to mark the observations used >> in estimation: 1 for a row not deleted, 0 for a row deleted. In this >> case I want an indicator variable that has seventeen 1s, one 0, and >> then 2 1s. I know I can do ind = !is.na(group) in the above example. >> But I am ideally looking for a way that allows one to use any formula >> in lm, and still be able to mark the estimation sample. >> Function/option I am missing? The best I could come up with: >> >>> lm.D9 <- lm(weight ~ group, model=TRUE) >>> ind <- as.numeric(row.names(lm.D9$model)) >>> esamp <- rep(0,length(group)) #substitute nrow(data.frame used in >>> estimation) for length(group) >>> esamp[ind] <- 1 >>> esamp >> [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 >> >> Is this "safe" (recommended?)? >> >> Appreciate any help. >> >> Best, A > > Separately from Peter Dalgaard's response, you raise a generic > quedtion about how to find out which observations have been > used in an LM fit when some cases may have been omitted, e.g. > because of missing values (NA). > > Take the following as an example: > > X <- (1:10) > Y <- X + rnorm(10) > LM <- lm(Y ~ X) > > X1 <- X > X1[c(4,8)] <- NA ## so cases 4 & 8 will be omitted > LM1 <- lm(Y ~ X1) > > row.names(LM$model) > # [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" > row.names(LM1$model) > # [1] "1" "2" "3" "5" "6" "7" "9" "10" > > which( (row.names(LM$model) %in% row.names(LM1$model)) ) > # [1] 1 2 3 5 6 7 9 10 > ### These are the indices of the cases which were kept > > which(!(row.names(LM$model) %in% row.names(LM1$model)) ) > # [1] 4 8 > ### These are this indices of the cases which were omitted > > You could also use 'names(LM$res)' and 'names(LM1$res)' > instead of 'row.names(LM$model' and 'row.names(LM$model)' > in the above. > > Hoping this helps, > Ted. > > -------------------------------------------------------------------- > E-Mail: (Ted Harding) <ted.hard...@wlandres.net> > Fax-to-email: +44 (0)870 094 0861 > Date: 11-Jul-11 Time: 21:54:05 > ------------------------------ XFMail ------------------------------ > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.