Dear Brian,

Whenever I disagree with you, I wonder what error I made, and almost
always discover that there was something I missed.
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Prof Brian Ripley
> Sent: Saturday, February 19, 2005 1:09 AM
> To: John Fox
> Cc: 'Markus Jantti'; r-devel@stat.math.ethz.ch; 'Liaw, Andy'
> Subject: [Rd] RE: [R] using poly in a linear regression in the
> presence of NA fails (despite subsetting them out)
>
> On Fri, 18 Feb 2005, John Fox wrote:
>
> > Dear Andy, Brian, and Markus,
> >
> > I've moved this to r-devel because the issue is a bit esoteric. I
> > apologize for joining the discussion so late, but didn't have time
> > earlier in the week to formulate these ideas.
>
> I think you didn't take time to check what happens in R-devel, though.

I did check the R 2.1 news file for poly, na.action, and na.omit, but
didn't find anything. Did I miss something?

> > I believe that I understand and appreciate Brian's point, but think
> > that the issue isn't entirely clear-cut.
> >
> > It seems to me that two kinds of questions arise with missing data.
> > The deeper ones are statistical, but there are also nontrivial
> > mechanical issues such as the one here.
> >
> > Whenever a term in a model is parametrized with a data-dependent
> > basis, there's a potential for problems and confusion -- manifested,
> > for example, in the issue of "safe" prediction. I don't think that
> > these problems are unique to missing data. On the other hand, the
> > basis selected for the subspace spanned by a term shouldn't be the
> > most important consideration. That is, when models are equivalent --
> > as, for example, lm(y ~ x + I(x^2)) and lm(y ~ poly(x, 2)) -- an
> > argument could be made for treating them similarly.
>
> Only in linear models, and we are here talking about general behaviour
> of finding the model frame.

That's a good point -- I hadn't considered it. I don't see the full
ramifications, however.

> > Brian's point that NAs, say, in x2, can influence the basis for
> > poly(x1, 2) is disquieting, but note that this can happen now if
> > there are no NAs in x1.
>
> That is not what I said, and it is incorrect. poly(x1, 2) is
> determined by the set of values of x1 (so they need all to be known,
> hence no NAs in x1) and nothing else. Example:

What you said was, "The problem is that poly(x, 2) depends on the
possible set of x values, and so needs to know all of them, unlike e.g.
log(x) which is observation-by-observation. Silently omitting missing
values is not a good idea in such cases, especially if the values are
missing in other variables (which is what na.action is likely to do)."
I misinterpreted it.
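To make the distinction concrete before returning to Brian's quoted
example below -- a small check of my own, added for illustration and not
part of the original exchange -- the basis values poly() assigns to a
given observation depend on the whole set of values, whereas log() works
observation by observation:

x <- 1:10
## poly()'s basis is built from all of the x values, so the basis values
## assigned to x = 1 change if the set of values changes:
poly(x, 2)[1, ]        # -0.4954337  0.5222330  (from the 10-value basis)
poly(x[1:9], 2)[1, ]   # different numbers: the basis is recomputed for 1:9
## log(), by contrast, is computed observation by observation:
identical(log(x)[1:9], log(x[1:9]))   # TRUE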
> y <- rnorm(10)
> x1 <- 1:10
> x2 <- c(NA, runif(9))
> model.frame(y ~ poly(x1, 2) + x2, na.action = na.omit)
>             y poly(x1, 2).1 poly(x1, 2).2         x2
> 2  -0.5110095   -0.38533732    0.17407766 0.07377988
> 3  -0.9111954   -0.27524094   -0.08703883 0.30968660
> 4  -0.8371717   -0.16514456   -0.26111648 0.71727174
> 5   2.4158352   -0.05504819   -0.34815531 0.50454591
> 6   0.1340882    0.05504819   -0.34815531 0.15299896
> 7  -0.4906859    0.16514456   -0.26111648 0.50393349
> 8  -0.4405479    0.27524094   -0.08703883 0.49396092
> 9   0.4595894    0.38533732    0.17407766 0.75120020
> 10 -0.6937202    0.49543369    0.52223297 0.17464982
>
> model.frame(y ~ poly(x1, 2) + x2, na.action = na.pass)
>             y poly(x1, 2).1 poly(x1, 2).2         x2
> 1  -0.1102855   -0.49543369    0.52223297         NA
> 2  -0.5110095   -0.38533732    0.17407766 0.07377988
> 3  -0.9111954   -0.27524094   -0.08703883 0.30968660
> 4  -0.8371717   -0.16514456   -0.26111648 0.71727174
> 5   2.4158352   -0.05504819   -0.34815531 0.50454591
> 6   0.1340882    0.05504819   -0.34815531 0.15299896
> 7  -0.4906859    0.16514456   -0.26111648 0.50393349
> 8  -0.4405479    0.27524094   -0.08703883 0.49396092
> 9   0.4595894    0.38533732    0.17407766 0.75120020
> 10 -0.6937202    0.49543369    0.52223297 0.17464982

The point that I was trying to make is this:

> y <- rnorm(10)
> x1 <- 1:10
> x2 <- c(NA, runif(9))
> mf <- model.frame(y ~ poly(x1, 2) + x2, na.action = na.omit)
> cor(mf[,2])
          1         2
1 1.0000000 0.4037864
2 0.4037864 1.0000000
> Data <- na.omit(data.frame(y, x1, x2))
> mf <- model.frame(y ~ poly(x1, 2) + x2, na.action = na.omit, data=Data)
> cor(mf[,2])
  1 2
1 1 0
2 0 1

> > The point, therefore, doesn't really justify the current behaviour
> > of poly(). Indeed, if there are NAs in x2 but not in x1, the columns
> > representing poly(x1, 2) won't be orthogonal in the subset of cases
> > used in the model fit (though they do provide a correct basis for
> > the term).
>
> It is currently documented to not allow NAs:
>
>   x, newdata: a numeric vector at which to evaluate the polynomial.
>               'x' can also be a matrix. Missing values are not allowed
>               in 'x'.
>
> Why does that need any justification?

Because one might expect (perhaps, you could argue, one shouldn't
expect) the basis for the polynomial term to be orthogonal in the subset
of observations actually used for the fit.

> > Consider another example -- lm(y ~ f, subset = f != "a"), where f is
> > a factor with levels "a", "b", and "c", and where there are NAs in
> > f. Here the basis for the f term is data dependent, in that the
> > baseline level is taken as "b" in the absence of "a", yet NAs don't
> > cause an error.
> >
> > Having poly(), bs(), and ns() report a more informative error
> > message certainly is preferable to the current situation, but an
> > alternative would be to have them work and perhaps report a warning.
>
> What exactly does `work' mean here? Run and give misleading answers?

That implies that there's a simple right answer here, which I don't
think is the case. I'm persuaded that what I recommended is not a good
idea because of its more general implications, but it's not obvious to
me that creating a basis that's orthogonal in the subset of observations
used for the fit is misleading while the current behaviour isn't.
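Returning to the factor example above -- a minimal sketch of my own, not
part of the original exchange, of how the baseline level shifts when "a"
is excluded while the NAs in f pass through without an error:

f <- factor(c("a", "b", "b", "c", "c", NA, "a", "b", "c", NA))
y <- rnorm(10)
## With the full data, the treatment contrasts use "a" as the baseline:
colnames(model.matrix(lm(y ~ f)))
## [1] "(Intercept)" "fb"          "fc"
## With subset = f != "a", the unused level "a" is dropped (lm() builds
## its model frame with drop.unused.levels = TRUE), so "b" becomes the
## baseline; the rows where f is NA are removed by the default na.omit,
## and no error is signalled:
colnames(model.matrix(lm(y ~ f, subset = f != "a")))
## [1] "(Intercept)" "fc"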
> > If the object is to protect people from stepping into traps because
> > of missing data, then an argument could be made for having the
> > default na.action be na.fail, as in S-PLUS. (I wouldn't advocate
> > this, by the way.) Perhaps I'm missing some consideration here, but
> > isn't it clearest to allow the data, subset, and na.action arguments
> > to determine the data in which the formula is evaluated?
>
> Exactly how? It would be a major change from the current methodology
> and I am not sure you appreciate what that is.

Yes, I looked at how this works currently, but agree that I didn't
appreciate the implications of changing it -- e.g., for nonlinear
models. I still don't entirely appreciate those implications.

> Remember that na.action could for example replace missing values by
> random imputations, or even multiple imputations, and na.action comes
> quite late in the process *after* the model frame has been formed.
> Unlike factors, poly() etc generate multiple columns in the model
> frame, and then subset and na.action are applied. Factors are encoded
> in model.matrix (and that is really only used for the linear parts of
> models, unlike model.frame).
>
> I am not sure you appreciate the basic point: poly is applied to the
> whole dataset and not to a subset, and that can be important (e.g. to
> ensure no unstable extrapolation when predicting later).

Note, however, that a user could create the subset as a data frame
through na.omit() prior to applying poly() -- as I did in the example
above. In fact, that's probably what a user of poly() would now want to
do.

> Really na.action has to come last, as functions in the formula could
> themselves generate NAs (log(0) for example).

I did understand the sequence of operations, but didn't see the point
about newly generated NAs. I agree that this makes it necessary to
remove NAs last.

> There is an alternative, that poly() works only on finite values and
> passes through non-finite ones, but it was deliberately not written
> that way and you will need to convince everyone that would be a better
> solution. (What happens for example if there are only two finite
> values?) If you want such a function, it is easy for you to provide
> it: just please don't call it poly(). [Writing poly() was not easy
> BTW, but poly_allow_NAs would be easy given poly.]

I'm no longer convinced that what I proposed is a better solution, but
the current situation seems problematic to me as well. Thanks for the
extended explanation.

John

> --
> Brian D. Ripley,                  [EMAIL PROTECTED]
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
R-devel@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel