Thanks, Joris, This clarifies at least where exactly it comes from. I still find the high-level behavior of 'predict' very counter-intuitive as the estimated model contains no coefficients in 'z', but I think we agree on that.
I am not sure how much trouble it would be to improve this behavior, but perhaps one of the core authors can have a look at it. Best, Mark Op vr 16 mrt. 2018 om 13:22 schreef Joris Meys <jorism...@gmail.com>: > Technically it is used as a predictor in the model. The information is > contained in terms : > > > terms(x ~ . - z, data = d) > x ~ (y + z) - z > attr(,"variables") > list(x, y, z) > attr(,"factors") > y > x 0 > y 1 > z 0 > attr(,"term.labels") > [1] "y" > attr(,"order") > [1] 1 > attr(,"intercept") > [1] 1 > attr(,"response") > [1] 1 > attr(,".Environment") > > And the model.frame contains it : > > > head(model.frame(x ~ . - z, data = d)) > x y z > 2 -0.06022984 -0.4483109 b > 3 1.25293390 0.2687065 c > 4 -1.11811090 0.8016076 d > 5 -0.75521720 -0.7484931 e > 6 0.93037156 0.4128456 f > 7 1.32052028 -1.6609043 a > > It is at the construction of the model.matrix that z disappears, but the > contrasts information for z is still attached : > > > attr(model.matrix(x ~ . - z, data = d),"contrasts") > $z > [1] "contr.treatment" > > As you can see from the error you printed, it is model.frame() complaining > about it. In this case it wouldn't be necessary, but it is documented > behaviour of model.frame. Which is why I didn't say "this is not a bug", > but "this is not a bug per se". Meaning that this is not optimal behaviour > and might not what you expect, but it follows the documentation of the > underlying functions. > > Solving it would require a bypass of model.frame() to construct the > correct model,matrix for the new predictions, and that's far from trivial > as model.matrix() itself depends on model.frame(). > > Cheers > Joris > > > > > > On Fri, Mar 16, 2018 at 1:09 PM, Mark van der Loo < > mark.vander...@gmail.com> wrote: > >> Joris, the point is that 'z' is NOT used as a predictor in the model. >> Therefore it should not affect predictions. Also, I find it suspicious that >> the error only occurs when the response variable conitains missings and 'z' >> is unique (I have tested several other cases to confirm this). >> >> -Mark >> >> Op vr 16 mrt. 2018 om 13:03 schreef Joris Meys <jorism...@gmail.com>: >> >>> It's not a bug per se. It's the effect of removing all observations >>> linked to a certain level in your data frame. So the output of lm() doesn't >>> contain a coefficient for level a of z, but your new data contains that >>> level a. With a small addition, this works again: >>> >>> d <- data.frame(x=rnorm(12),y=rnorm(12),z=rep(letters[1:6],2)) >>> >>> d$x[1] <- NA >>> m <- lm(x ~ . -z, data=d) >>> p <- predict(m, newdata=d) >>> >>> This is linked to another discussion earlier on stackoverflow : >>> https://stackoverflow.com/questions/48461980/prediction-in-r-glmm >>> which lead to an update to lme4 : >>> https://github.com/lme4/lme4/issues/452 >>> >>> The point being that factors in your newdata should have the same levels >>> as factors in the original data that was used to fit the model. If you add >>> levels to these factors, it's impossible to use that model to predict for >>> these new data. >>> >>> Cheers >>> Joris >>> >>> On Fri, Mar 16, 2018 at 10:21 AM, Mark van der Loo < >>> mark.vander...@gmail.com> wrote: >>> >>>> Dear R-developers, >>>> >>>> In the 'lm' documentation, the '-' operator is only specified to be used >>>> with -1 (to remove the intercept from the model). >>>> >>>> However, the documentation also refers to the 'formula' help file, which >>>> indicates that it is possible to subtract any term. Indeed, the >>>> following >>>> works with no problems (the period '.' stands for 'all terms except the >>>> lhs'): >>>> >>>> d <- data.frame(x=rnorm(6), y=rnorm(6), z=letters[1:2]) >>>> m <- lm(x ~ . -z, data=d) >>>> p <- predict(m,newdata=d) >>>> >>>> Now, if I change 'z' so that it has only unique values, and I introduce >>>> an >>>> NA in the predicted variable, the following happens: >>>> >>>> d <- data.frame(x=rnorm(6),y=rnorm(6),z=letters[1:6]) >>>> d$x[1] <- NA >>>> m <- lm(x ~ . -z, data=d) >>>> p <- predict(m, newdata=d) >>>> Error in model.frame.default(Terms, newdata, na.action = na.action, >>>> xlev = >>>> object$xlevels) : factor z has new levels a >>>> >>>> It seems a bug to me, although one could argue that 'lm's documentation >>>> does not allow one to expect that the '-' operator should work >>>> generally. >>>> >>>> If it is a bug I'm happy to report it to bugzilla. >>>> >>>> Thanks for all your efforts, >>>> Mark >>>> >>>> ps: I was not able to test this on R3.4.4 yet, but the NEWS does not >>>> mention fixes related to 'lm' or 'predict'. >>>> >>>> >>>> > sessionInfo() >>>> R version 3.4.3 (2017-11-30) >>>> Platform: x86_64-pc-linux-gnu (64-bit) >>>> Running under: Ubuntu 16.04.4 LTS >>>> >>>> Matrix products: default >>>> BLAS: /usr/lib/libblas/libblas.so.3.6.0 >>>> LAPACK: /usr/lib/lapack/liblapack.so.3.6.0 >>>> >>>> locale: >>>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>>> LC_TIME=nl_NL.UTF-8 LC_COLLATE=en_US.UTF-8 >>>> [5] LC_MONETARY=nl_NL.UTF-8 LC_MESSAGES=en_US.UTF-8 >>>> LC_PAPER=nl_NL.UTF-8 LC_NAME=C >>>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>>> LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C >>>> >>>> attached base packages: >>>> [1] stats graphics grDevices utils datasets methods base >>>> >>>> loaded via a namespace (and not attached): >>>> [1] compiler_3.4.3 tools_3.4.3 yaml_2.1.16 >>>> >>>> [[alternative HTML version deleted]] >>>> >>>> ______________________________________________ >>>> R-devel@r-project.org mailing list >>>> https://stat.ethz.ch/mailman/listinfo/r-devel >>>> >>> >>> >>> >>> -- >>> Joris Meys >>> Statistical consultant >>> >>> Department of Data Analysis and Mathematical Modelling >>> Ghent University >>> Coupure Links 653, B-9000 Gent (Belgium) >>> >>> <https://maps.google.com/?q=Coupure+links+653,%C2%A0B-9000+Gent,%C2%A0Belgium&entry=gmail&source=g> >>> >>> ----------- >>> Biowiskundedagen 2017-2018 >>> http://www.biowiskundedagen.ugent.be/ >>> >>> ------------------------------- >>> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php >>> >> > > > -- > Joris Meys > Statistical consultant > > Department of Data Analysis and Mathematical Modelling > Ghent University > Coupure Links 653, B-9000 Gent (Belgium) > > <https://maps.google.com/?q=Coupure+links+653,%C2%A0B-9000+Gent,%C2%A0Belgium&entry=gmail&source=g> > > ----------- > Biowiskundedagen 2017-2018 > http://www.biowiskundedagen.ugent.be/ > > ------------------------------- > Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php > [[alternative HTML version deleted]] ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel