I am working on a new package, one in which the user needs to specify the role that different variables play in the analysis. Where I'm stumped is the best way to have users specify those roles.

Approach #1: Separate formula for each special component

First I thought to have users specify each formula separately, like:

    new.function(formula=y~X1+X2+X3,
                 weights=~w,
                 observationID=~ID,
                 strata=~site,
                 data=mydata)

This seems to be a common approach in other packages. However, one of my testers noted that if he put formula=y~. then w, ID, and site showed up in the model where they weren't supposed to be. I could add some code to try to prevent that (string matching and editing the terms object, perhaps?), but that seemed a little clumsy to me.

Approach #2: Create specials to label special variables

So I turned to the user interface design in coxph, where the user can specify strata and cluster in a single formula. So my approach would look something like:

    new.function(formula=y~weights(w)+strata(site)+observationID(ID)+X1+X2+X3,
                 data=mydata)

My aim would be that the user could use a dot instead of X1+X2+X3 and the dot would not expand to include w, site, and ID. However, at least as implemented in coxph(), this approach does not handle the dot in the formula any better than the first approach.

    Call:
    coxph(formula = Surv(time, status) ~ strata(sex) + ., data = test1)

         coef exp(coef) se(coef)     z    p
    x   0.802      2.23    0.822 0.976 0.33
    sex    NA        NA    0.000    NA   NA

Surely the user wants the dot to mean all the other variables but not the ones that are already in the model, like sex. I could also develop some code (again perhaps clumsily) to search after the fact for variables (like sex) that shouldn't be in there.

Approach #3: Require the user to first describe a separate study design object

Lastly I looked at the design for the survey package. This package first requires the user to create an object that describes the key components of the dataset.
So I would have the user do something like this:

    mystudy <- study.design(weights=~w,
                            observationID=~ID,
                            strata=~site,
                            data=mydata)
    myresults <- doanalysis(formula=y~X1+X2+X3, design=mystudy)

But it seems that the survey package is also not designed to handle the dot.

    data(api)
    dstrat <- svydesign(id=~1, strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
    svyglm(api00~., design=dstrat)
    Error in svyglm.survey.design(api00 ~ ., design = dstrat) :
      all variables must be in design= argument

Does anyone have advice on how best to handle this?

1. Tell my tester "Tough, you can't use dots in a formula in my package," which is essentially what the survey package seems to do. Encourage the use of survey::make.formula()?
2. Fix Approach #1 to search for duplicates in the weights, observation ID, and strata parameters. Any elegant ways to do that?
3. Fix Approach #2, the coxph style, to try to remove redundant covariates. Not sure if there's a graceful way not involving string matching.
4. Any existing elegant approaches to interpreting the dot? Or should I just do string matching to delete duplicate variables from the terms object?

Thanks,
Greg

Greg Ridgeway
Associate Professor
University of Pennsylvania

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
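
P.S. For option 2 (keep the separate role formulas of Approach #1 but stop the dot from picking up the role variables), here is a minimal sketch that avoids string matching. It relies on terms() expanding "." from the column names of its data argument, so hiding the role columns before expansion keeps them out of the model. The helper name expand_dot and the toy mydata are made up for illustration. This is a sketch under those assumptions, not a tested solution.

```r
# Hypothetical helper (not from any package): expand "." in `formula`
# against `data` after hiding the variables named in the role formulas.
expand_dot <- function(formula, data, ...) {
  # Collect variable names from one-sided formulas such as ~w, ~ID, ~site
  role_vars <- unique(unlist(lapply(list(...), all.vars)))
  # terms() expands "." using names(data), so drop the role columns first
  keep <- setdiff(names(data), role_vars)
  formula(terms(formula, data = data[, keep, drop = FALSE]))
}

# Toy data standing in for mydata (assumed column names)
mydata <- data.frame(y = rnorm(6), X1 = rnorm(6), X2 = rnorm(6), X3 = rnorm(6),
                     w = 1, ID = 1:6, site = rep(c("a", "b"), 3))

f <- expand_dot(y ~ ., mydata,
                weights = ~w, observationID = ~ID, strata = ~site)
print(f)
```

The expanded formula can then be passed to model.frame() with the full data, so the role variables are still available for weights, IDs, and strata. A variable the user names explicitly (for example y ~ . + w) is left in place, since only the dot expansion is affected.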