In our recent workshop on "Multilevel Modeling in R" we discussed handling data for multilevel modeling. An classic example of such data are test scores of students grouped into schools. We may wish to model the scores as functions of both student-level covariates and school-level covariates.
Such data are often organized in a multi-table format with a separate table for each level of information. The MathAchieve and MathAchSchool data frames in the nlme package are examples of such an organization. The HLM software requires the data to be organized like this. To fit a model in R we need to create a composite table by "joining" the columns of the student-level and school-level tables, in the relational database sense of "join". I have created a function to join the columns from two such frames according to the values of a key column. In relational database terms the key column must be a primary key for the second frame. I have called this function 'cjoin', by analogy to cbind. You can try data(MathAchieve, package = 'nlme') data(MathAchSchool, package = 'nlme') cjoin(MathAchieve, MathAchSchool, "School") cjoin(MathAchieve, MathAchSchool, "School", which = "Sector") as examples Several questions: - Am I duplicating existing functionality? - Is cjoin a good name for such a function? - Would this be useful in base?
"cjoin" <- function(fr1, fr2, cnm1, cnm2 = cnm1, which) { val = as.data.frame(fr1) vnms = names(val) cnm1 = as.character(cnm1) if (length(cnm1) < 1 || any(is.na(match(cnm1, vnms)))) { stop("cnm1 must be one or more column names from fr1") } fr2nm = names(fr2) if (!missing(cnm2)) { cnm2 = as.character(cnm2) if (length(cnm2) != length(cnm1) || any(is.na(match(cnm2, fr2nm)))) { stop(paste("cnm2 must be", length(cnm1), "column name(s) from fr2")) } } else { if (any(missed <- is.na(match(cnm2, fr2nm)))) { stop(paste("No columns named", paste(cnm2[missed], sep = ", "), "in fr2")) } } if (length(cnm1) == 1) { mcol = fr2[[cnm2]] if (any(is.na(mcol))) { warning(paste("Missing values in column", mcol, "of fr2 cannot be matched")) mcol = mcol[is.na(mcol)] } if (length(mcol) != length(unique(mcol))) { stop(paste("column", mcol, "must be a unique key to fr2")) } mm = match(fr1[[cnm1]], fr2[[cnm2]]) if (any(is.na(mm))) { stop("Values present in column", cnm1, "of fr1 are not present in column", cnm2, "of fr2") } if (missing(which)) { which = fr2nm[!(fr2nm %in% cnm2)] } else { which = which[!(which %in% cnm2)] if (any(missed <- is.na(match(which, fr2nm)))) { stop(paste("No columns named", paste(which[missed], sep = ", "), "in fr2")) } } if (any(dups <- !is.na(match(which, vnms)))) { warning(paste("Join operation will duplicate column name(s)", paste(which[dups], sep = ", "))) } return(cbind(val, fr2[mm, which])) } else { stop("Matches on multiple columns not yet implemented") } }
______________________________________________ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-devel