Often it is useful to keep a "codebook" to document the contents of a dataset. (By "dataset" I mean a rectangular structure such as a dataframe.)
The codebook has as many rows as the dataset has columns (variables, fields). The columns (fields) of the codebook may include:

• variable name
• type (character, factor, integer, etc.)
• variable label (e.g., a variable called "bmi2" might be labeled "BMI hand-input by clinic personnel, must be checked")
• permissible values
• which values indicate missing (and potentially different kinds of missing)

Some statistics software (e.g., SPSS and Stata) provides at least a subset of this kind of information automatically in a convenient form. For instance, in Stata one can define a "label" for a variable, and it is thenceforth linked to the variable. In output from certain modeling and graphics functions, Stata by default uses the label rather than the variable name.

Furthermore, in Stata, if "myvariable" is labeled numeric (in R lingo, a factor), and I type

    codebook myvariable

then Stata tells me, among other things, the "levels" of myvariable.

Does a tool of this sort exist in R?

The prompt() function is related to this, but prompt(someDataFrame) creates a text file on disk. The text file is associated with, but not unambiguously linked to, someDataFrame.

The epicalc function codebook() provides a summary of a dataframe similar to that created by summary() but easier to read. But this is not a way to define and keep track of labels that are linked to variables.

To link a dataframe to its codebook, one could do the following "by hand": Create a list, say, "somedata", where somedata$DATA is a dataframe that contains the data, and somedata$VARIABLE is also a dataframe, but serves as the codebook. For instance, the following function creates a template that one could subsequently edit to insert variable labels and then use as somedata$VARIABLE.

    fnJunk <- function(THESEDATA) {
      # From a dataframe, make the start of a codebook.
      if (!is.data.frame(THESEDATA)) stop("THESEDATA must be a data frame")
      data.frame(
        Variable = names(THESEDATA),
        # Keep only the first class so that multi-class columns
        # (e.g., ordered factors) still give one value per variable.
        class = sapply(THESEDATA, function(x) class(x)[1]),
        type = sapply(THESEDATA, typeof),
        label = "",
        comment = "",
        # Keep label and comment as character so they can be edited.
        stringsAsFactors = FALSE
      )
    }

But the following automatic behavior would be nice:

• We should be able to treat somedata exactly as we treat a dataframe, so that the fact that it possesses a "codebook" is merely an added benefit, not an interference with the usual tasks.
• If we delete a column of somedata$DATA, the associated row of somedata$VARIABLE should be automatically deleted.
• If we add a column to somedata$DATA, the associated row should be inserted in somedata$VARIABLE, and some of the fields, such as variable name and type, automatically populated.

It could get fancier. For instance:

• If we try to add a value to a field in somedata$DATA which is not permitted by the "permissible values" listed for this field in somedata$VARIABLE, we get an error.

Has anyone already thought this through, maybe defined a class and associated methods?

Thanks

Jacob A. Wegelin
Assistant Professor
Department of Biostatistics
Virginia Commonwealth University
730 East Broad Street Room 3006
P. O. Box 980032
Richmond VA 23298-0032 U.S.A.
E-mail: jwege...@vcu.edu
URL: http://www.people.vcu.edu/~jwegelin
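
P.S. To make the wish list above concrete, here is a minimal sketch of one possible approach, not an existing package and not a full class with methods. It assumes that each label is stored as an attribute called "label" on the column itself and that the codebook is rebuilt from the dataframe whenever it is needed, so deleting or adding a column changes the codebook without any separate bookkeeping. The function names set_label() and make_codebook() and the attribute name are hypothetical, chosen only for illustration.

```r
# Hypothetical helpers for illustration; not part of any existing package.

# Attach a label to one column, stored as the attribute "label".
set_label <- function(df, variable, label) {
  stopifnot(is.data.frame(df), variable %in% names(df))
  attr(df[[variable]], "label") <- label
  df
}

# Build the codebook on demand: one row per column of df.
make_codebook <- function(df) {
  stopifnot(is.data.frame(df))
  get_label <- function(x) {
    lab <- attr(x, "label", exact = TRUE)
    if (is.null(lab)) "" else as.character(lab)[1]
  }
  data.frame(
    Variable = names(df),
    class = vapply(df, function(x) class(x)[1], character(1), USE.NAMES = FALSE),
    type = vapply(df, typeof, character(1), USE.NAMES = FALSE),
    label = vapply(df, get_label, character(1), USE.NAMES = FALSE),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Usage with made-up data.
clinic <- data.frame(id = 1:3, bmi2 = c(21.5, 30.2, 27.8))
clinic <- set_label(clinic, "bmi2",
                    "BMI hand-input by clinic personnel, must be checked")
cb_before <- make_codebook(clinic)

clinic$id <- NULL                       # delete a column ...
clinic$sex <- factor(c("f", "m", "f"))  # ... and add a new one
cb_after <- make_codebook(clinic)       # rows follow the columns
print(cb_after)
```

Because the codebook is derived rather than stored, it cannot drift out of step with the columns. The sketch does not check permissible values, and row subsetting with [ can drop attributes from plain vectors, so keeping labels through every operation would still need a proper class with its own methods.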
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help