I find that Harrell's describe() (in the Hmisc package) provides some of
that desired functionality. When I am creating a paper codebook I print
the results of describe() for a dataframe to create an overview
snapshot, and I post a copy of str(dfname) on the wall.
As his help page says:
"describe is especially useful for describing data frames created by
*.get, as labels, formats, value labels, and (in the case of sas.get)
frequencies of special missing values are printed."
I believe that Frank has developed some functions to replicate SAS's
subtyping of NA values, although I have not explored such facilities.
I also find that summary(dfname) provides some useful information that
describe does not.
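A minimal sketch of the three views side by side, assuming a made-up dataframe called dfname (the columns are invented for illustration). The describe() call is guarded so the example still runs where Hmisc is not installed:

```r
# Toy dataframe; names and values are invented for illustration only.
dfname <- data.frame(
  id   = 1:4,
  bmi2 = c(21.5, NA, 30.2, 27.3),
  sex  = factor(c("F", "M", "F", NA)),
  stringsAsFactors = FALSE
)

# Compact structural overview: one line per variable with its type.
str(dfname)

# Per-variable summaries (quantiles for numerics, counts for factors).
summary(dfname)

# Hmisc's describe(), run only if the package is available.
if (requireNamespace("Hmisc", quietly = TRUE)) {
  print(Hmisc::describe(dfname))
}
```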
--
David.
On Oct 28, 2009, at 1:27 PM, Jacob Wegelin wrote:
Often it is useful to keep a "codebook" to document the contents of
a dataset. (By "dataset" I mean
a rectangular structure such as a dataframe.)
The codebook has as many rows as the dataset has columns (variables,
fields). The columns (fields)
of the codebook may include:
• variable name
• type (character, factor, integer, etc)
• variable label (e.g., a variable called "bmi2" might be
labeled "BMI hand-input by
clinic personnel, must be checked")
• permissible values
• which values indicate missing (and potentially different
kinds of missing)
Some statistics software (e.g., SPSS and Stata) provides at least a
subset of this kind of
information automatically in a convenient form. For instance, in
Stata one can define a "label" for
a variable and it is thenceforth linked to the variable. In output
from certain modeling and
graphics functions, Stata by default uses the label rather than the
variable name.
Furthermore: In Stata, if "myvariable" is labeled numeric (in R
lingo, a factor), and I type
codebook myvariable
then Stata tells me, among other things, the "levels" of myvariable.
Does a tool of this sort exist in R?
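As a partial illustration only, base R can approximate two of these pieces: a label can be stored as an ordinary attribute on a column, and levels() reports the levels of a factor. The sketch below assumes a toy dataframe, and label_or_name is a hypothetical helper, not a function from any package. Note the main weakness: a plain attribute is not carried along when the column is subset.

```r
# Toy data; the variable names and values are made up for illustration.
df <- data.frame(
  bmi2 = c(21.5, 27.3, 30.2),
  myvariable = factor(c("low", "high", "low"), levels = c("low", "high"))
)

# Attach a label to the column as an ordinary attribute.
attr(df$bmi2, "label") <- "BMI hand-input by clinic personnel, must be checked"

# Hypothetical helper: return the label if present, else the variable name.
label_or_name <- function(data, var) {
  lab <- attr(data[[var]], "label", exact = TRUE)
  if (is.null(lab)) var else lab
}

label_or_name(df, "bmi2")        # the stored label
label_or_name(df, "myvariable")  # no label set, so falls back to the name

# The "levels" that Stata's codebook command would list for a factor.
levels(df$myvariable)

# Subsetting the column drops the attribute, so the link is fragile.
attr(df$bmi2[1:2], "label")
```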
The prompt() function is related to this, but prompt(someDataFrame)
creates a text file on disk. The
text file is associated with, but not unambiguously linked to,
someDataFrame.
The epicalc function codebook() provides a summary of a dataframe
similar to that created by
summary() but easier to read. But this is not a way to define and
keep track of labels that are
linked to variables.
To link a dataframe to its codebook, one could do the following "by
hand": Create a list, say,
"somedata", where somedata$DATA is a dataframe that contains the
data, and somedata$VARIABLE is also
a dataframe, but serves as the codebook. For instance, the following
function creates a template
that one could subsequently edit to insert variable labels and
then store as somedata$VARIABLE.
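fnJunk <- function(THESEDATA) {
  # From a dataframe, make the start of a codebook:
  # one row per variable, with empty label and comment fields to fill in.
  if (!is.data.frame(THESEDATA)) stop("THESEDATA must be a data frame")
  data.frame(
    Variable = names(THESEDATA),
    # collapse multi-element classes such as c("ordered", "factor")
    class = sapply(THESEDATA, function(x) paste(class(x), collapse = "/")),
    type = sapply(THESEDATA, typeof),
    label = "",
    comment = "",
    stringsAsFactors = FALSE  # keep label and comment editable as text
  )
}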
But the following automatic behavior would be nice:
• We should be able to treat somedata exactly as we treat a
dataframe, so that the
fact that it possesses a "codebook" is merely an added benefit, not
an interference with the
usual tasks.
• If we delete a column of somedata$DATA, the associated row
of somedata$VARIABLE
should be automatically deleted.
• If we add a column to somedata$DATA, the associated row
should be inserted in
somedata$VARIABLE, and some of the fields automatically populated,
such as variable name and
type.
It could get fancier. For instance:
• If we try to add a value to a field in somedata$DATA which
is not permitted by the
"permissible values" listed for this field in somedata$VARIABLE, we
get an error.
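The synchronization rules in the list above can be sketched with a few plain functions operating on the list. This is only an illustration under assumptions: every name below (new_coded, add_column, drop_column, set_value, and the PERMITTED component) is hypothetical, and the sketch does not give the transparent dataframe behavior asked for in the first bullet, since changes must go through the helpers.

```r
# All names below are hypothetical; none come from an existing package.
describe_class <- function(x) paste(class(x), collapse = "/")

# Wrap a dataframe together with an automatically generated codebook.
new_coded <- function(data) {
  stopifnot(is.data.frame(data))
  codebook <- data.frame(
    Variable = names(data),
    class = unname(vapply(data, describe_class, character(1))),
    label = "",
    stringsAsFactors = FALSE
  )
  list(DATA = data, VARIABLE = codebook, PERMITTED = list())
}

# Stop if any non-missing value falls outside the permitted set.
check_permitted <- function(values, permitted, name) {
  if (is.null(permitted)) return(invisible(TRUE))
  bad <- !(is.na(values) | values %in% permitted)
  if (any(bad)) {
    stop("value(s) not permitted for ", name, ": ",
         paste(unique(values[bad]), collapse = ", "))
  }
  invisible(TRUE)
}

# Adding a column also appends a codebook row with name and class filled in.
add_column <- function(x, name, values, label = "", permitted = NULL) {
  if (name %in% names(x$DATA)) stop("column already exists: ", name)
  check_permitted(values, permitted, name)
  x$DATA[[name]] <- values
  x$VARIABLE <- rbind(
    x$VARIABLE,
    data.frame(Variable = name, class = describe_class(values),
               label = label, stringsAsFactors = FALSE)
  )
  if (!is.null(permitted)) x$PERMITTED[[name]] <- permitted
  x
}

# Deleting a column also deletes its codebook row and permitted values.
drop_column <- function(x, name) {
  x$DATA[[name]] <- NULL
  x$VARIABLE <- x$VARIABLE[x$VARIABLE$Variable != name, , drop = FALSE]
  x$PERMITTED[[name]] <- NULL
  x
}

# Assigning a value is checked against the permitted values, if any are listed.
set_value <- function(x, row, name, value) {
  check_permitted(value, x$PERMITTED[[name]], name)
  x$DATA[row, name] <- value
  x
}

# Usage with made-up data.
somedata <- new_coded(data.frame(id = 1:3, sex = c("F", "M", "F"),
                                 stringsAsFactors = FALSE))
somedata <- add_column(somedata, "smoker", c("yes", "no", "no"),
                       label = "Current smoker (self-report)",
                       permitted = c("yes", "no"))
somedata <- drop_column(somedata, "sex")
somedata$VARIABLE                               # rows for id and smoker only
try(set_value(somedata, 1, "smoker", "maybe"))  # rejected with an error
```

Getting the first bullet as well would presumably mean defining a class with its own assignment and subsetting methods rather than helper functions like these.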
Has anyone already thought this through, maybe defined a class and
associated methods?
Thanks
Jacob A. Wegelin
Assistant Professor
Department of Biostatistics
Virginia Commonwealth University
730 East Broad Street Room 3006
P. O. Box 980032
Richmond VA 23298-0032
U.S.A. E-mail: jwege...@vcu.edu URL:
http://www.people.vcu.edu/~jwegelin
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD
Heritage Laboratories
West Hartford, CT