Heinz Tuechler wrote:
> At 20:39 14.07.2006 -0500, Frank E Harrell Jr wrote:
>> Heinz Tuechler wrote:
>>> At 11:02 13.07.2006 -0500, Frank E Harrell Jr wrote:
>>>> Heinz Tuechler wrote:
>>>>> At 08:11 13.07.2006 -0500, Frank E Harrell Jr wrote:
>>>>>> Heinz Tuechler wrote:
>>>>>>> At 13:14 12.07.2006 -0500, Marc Schwartz (via MN) wrote:
>>>>>>>> On Wed, 2006-07-12 at 17:41 +0100, Jol, Arne wrote:
>>>>>>>>> Dear R,
>>>>>>>>>
>>>>>>>>> I import data from SPSS into an R data.frame. On this raw data I do some
>>>>>>>>> data processing (selection of observations, normalization, recoding of
>>>>>>>>> variables etc.). The result is stored in a new data.frame; however, in
>>>>>>>>> this new data.frame the value labels are lost.
>>>>>>>>>
>>>>>>>>> Example of what I do in code:
>>>>>>>>>
>>>>>>>>> # read raw data from spss
>>>>>>>>> rawdata <- read.spss("./data/T50937.SAV",
>>>>>>>>>     use.value.labels=FALSE,to.data.frame=TRUE)
>>>>>>>>>
>>>>>>>>> # select the observations that we need
>>>>>>>>> diarydata <- rawdata[rawdata$D22==2 | rawdata$D22==3 | rawdata$D22==17 |
>>>>>>>>>     rawdata$D22==18 | rawdata$D22==20 | rawdata$D22==22 |
>>>>>>>>>     rawdata$D22==24 | rawdata$D22==33,]
>>>>>>>>>
>>>>>>>>> The result is that rawdata$D22 has value labels and that diarydata$D22
>>>>>>>>> is numeric without value labels.
>>>>>>>>>
>>>>>>>>> Question: How can I prevent this from happening?
>>>>>>>>>
>>>>>>>>> Thanks in advance!
>>>>>>>>> Greetings,
>>>>>>>>> Arne
>>>>>>>> Two things:
>>>>>>>>
>>>>>>>> 1. With respect to your subsetting, your lengthy code can be replaced
>>>>>>>> with the following:
>>>>>>>>
>>>>>>>> diarydata <- subset(rawdata, D22 %in% c(2, 3, 17, 18, 20, 22, 24, 33))
>>>>>>>>
>>>>>>>> See ?subset and ?"%in%" for more information.
>>>>>>>>
>>>>>>>>
>>>>>>>> 2. With respect to keeping the label related attributes, the
>>>>>>>> 'value.labels' attribute and the 'variable.labels' attribute will not by
>>>>>>>> default survive the use of "[".data.frame in R (see ?Extract
>>>>>>>> and ?"[.data.frame").
>>>>>>>>
>>>>>>>> On the other hand, based upon my review of ?read.spss, the SPSS value
>>>>>>>> labels should be converted to the factor levels of the respective
>>>>>>>> columns when 'use.value.labels = TRUE' and these would survive a
>>>>>>>> subsetting.
>>>>>>>>
>>>>>>>> If you want to consider a solution to the attribute subsetting issue,
>>>>>>>> you might want to review the following post by Gabor Grothendieck in
>>>>>>>> May, which provides a possible solution:
>>>>>>>>
>>>>>>>> https://stat.ethz.ch/pipermail/r-help/2006-May/106308.html
>>>>>>>>
>>>>>>>> and this post by me, for an explanation of what is happening in Gabor's
>>>>>>>> solution:
>>>>>>>>
>>>>>>>> https://stat.ethz.ch/pipermail/r-help/2006-May/106351.html
>>>>>>>>
>>>>>>>> HTH,
>>>>>>>>
>>>>>>>> Marc Schwartz
>>>>>>>>
>>>>>>> Hello Marc and Arne,
>>>>>>>
>>>>>>> I worked on the suggestions of Gabor and Marc and programmed some functions
>>>>>>> in this way, but they are very, very preliminary (see below).
>>>>>>> In my view there is a lack of convenient possibilities in R to document
>>>>>>> empirical data by variable labels, value labels, etc. I would prefer to
>>>>>>> have these possibilities in the "standard" configuration.
>>>>>>> So I sketched a concept, but in my view it would only be useful if there
>>>>>>> was some acceptance by the core developers of R.
>>>>>>>
>>>>>>> The concept would be to define a class. For now I call it "source.data".
>>>>>>> To make it more flexible than the Hmisc class "labelled" I would define a
>>>>>>> related option "source.data.attributes" with default c('value.labels',
>>>>>>> 'variable.name', 'label'). This option contains all attributes that should
>>>>>>> persist in subsetting/indexing.
>>>>>>>
>>>>>>> I made only some very, very preliminary tests with these functions, mainly
>>>>>>> because I am not happy with defining a new class. Instead I would prefer
>>>>>>> if this functionality could be integrated in the Hmisc class "labelled",
>>>>>>> since this is in my view the best known starting point for data
>>>>>>> documentation in R.
>>>>>>>
>>>>>>> I would be happy if there were some discussion about the wishes/needs of
>>>>>>> other R users concerning data documentation.
>>>>>>>
>>>>>>> Greetings,
>>>>>>>
>>>>>>> Heinz
>>>>>> I feel that separating variable labels and value labels and just using
>>>>>> factors for value labels works fine, and I would urge you not to create
>>>>>> a new system that will not benefit from the many Hmisc functions that
>>>>>> use variable labels and units. [.data.frame in Hmisc keeps all attributes.
>>>>>>
>>>>>> Frank
>>>>>>
>>>>> Frank,
>>>>>
>>>>> of course I agree with you about the importance of Hmisc and, as I said, I
>>>>> do not want to define a new class, but in my view factors are no good
>>>>> substitute for value labels.
>>>>> As the language definition (version 2.3.1 (2006-06-05) Draft, page 7) says:
>>>>> "Factors are currently implemented using an integer array to specify the
>>>>> actual levels and a second array of names that are mapped to the integers.
>>>>> Rather unfortunately users often make use of the implementation in order to
>>>>> make some calculations easier."
>>>>> So, in my view, the levels represent the "values" of the factor.
>>>>> This is inconvenient if you want to use value labels in different
>>>>> languages. Further, I do not see a simple method to label numerical
>>>>> variables. I often encounter discrete, but still metric data, e.g. risk
>>>>> scores. Usually it would be nice to use them in their original coding,
>>>>> which may include zero or decimal places, and to label them at the same time.
>>>>> Personally at the moment I try to solve this problem by following a
>>>>> suggestion of Martin, Dimitis and others to use names instead. I doubt,
>>>>> however, that this is a good solution, but at least it makes it possible to
>>>>> have the source data numerically coded and in this sense "language free"
>>>>> (see first attempts of functions below).
>>>>>
>>>>> Heinz
>>>>>
>>>> Those are excellent points Heinz. I addressed that problem partially in
>>>> sas.get - see the sascodes attribute.
>>>>
>>>> Frank
>>>>
>>> Frank, I looked at your function sas.get. You solved the problem with a lot
>>> of effort. Don't you think that it would be easier to create just one new
>>> class, say "documented", which offers the possibility to represent the
>>> original data as it is and to add all the useful descriptions like variable
>>> labels, value labels, units, special missing values, and maybe others?
>>> If I remember correctly, SAS, SPSS and BMDP have offered these possibilities
>>> for many years, and in my view for good reason. I have been thinking about
>>> these questions since I started using R about two years ago and I wonder why
>>> there seems to be so little interest in them.
>>> In my work good documentation of the _unchanged_ data is very important,
>>> also because it eases checking the data for errors.
>>>
>>> Heinz
>>>
>>>
>>>>> ...snip...
>>>
>>>
>> Heinz - the code is quite small and simple, not much effort. And
>> variable labels need to be attributes to individual variables, otherwise
>> plotting, latex, and other functions can't get access to them (e.g.,
>> in Hmisc xYplot(y ~ x) labels for x and y, and units of measurement, get
>> plotted on axes). I've been having all the SAS, SPSS, and BMDP
>> capabilities you've mentioned in R/S-Plus (plus units attributes not
>> available in those) for years.
>>
>> What would make all this even easier is for R to be told a list of
>> attribute names that would always carry with subsetting, so that
>> special subsetting methods such as [.labelled would not be necessary.
>>
>> Frank
>>
>> --
>> Frank E Harrell Jr   Professor and Chair           School of Medicine
>>                      Department of Biostatistics   Vanderbilt University
>>
>>
>
> Frank - maybe I did not understand you right, but it seems that you propose
> exactly what I did initially. Yes, I agree with you that it would ease the
> situation if there were a list of respected attributes. However, I suspect
> that it could be a computational burden to copy these attributes in any
> case. So I would suggest to define a class that typically would be assigned
> to raw data and to define an option that sets all the attributes which
> should be copied.
> Would you think this issue could/should be discussed in r-devel?
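
For readers following the thread, the class-plus-option idea quoted above might look roughly like the sketch below. The class name "source.data" and the option "source.data.attributes" are taken from the quoted proposal; everything else (the method body, the example vector `score` and its labels) is an assumption for illustration and is not the snipped code.

```r
# Hypothetical sketch of the quoted proposal, not the original (snipped) code.
# Attributes named in this option are re-attached after subsetting.
options(source.data.attributes = c("value.labels", "variable.name", "label"))

"[.source.data" <- function(x, ...) {
  keep  <- getOption("source.data.attributes")
  saved <- attributes(x)[intersect(names(attributes(x)), keep)]
  y <- NextMethod("[")   # default subsetting drops all but names/dim/dimnames
  for (nm in names(saved)) attr(y, nm) <- saved[[nm]]
  class(y) <- oldClass(x)
  y
}

# A numerically coded risk score with labels (made-up example data)
score <- structure(c(0, 1.5, 3, 1.5),
                   label = "Risk score",
                   value.labels = c(low = 0, medium = 1.5, high = 3),
                   class = "source.data")

sub <- score[score > 1]
attr(sub, "label")          # the label is still attached
attr(sub, "value.labels")   # and so are the value labels
```

Only the attributes listed in the option are copied, so the cost of carrying them stays limited to objects of this class.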

Yes, r-devel would be the place. In retrospect a single attribute such
as varExtras would have been good - it could contain label, units, etc.
But my functions are too well established for me to change now. I'd
have to change too much code.

Frank

>
> Heinz

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
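
As an illustration of the single-attribute idea mentioned in the reply, a minimal sketch could store all documentation in one list attribute. The attribute name "varExtras" comes from the message; the helpers `set_extras()` and `get_extra()` and the example data are hypothetical and do not exist in Hmisc.

```r
# Hypothetical helpers: keep label, units, etc. in one list attribute.
set_extras <- function(x, ...) {
  old <- attr(x, "varExtras")
  if (is.null(old)) old <- list()
  attr(x, "varExtras") <- utils::modifyList(old, list(...))
  x
}

get_extra <- function(x, name) attr(x, "varExtras")[[name]]

# Made-up example variable
sbp <- set_extras(c(120, 135, 142),
                  label = "Systolic blood pressure",
                  units = "mmHg")
get_extra(sbp, "units")

# After default subsetting only one attribute has to be restored
sub <- sbp[sbp > 130]
attr(sub, "varExtras") <- attr(sbp, "varExtras")
get_extra(sub, "label")
```

With this layout a subsetting method would need to copy just one attribute, whatever documentation fields are added later.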