My students are working with several SPSS datasets provided by the European Social Survey. If you register your name, you can download them too. This is the 2004 data, for example:
http://ess.nsd.uib.no/ess/round2/

I cannot give you the European Social Survey dataset, but you can download it for free if you like, and then you could run these commands to reproduce the weird pattern described below.

library(foreign)
d2 <- read.spss("ESS3e03_2.por")
warnings()
str(d2$HAPPY)
d2 <- as.data.frame(d2)
str(d2$HAPPY)
d2 <- read.spss("ESS3e03_2.por",to.data.frame=T)
warnings()
str(d2$HAPPY)

Here's my info for this example:

> sessionInfo()
R version 2.10.0 (2009-10-26)
x86_64-pc-linux-gnu

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] foreign_0.8-38

The weirdness that follows is the difference between

d2 <- read.spss( ... , to.data.frame=T)

and

d2 <- read.spss ()
d2 <- as.data.frame(d2)

The former causes all data to become <NA> but the latter seems mostly OK.

> library(foreign)
> d2 <- read.spss("ESS3e03_2.por")
There were 12 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In `levels<-`(`*tmp*`, value = c("CENTRUMP", "", "FIDESZ",  ... :
  duplicated levels will not be allowed in factors anymore
2: In `levels<-`(`*tmp*`, value = c("CENTRUMP", "", "FIDESZ",  ... :
  duplicated levels will not be allowed in factors anymore
3: In `levels<-`(`*tmp*`, value = c("Refusal", "Don't know",  ... :
  duplicated levels will not be allowed in factors anymore
4: In `levels<-`(`*tmp*`, value = c("No second language mentioned",  ... :
  duplicated levels will not be allowed in factors anymore
5: In `levels<-`(`*tmp*`, value = c("Sans dipl", "Non dipl",  ... :
  duplicated levels will not be allowed in factors anymore
6: In `levels<-`(`*tmp*`, value = c("\"Ej avslutad folkskola/grundskola\"",  ... :
  duplicated levels will not be allowed in factors anymore
7: In `levels<-`(`*tmp*`, value = c("Armed forces", "Legislators, senior officials and managers",  ... :
  duplicated levels will not be allowed in factors anymore
8: In `levels<-`(`*tmp*`, value = c("Armed forces", "Legislators, senior officials and managers",  ... :
  duplicated levels will not be allowed in factors anymore
9: In `levels<-`(`*tmp*`, value = c("K", "K", "Frederiksborg Amt",  ... :
  duplicated levels will not be allowed in factors anymore
10: In `levels<-`(`*tmp*`, value = c("P", "L", "Kesk-Eesti",  ... :
  duplicated levels will not be allowed in factors anymore
11: In `levels<-`(`*tmp*`, value = c("Galicia", "Principado de Asturias",  ... :
  duplicated levels will not be allowed in factors anymore
12: In `levels<-`(`*tmp*`, value = c("Stockholm", "", "Sydsverige",  ... :
  duplicated levels will not be allowed in factors anymore
> str(d2$HAPPY)
 Factor w/ 14 levels "Extremely unhappy",..: 9 7 9 11 9 6 9 4 13 8 ...
> d2 <- as.data.frame(d2)
> str(d2$HAPPY)
 Factor w/ 14 levels "Extremely unhappy",..: 9 7 9 11 9 6 9 4 13 8 ...

That appears valid.

On my first effort, I had tried to get the data frame in a single shot with read.spss:

> d2 <- read.spss("ESS3e03_2.por",to.data.frame=T)
There were 15 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
2: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
3: In xi >= z[1L] | xi <= z[2L] | xi[xi == z[3L]] :
  longer object length is not a multiple of shorter object length
4: In `levels<-`(`*tmp*`, value = c("CENTRUMP", "", "FIDESZ",  ... :
  duplicated levels will not be allowed in factors anymore
5: In `levels<-`(`*tmp*`, value = c("CENTRUMP", "", "FIDESZ",  ... :
  duplicated levels will not be allowed in factors anymore
6: In `levels<-`(`*tmp*`, value = c("Refusal", "Don't know",  ... :
  duplicated levels will not be allowed in factors anymore
7: In `levels<-`(`*tmp*`, value = c("No second language mentioned",  ... :
  duplicated levels will not be allowed in factors anymore
8: In `levels<-`(`*tmp*`, value = c("Sans dipl", "Non dipl",  ... :
  duplicated levels will not be allowed in factors anymore
9: In `levels<-`(`*tmp*`, value = c("\"Ej avslutad folkskola/grundskola\"",  ... :
  duplicated levels will not be allowed in factors anymore
10: In `levels<-`(`*tmp*`, value = c("Armed forces", "Legislators, senior officials and managers",  ... :
  duplicated levels will not be allowed in factors anymore
11: In `levels<-`(`*tmp*`, value = c("Armed forces", "Legislators, senior officials and managers",  ... :
  duplicated levels will not be allowed in factors anymore
12: In `levels<-`(`*tmp*`, value = c("K", "K", "Frederiksborg Amt",  ... :
  duplicated levels will not be allowed in factors anymore
13: In `levels<-`(`*tmp*`, value = c("P", "L", "Kesk-Eesti",  ... :
  duplicated levels will not be allowed in factors anymore
14: In `levels<-`(`*tmp*`, value = c("Galicia", "Principado de Asturias",  ... :
  duplicated levels will not be allowed in factors anymore
15: In `levels<-`(`*tmp*`, value = c("Stockholm", "", "Sydsverige",  ... :
  duplicated levels will not be allowed in factors anymore
> str(d2$HAPPY)
 Factor w/ 13 levels "Extremely unhappy",..: NA NA NA NA NA NA NA NA NA NA ...

Oh, heck, all the values are missing!!

Somehow, putting "to.data.frame" inside the read.spss call causes a different outcome than using as.data.frame after reading in the data. The symptoms of this in R-2.9 are a little different, but the conclusion is the same.

Help?

In case you are a student who wants to work with this data, I can share with you the large script that I have been accumulating so that you might "play along".
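One way to track down which variables trigger the "duplicated levels" warnings might be to read the file with use.value.labels = FALSE and look for repeated label text. The sketch below is only an illustration: it assumes read.spss stores the labels in a "value.labels" attribute whose names are the label text, findDupLabels is a hypothetical helper, and toy is made-up data, not the ESS file.

```r
## Hypothetical helper: list the label text that occurs more than once
## in each variable's "value.labels" attribute (assumed layout:
## names = label text, values = SPSS codes).
findDupLabels <- function(dat) {
  dups <- lapply(dat, function(x) {
    labs <- names(attr(x, "value.labels"))
    unique(labs[duplicated(labs)])
  })
  ## keep only the variables that have at least one repeated label
  dups[sapply(dups, length) > 0]
}

## Toy stand-in for read.spss(..., use.value.labels = FALSE) output
toy <- list(
  REGION = structure(c(1, 2, 3),
                     value.labels = c("K" = 3, "K" = 2,
                                      "Frederiksborg Amt" = 1)),
  HAPPY  = structure(c(0, 10),
                     value.labels = c("Extremely happy" = 10,
                                      "Extremely unhappy" = 0)),
  AGE    = c(30, 41)
)

findDupLabels(toy)
```

With the real data, the same call would be made on the list returned by read.spss; whether the variables it flags match the twelve warnings is something to check, not something this sketch guarantees.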
It turns out to be surprisingly difficult to "recode" these factor variables that have levels like "none", "1", "2",..."9", "total".

## Paul Johnson
## November 13, 2009

## A question arose in the lab. A student asks "I want
## to compare the answers from two different editions
## of the European Social Survey."

## I will add this to Stuff Worth Knowing later, but
## I can share this tutorial with you right now.

## From this website:
## http://ess.nsd.uib.no/ess
## Download those European Social Survey Datasets into a directory.
## In a terminal, use the unzip command:

## unzip ESS3e03_2.spss.zip
## unzip ESS2e03_1.spss.zip

## Then run the following in R.

library(foreign)

d2 <- read.spss("ESS3e03_2.por",to.data.frame=T)

d2 <- read.spss("ESS3e03_2.por")
warnings()
### You can try to go into a data frame in one
### step, that's an option in read.spss. But
### we saw warnings, and wanted to be careful.
d2 <- as.data.frame(d2)
d2$whichSurvey <- 2

d3 <- read.spss("ESS2e03_1.por")
d3 <- as.data.frame(d3)
d3$whichSurvey <- 3

namesd2 <- names(d2)
namesd3 <- names(d3)

commonNames <- intersect( namesd3, namesd2)

combod23 <- rbind(d2[ , commonNames], d3[, commonNames])

save(combod23, file="combod23.Rda")

## Error
##Warning messages:
##1: In `[<-.factor`(`*tmp*`, ri, value = c(NA, NA, NA, NA, NA, NA, NA,  :
##  invalid factor level, NAs generated
##2: In `[<-.factor`(`*tmp*`, ri, value = c(NA, NA, NA, NA, NA, NA, NA,  :
##  invalid factor level, NAs generated
##3: In `[<-.factor`(`*tmp*`, ri, value = c(1, 1, 1, 1, 1, 1, 1, 1, 1,  :
##  invalid factor level, NAs generated

## That worries me a little bit. The warnings did too.

## Inspect a few lines in the result.

combod23[1:4, ]

## fix doesn't work for me, did not bother to investigate.
##> fix(combod23)
##Error in edit.data.frame(get(subx, envir = parent), title = subx, ...) :
##  can only handle vector and factor elements

## That means some data from hell came into this thing.

## I suspect that combod23 is OK.
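## One possible source of the "invalid factor level" warnings is a
## column that is a factor in one file but numeric in the other.
## A minimal sketch of a check follows. The data frames a and b are
## toy stand-ins for d2 and d3 (made-up values, not the ESS data),
## and the names common and sameClass are hypothetical.

```r
## Toy data: HAPPY is a factor in a but numeric in b
a <- data.frame(HAPPY = factor(c("1", "2")), AGE = c(30, 41))
b <- data.frame(HAPPY = c(3, 4), AGE = c(25, 60))

common <- intersect(names(a), names(b))

## TRUE where the column has the same class in both data frames
sameClass <- sapply(common, function(v) {
  identical(class(a[[v]]), class(b[[v]]))
})

## Columns that deserve a closer look before rbind
common[!sameClass]
```

## Running the same comparison on d2 and d3 with commonNames would
## show whether any shared column changes type between the files.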
## The memory use on this exercise is huge! Try to help it
rm (d2)
rm (d3)

## But I worry. I have 2 ways that I use to try to figure this
## out. One is to open the dataset in a clone of SPSS called
## "PSPP". Actually, the executable is "psppire".
##
## The other thing I do is open the same data again in
## a numeric format, and compare the 2 combined data frames.
## This is also a useful exercise because it helps you
## understand what a "factor" is in R.

dn2 <- read.spss("ESS3e03_2.por", use.value.labels = F)
dn2 <- as.data.frame(dn2)
dn2$whichSurvey <- 2

dn3 <- read.spss("ESS2e03_1.por", use.value.labels = F)
dn3 <- as.data.frame(dn3)
dn3$whichSurvey <- 3

## Might be smart to compare
# dn2$HAPPY[1:50]
# d2$HAPPY[1:50]

namesdn2 <- names(dn2)
namesdn3 <- names(dn3)

commonNNames <- intersect( namesdn3, namesdn2 )

combodn23 <- rbind(dn2[ , commonNNames], dn3[, commonNNames])

save(combodn23, file="combodn23.Rda")

table( combod23$HAPPY, combodn23$HAPPY)

## In summary, whenever I want to use a variable from
## the combined data frame, I would probably compare
## against combodn23 just to feel safe.

## Note: when you come back to work on this project again, you
## might as well just reload the saved copies of combod23 and
## combodn23.

## load("combod23.Rda")
## load("combodn23.Rda")

## That will put you at the current spot, no need to redo the merge.

## Now, about "recoding". If you just want numerical
## data, you might consider using combodn23.

## But if you want some factors and some numerical
## variables, then you might need to recode to reclaim
## values.

## HAPPY turns out to be an interesting example of a
## PAIN IN THE ASS because in SPSS, it is scored from
## 0 to 10, but they give value labels only for scores
##  0 = Extremely unhappy
## and
## 10 = Extremely happy
##
## And the SPSS column has no labels for values 1-9.
## If SPSS gave NO labels at all, then this would come
## into R as a numeric variable.
## BUT, because there are 2 levels named, R makes a factor
## out of it. When R turns it into a factor, you
## end up with a nutty looking factor, which has
## levels you don't really appreciate.

levels(combod23$HAPPY)

# [1] "Extremely unhappy" "1"                 "2"
# [4] "3"                 "4"                 "5"
# [7] "6"                 "7"                 "8"
#[10] "9"                 "Extremely happy"   "Refusal"
#[13] "Don't know"        "No answer"

## Create a new variable to play with
combod23$HAPPY2 <- combod23$HAPPY

## Change Extremely Unhappy to text "0"
levels(combod23$HAPPY)[1] <- "0"

## Change Extremely Happy to "10"
levels(combod23$HAPPY)[11] <- "10"

HELL <- levels(combod23$HAPPY)
### Look at HELL
HELL

combod23$HAPPY2[combod23$HAPPY %in% HELL[12:14] ] <- NA

##CHECK RESULT
table(combod23$HAPPY, combod23$HAPPY2)

## Eliminate the unused levels from HAPPY2
combod23$HAPPY2 <- factor(combod23$HAPPY2)

### Same is found with
## combod23$HAPPY2 <- combod23$HAPPY2[ , drop=T]

## Use the "factor trick" to
## reset the variable back to numeric.
## HELL holds the levels of HAPPY, so index by HAPPY;
## as.numeric() warns about "Refusal", "Don't know" and
## "No answer", which become NA.
combod23$HAPPYN <- as.numeric(HELL)[combod23$HAPPY]

##CHECK RESULT
table(combod23$HAPPY, combod23$HAPPY2)

## CHECK by comparing against numeric data from spss
table(combodn23$HAPPY, combod23$HAPPYN)

## Next, a student asks "how can I make that same recode
## on a lot of variables?" I'm going to have to leave
## that one unanswered. I think the answer will probably
## be to get a list of variables, then use "lapply" to
## do the same thing to each variable in turn. But
## I have not written up a simple, understandable example
## yet.

## After the data is all recoded and homogenized, then we
## could run any analysis we want, and throw in the variable
## "whichSurvey" to see if there is a difference between the
## two models.

## Example, choose your y and x1 and x2, then

## mod <- lm(y~ (x1+x2)*whichSurvey, data=combod23)

## or if you think the difference is just in the intercept:

## mod <- lm(y~ x1+x2 + whichSurvey, data=combod23)

--
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas
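On the unanswered question of applying the same recode to many variables, here is a minimal sketch of the lapply idea. It is not tested on the ESS files. It assumes every variable in the list shares the 14-level layout shown for HAPPY (a labelled low end first, "1" to "9", a labelled high end in position 11, then the three missing-value labels). The names toy, scaleLevels and recodeToNumeric are hypothetical, and the STFLIFE column and its labels are made up for illustration.

```r
## Build the assumed 14-level layout for a 0-10 item
scaleLevels <- function(lo, hi) {
  c(lo, 1:9, hi, "Refusal", "Don't know", "No answer")
}

## Toy stand-in for two factor columns of combod23 (not real data)
toy <- data.frame(
  HAPPY   = factor(c("Extremely unhappy", "5", "Extremely happy", "Refusal"),
                   levels = scaleLevels("Extremely unhappy",
                                        "Extremely happy")),
  STFLIFE = factor(c("3", "Don't know", "Extremely satisfied",
                     "Extremely dissatisfied"),
                   levels = scaleLevels("Extremely dissatisfied",
                                        "Extremely satisfied"))
)

## Relabel the end points, blank out the missing-value labels, then
## use the "factor trick": index the numeric levels by the factor.
recodeToNumeric <- function(f,
                            missingLevels = c("Refusal", "Don't know",
                                              "No answer")) {
  lev <- levels(f)
  lev[1]  <- "0"    # assumes level 1 is the labelled low end
  lev[11] <- "10"   # assumes level 11 is the labelled high end
  lev[lev %in% missingLevels] <- NA
  as.numeric(lev)[f]
}

## Apply the same recode to each listed variable in turn
vars <- c("HAPPY", "STFLIFE")
toy[paste0(vars, "N")] <- lapply(toy[vars], recodeToNumeric)

toy
```

Each new column (HAPPYN, STFLIFEN) can then be checked against the numeric import with table(), in the same way HAPPYN is compared with combodn23$HAPPY.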