Suppose X and Y are two data frames with the same structure, variable names and dimensions, but with different data and different patterns of missing values. I want to replace the missing values in Y with the corresponding values from X. I'll construct a simple two-by-two case:

X <- as.data.frame(matrix(c("a","b",1,2),2,2), stringsAsFactors=FALSE)
X[,2] <- as.integer(X[,2])
str(X)
'data.frame':   2 obs. of  2 variables:
  $ V1: chr  "a" "b"
  $ V2: int  1 2

Y <- as.data.frame(matrix(c("c","d",NA,4),2,2), stringsAsFactors=FALSE)
Y[,2] <- as.integer(Y[,2])
str(Y)
'data.frame':   2 obs. of  2 variables:
  $ V1: chr  "c" "d"
  $ V2: int  NA 4

This seems to be what I want to do...

Y[is.na(Y)] <- X[is.na(Y)]

...and it works except that the structure of Y is changed so that Y$V2 is now of type chr instead of type int:

str(Y)
'data.frame':   2 obs. of  2 variables:
  $ V1: chr  "c" "d"
  $ V2: chr  "1" "4"

This behavior makes sense because the vector X[is.na(Y)] is of the character type:

is.character(X[is.na(Y)])
[1] TRUE
str(X[is.na(Y)])
  chr "1"
X[is.na(Y)]
[1] "1"

The last couple of results seem weird at first. The "1" was originally an integer but now it is a character. This *must* be because the type is fixed before R looks at which elements are selected: when a data frame is indexed by a logical matrix such as is.na(Y), R first converts the whole data frame to a matrix, and a matrix that contains a character column is a character matrix. Only afterward does R find that just one of the four elements of X is selected, but by then the result is character no matter which elements, or how many, are picked.
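A quick way to check that reading, as a sketch only: compare matrix-style indexing of the data frame with indexing of as.matrix() applied to it. X2, Y2, M and idx are names introduced here for illustration so the snippet runs on its own.

```r
# Rebuild the two-by-two example (X2 and Y2 are illustrative names)
X2 <- data.frame(V1 = c("a", "b"), V2 = 1:2, stringsAsFactors = FALSE)
Y2 <- data.frame(V1 = c("c", "d"), V2 = c(NA, 4L), stringsAsFactors = FALSE)

# A matrix holds a single type, so the integer column is turned into character
M <- as.matrix(X2)
typeof(M)

# Indexing the data frame by a logical matrix matches indexing the
# character matrix by the same logical matrix
idx <- is.na(Y2)
identical(X2[idx], M[idx])

# Indexing the column itself never leaves the integer type
X2$V2[is.na(Y2$V2)]
```

The last line suggests that working one column at a time avoids the conversion.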

Suppose there were no NA elements in Y. What should we expect to see if we repeat the steps above?

Y <- as.data.frame(matrix(c("c","d",3,4),2,2), stringsAsFactors=FALSE)
Y[,2] <- as.integer(Y[,2])
str(Y)
'data.frame':   2 obs. of  2 variables:
  $ V1: chr  "c" "d"
  $ V2: int  3 4

Even though X[is.na(Y)] has no elements, the empty result is still a vector of type chr:

is.vector(X[is.na(Y)])
[1] TRUE
is.character(X[is.na(Y)])
[1] TRUE
str(X[is.na(Y)])
  chr(0)
X[is.na(Y)]
character(0)

So what happens if we do this...

Y[is.na(Y)] <- X[is.na(Y)]

...will it change the structure of Y so that Y$V2 becomes type chr?

str(Y)
'data.frame':   2 obs. of  2 variables:
  $ V1: chr  "c" "d"
  $ V2: int  3 4

No. I think there is an obvious reason for that: Y was not changed, and more specifically, Y$V2 was not changed, so no change was made to the variable types.

It all makes sense, but I want an easy way to maintain the structure of a data frame when I do this kind of operation. I ought to be able to do something like this, where get_types() and use_types() are made-up functions that record the column types and then restore them:

Ytypes <- get_types(Y)

Y[is.na(Y)] <- X[is.na(Y)]

Y <- use_types(Y, Ytypes)

That kind of system would ensure that the basic structure of the data frame can be maintained. I don't want to have to check by hand, and sometimes it would be impossible to do so.
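One way such a pair of helpers might look is sketched below. get_types() and use_types() are hypothetical, not functions from base R or any package, and the sketch assumes every column has a basic type (integer, numeric, character or logical) that methods::as() can convert to. Factors, dates and similar classes would need their own handling. X3 and Y3 are illustrative copies of the example data.

```r
# Hypothetical helper: record the first class of every column
get_types <- function(df) {
  vapply(df, function(col) class(col)[1], character(1))
}

# Hypothetical helper: coerce each column back to its recorded class.
# Assumes basic types only; factors and dates are not handled here.
use_types <- function(df, types) {
  for (j in seq_along(df)) {
    if (class(df[[j]])[1] != types[[j]]) {
      df[[j]] <- methods::as(df[[j]], types[[j]])
    }
  }
  df
}

X3 <- data.frame(V1 = c("a", "b"), V2 = 1:2, stringsAsFactors = FALSE)
Y3 <- data.frame(V1 = c("c", "d"), V2 = c(NA, 4L), stringsAsFactors = FALSE)

Y3types <- get_types(Y3)
Y3[is.na(Y3)] <- X3[is.na(Y3)]   # V2 becomes character at this point
Y3 <- use_types(Y3, Y3types)     # V2 is converted back to integer
str(Y3)
```

Because R functions work on copies, the result of use_types() has to be assigned back to the data frame.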

So what's the trick?  Is there a trick?
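One possible approach, offered as a sketch rather than the definitive answer, is to skip the matrix-style indexing and fill the gaps one column at a time. It assumes that corresponding columns of the two data frames already have the same type, as they do in the example. X4 and Y4 are illustrative copies of the data.

```r
X4 <- data.frame(V1 = c("a", "b"), V2 = 1:2, stringsAsFactors = FALSE)
Y4 <- data.frame(V1 = c("c", "d"), V2 = c(NA, 4L), stringsAsFactors = FALSE)

# Fill each column of Y4 from the same column of X4, one column at a time,
# so no value ever passes through a character matrix
for (j in seq_along(Y4)) {
  miss <- is.na(Y4[[j]])
  Y4[[j]][miss] <- X4[[j]][miss]
}
str(Y4)
```

Since an integer column is only ever assigned integer values, there is no type to restore afterward.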

Mike

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
