[R] maintaining variable types in data frames

Mike Miller Thu, 22 Jan 2009 09:39:40 -0800

Suppose X and Y are two data frames with the same structure, variable names, and dimensions but with different data and different patterns of missing values. I want to replace missing values in Y with the corresponding values from X. I'll construct a simple two-by-two case:

X <- as.data.frame(matrix(c("a","b",1,2),2,2), stringsAsFactors=FALSE)
X[,2] <- as.integer(X[,2])
str(X)

'data.frame':   2 obs. of  2 variables:
  $ V1: chr  "a" "b"
  $ V2: int  1 2

Y <- as.data.frame(matrix(c("c","d",NA,4),2,2), stringsAsFactors=FALSE)
Y[,2] <- as.integer(Y[,2])
str(Y)

'data.frame':   2 obs. of  2 variables:
  $ V1: chr  "c" "d"
  $ V2: int  NA 4

This seems to be what I want to do...

Y[is.na(Y)] <- X[is.na(Y)]

...and it works, except that the structure of Y is changed so that Y$V2 is now of type chr instead of type int:

str(Y)

'data.frame':   2 obs. of  2 variables:
  $ V1: chr  "c" "d"
  $ V2: chr  "1" "4"

This behavior makes sense because the vector X[is.na(Y)] is of the character type:

is.character(X[is.na(Y)])

[1] TRUE

str(X[is.na(Y)])

  chr "1"

X[is.na(Y)]

[1] "1"

The last couple of results seem weird at first. The "1" was originally an integer but now it is a character. This *must* be because the typing is done at an earlier stage in the process, back when R decides which elements of X have to be checked against the logical matrix is.na(Y). It then decides the type for the vector and only afterward does it find that only one of the four elements of X will be selected, but it was prepared from that early stage for any of the four, even all four of them, to be selected.
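One way to probe this reasoning: when a data frame is indexed by a logical matrix, R first converts the whole data frame with as.matrix(), and a matrix can hold only one type. The sketch below rebuilds X and the original Y so that it runs on its own; the name M is introduced here purely for illustration.

```r
# Rebuild the two example data frames from above.
X <- as.data.frame(matrix(c("a","b",1,2),2,2), stringsAsFactors=FALSE)
X[,2] <- as.integer(X[,2])
Y <- as.data.frame(matrix(c("c","d",NA,4),2,2), stringsAsFactors=FALSE)
Y[,2] <- as.integer(Y[,2])

# A matrix holds a single type, so mixing chr and int columns
# forces every element to character before anything is selected.
M <- as.matrix(X)
typeof(M)

# Indexing the data frame by a logical matrix gives the same
# result as indexing the already-coerced matrix.
identical(X[is.na(Y)], M[is.na(Y)])
```

If the two results are identical, the coercion to character happened when the whole of X was converted, not when the single element was picked out.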

Suppose there were no NA elements in Y. What should we expect to see if we repeat what we did above?

Y <- as.data.frame(matrix(c("c","d",3,4),2,2), stringsAsFactors=FALSE)
Y[,2] <- as.integer(Y[,2])
str(Y)

'data.frame':   2 obs. of  2 variables:
  $ V1: chr  "c" "d"
  $ V2: int  3 4

Even though there are no elements in X[is.na(Y)], the empty result is still of type chr:

is.vector(X[is.na(Y)])

[1] TRUE

is.character(X[is.na(Y)])

[1] TRUE

str(X[is.na(Y)])

  chr(0)

X[is.na(Y)]

character(0)

So what happens if we do this...

Y[is.na(Y)] <- X[is.na(Y)]


...will it change the structure of Y so that Y$V2 becomes type chr?

str(Y)

'data.frame':   2 obs. of  2 variables:
  $ V1: chr  "c" "d"
  $ V2: int  3 4

No. I think there is an obvious reason for that: Y was not changed, and more specifically, Y$V2 was not changed, so no change was made to the variable types.

It all makes sense, but I want an easy way to maintain the structure of a data frame when I do this kind of operation. I ought to be able to do something like this:


Ytypes <- get_types(Y)

Y[is.na(Y)] <- X[is.na(Y)]

use_types(Y, Ytypes)

That kind of system would ensure that the basic structure of the data frame can be maintained. I don't want to have to check by hand, and sometimes it would be impossible to do so.
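For illustration only, here is a minimal sketch of what such a pair of helpers might look like. get_types and use_types are hypothetical names taken from the pseudocode above, not existing R functions, and the sketch assumes every column is a plain atomic vector (character, integer, double, or logical); factors, dates, and other classed columns would need more care. Note that the result of use_types has to be assigned back to Y.

```r
# Hypothetical helper: record the storage type of each column.
get_types <- function(df) {
  vapply(df, typeof, character(1))
}

# Hypothetical helper: coerce each column back to its recorded type.
# Assumes plain atomic columns only.
use_types <- function(df, types) {
  for (j in seq_along(df)) {
    storage.mode(df[[j]]) <- types[[j]]
  }
  df
}

# Rebuild the example data frames.
X <- as.data.frame(matrix(c("a","b",1,2),2,2), stringsAsFactors=FALSE)
X[,2] <- as.integer(X[,2])
Y <- as.data.frame(matrix(c("c","d",NA,4),2,2), stringsAsFactors=FALSE)
Y[,2] <- as.integer(Y[,2])

Ytypes <- get_types(Y)            # remember the types before replacing
Y[is.na(Y)] <- X[is.na(Y)]        # V2 becomes chr here
Y <- use_types(Y, Ytypes)         # coerce V2 back to int
str(Y)
```

This restores the types after the fact rather than preventing the coercion. Replacing the missing values one column at a time, so that each column is only ever assigned values from the matching column of X, might avoid the coercion in the first place.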


So what's the trick?  Is there a trick?

Mike

