[datatable-help] Follow-up on subsetting data.table with NAs

Arunkumar Srinivasan Sun, 09 Jun 2013 14:09:16 -0700

Matthew, 

Regarding your recent answer here: http://stackoverflow.com/a/17008872/559784 
I had a few questions/thoughts, and I thought it might be more appropriate to 
share them here (even though I've already written 3 comments!).


1) First, you write that DT[ColA == ColB] is simpler than DF[!is.na(ColA) & 
!is.na(ColB) & ColA == ColB,].
However, that long expression can be written more briefly as: 
DF[which(DF$ColA == DF$ColB), ]
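
To make point 1 concrete, here is a minimal base-R sketch. The names DF, ColA and ColB follow the post; the values are invented for illustration. It compares the explicit `!is.na` form, the `which` form, and a bare logical index on a data.frame:

```r
# Toy data (assumed values, not from the original post)
DF <- data.frame(ColA = c(1, 2, NA, 3), ColB = c(1, NA, NA, 4))

# Explicit NA handling: NA comparisons are forced to FALSE
long  <- DF[!is.na(DF$ColA) & !is.na(DF$ColB) & DF$ColA == DF$ColB, ]

# which() returns only the positions that are TRUE, so NA is dropped
short <- DF[which(DF$ColA == DF$ColB), ]

# Bare logical index: each NA in the index yields a row of NAs
bare  <- DF[DF$ColA == DF$ColB, ]

nrow(long)   # 1
nrow(short)  # 1
nrow(bare)   # 3 (one matching row plus two all-NA rows)
```

On this toy input the first two forms select the same single row, while the bare comparison carries the two NA results through as NA rows.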

2) Second, you mention that the motivation is not just convenience but speed. 
By checking:

require(data.table)
set.seed(45)
df <- as.data.frame(matrix(sample(c(1,2,3,NA), 2e6, replace=TRUE), ncol=2))
dt <- data.table(df)
system.time(dt[V1 == V2])
# 0.077 seconds
system.time(df[!is.na(df$V1) & !is.na(df$V2) & df$V1 == df$V2, ])
# 0.252 seconds
system.time(df[which(df$V1 == df$V2), ])
# 0.038 seconds

We see that using `which` on the data.frame not only removes the NAs but is 
also faster than `DT[V1 == V2]`. In fact, `DT[which(V1 == V2)]` is faster than 
`DT[V1 == V2]`. I suspect this is because of the snippet below in 
`[.data.table`:

        if (is.logical(i)) {
            if (identical(i,NA)) i = NA_integer_  # see DT[NA] thread re recycling of NA logical
            else i[is.na(i)] = FALSE              # avoids DT[!is.na(ColA) & !is.na(ColB) & ColA==ColB], just DT[ColA==ColB]
        }


But at the end, `irows = which(i)` is computed anyway:

            if (is.logical(i)) {
                if (length(i)==nrow(x)) irows=which(i)   # e.g. DT[colA>3,which=TRUE]


And this "irows" is what's used to index the corresponding rows. So, is the 
replacement of `NA` with FALSE really necessary? I may very well have 
overlooked the purpose of replacing NA with FALSE in other scenarios, but just 
by looking at this case, it doesn't seem necessary, since `which` drops the NAs 
anyway when you fetch the index/row numbers later.
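
The reasoning in point 2 can be checked in isolation with base R. This is only a sketch of the two steps quoted above applied to a made-up logical vector `i`. It does not reproduce the rest of `[.data.table`, where the replacement may well matter for other code paths:

```r
# Hypothetical logical index containing NAs
i <- c(TRUE, NA, FALSE, TRUE, NA)

# Path A: call which() directly; which() ignores NA
rows_direct <- which(i)

# Path B: replace NA with FALSE first, then call which()
i_clean <- i
i_clean[is.na(i_clean)] <- FALSE
rows_clean <- which(i_clean)

identical(rows_direct, rows_clean)  # TRUE
rows_direct                         # 1 4
```

For a full-length logical `i`, both paths give the same row numbers, so in this one case the extra pass over `i` does not change the result.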

3) And finally, more of a philosophical point. If we agree that subsetting can 
be done conveniently (using "which") and with no loss of speed (again using 
"which"), then are there other reasons to change the default behaviour of R's 
philosophy of handling NAs as unknowns/missing observations? I find I can 
relate more to the native concept of handling NAs. For example:

x <- c(1,2,3,NA)
x != 3
# TRUE TRUE FALSE NA

makes more sense because `NA != 3` doesn't fall in either TRUE or FALSE, if NA 
is a missing observation/unknown data. The answer "unknown/missing" seems more 
appropriate, therefore.
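
A short base-R illustration of that point, reusing the `x` above: the comparison keeps NA as "unknown", and dropping it is an explicit opt-in via `which`:

```r
x <- c(1, 2, 3, NA)
cmp <- x != 3          # TRUE TRUE FALSE NA

kept_na <- x[cmp]          # NA in the index is carried through: 1 2 NA
dropped <- x[which(cmp)]   # which() keeps only the TRUE positions: 1 2

is.na(cmp[4])     # TRUE
length(kept_na)   # 3
length(dropped)   # 2
```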

I'd be interested in hearing, in addition to Matthew's, others' thoughts and 
inputs as well.

Best regards,

Arun

