Hi:

On Sun, Sep 12, 2010 at 11:39 AM, Anyi Zhu <anyi....@gmail.com> wrote:

> Hi all,
>
> I am just wondering if there is a more efficient way of merging two large
> datasets based on the values of multiple columns, some of which are not
> numerical.
>

As noted in your previous post, the variables comprising the key in
data.table
must be of *storage mode* integer, as expressed in the error message you
received. This can be checked with

storage.mode(obj)

In a data frame/data.table, you can check the storage mode of all variables
with
sapply(df, storage.mode)

Integers and factors are both of storage mode integer, so if you have a
character variable, convert it to a factor before applying setkey().

> x <- LETTERS[1:3]
> storage.mode(x)
[1] "character"
> x2 <- factor(x)
> storage.mode(x2)
[1] "integer"
You can always overwrite x if you wish.

The default merge function in dataframe is very inefficient and the merge
> function in data.table seems to be faster, but it does not seem to allow
> keys that are not numerical in nature.
>

Merging in data.table is faster because it uses binary scans on the key
variables in each data table rather than the vector scans used on
the by variables in data frames. data.table also allows a wider variety
of join operations than does merge(), so I would agree with you re speed.

However, keys need to have an integer *storage mode* - they do not have
to be integer per se, as the example above shows.

Here's a toy example to illustrate the point:

library(data.table)

# d is character in both d1 and d2
d1 <- data.frame(b = rep(1:3, each = 3), d = rep(LETTERS[1:3], 3),
                 v1 = rpois(9, 10), stringsAsFactors = FALSE)
d2 <- data.frame(b = rep(1:3, each = 3), d = rep(LETTERS[1:3], 3),
                 v2 = rpois(9, 5), stringsAsFactors = FALSE)

# Try to convert d1 to a data table with d as a key variable,
# where d is character
t1 <- data.table(d1, key = 'b, d')   # Ooops...
Error in setkey(value, b, d) :
  All keyed columns must be storage mode integer

str(d1)
'data.frame':   9 obs. of  3 variables:
 $ b : int  1 1 1 2 2 2 3 3 3
 $ d : chr  "A" "B" "C" "A" ...
 $ v1: num  9 12 8 10 9 9 11 10 10

# Convert d to factor in both data frames and try again:
d1$d <- factor(d1$d)
d2$d <- factor(d2$d)

t1 <- data.table(d1, key = 'b, d')
t2 <- data.table(d2, key = 'b, d')
tt<- data.table(merge(t1, t2))
> tt
      b d v1 v2
 [1,] 1 A  9  3
 [2,] 1 B 12  7
 [3,] 1 C  8  2
 [4,] 2 A 10  4
 [5,] 2 B  9  7
 [6,] 2 C  9  4
 [7,] 3 A 11  2
 [8,] 3 B 10  5
 [9,] 3 C 10  4

HTH,
Dennis


> Any other suggestion?
>
> Thanks a lot!
>
>        [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to