Re: [datatable-help] datatable-help Digest, Vol 17, Issue 10.

Joseph Voelkel Fri, 15 Jul 2011 07:50:48 -0700

Thanks, Dennis!

(Dennis is a former student of mine who clearly knows data.table much better 
than I do.)

Joe

-----Original Message-----
From: Dennis Murphy [mailto:[email protected]] 
Sent: Friday, July 15, 2011 10:34 AM
To: [email protected]; Joseph Voelkel
Subject: Re: datatable-help Digest, Vol 17, Issue 10

Hi:

<A bunch snipped because I get the archives in digest form>

Re Prof. Voelkel's recent posts:

(1) Quoting does not work well in data.table; this is mentioned in
several of the FAQs. Apropos to this discussion, some of the relevant
ones include 1.2, 1.6 and 2.1; there may be others :)

(2) Steve's response seems to be the right way to go (although see
below), but I thought I'd up the stakes a little and assume that Prof.
Voelkel has a large number of variables, only a subset of which he may
want summarized in a particular go. To that end, I created the
following toy data frame cum data.table; this is as much for my own
edification as anyone else's (which explains the eventual length of
this post...I got curious :)

This goes against the advice given in the first example of the
data.table wiki, but if you have, say, 100 variables to select out of
a possible 1000, it doesn't make sense to list them individually as
recommended on the wiki. (But see below...)

library('data.table')
set.seed(1043)
m <- matrix(rpois(240, 10), nrow = 6)
colnames(m) <- paste('A', 1:40, sep = '')
m <- as.data.frame(m)
dt2 <- data.table(x = rep(1:3, 2), y = rep(1:3, each = 2), m, key = 'x')
dim(dt2)
# [1]  6 42       ...so far, so good

# Subset of variables for which sums are desired
vars <- paste('A', c(1, 4, 10, 15, 31), sep = '')

# One approach: use the select = argument of subset() to restrict
# the variables under consideration:
dt2[, lapply(subset(.SD, select = vars), sum), by = 'x']
     x A1 A4 A10 A15 A31
[1,] 1 18 21  22  22  24
[2,] 2 20 13  27  23  21
[3,] 3 22 15  16  23  15

# Use the with = FALSE construct of data.table to do the same:
dt2[, lapply(.SD[, vars, with = FALSE], sum), by = 'x, y']
     x y A1 A4 A10 A15 A31
[1,] 1 1 11 13  12  11  16
[2,] 1 2  7  8  10  11   8
[3,] 2 1 10  4  16   7  11
[4,] 2 3 10  9  11  16  10
[5,] 3 2 11  8   7  11   7
[6,] 3 3 11  7   9  12   8

# For this example, it is the same (apart from the key variables) as
dt2[, vars, with = FALSE]

Not bad for this small example, but what happens in a much larger data table?

To find out, I created a 10000 x 1000 matrix that I converted into a
data table, added two grouping variables of 100 levels each and then
tried both approaches above again. Performance isn't bad when
summarizing over one variable, but there is a definite hit when two
variables are summarized. [It makes some sense since one is grouping
over 10000 level combinations rather than 100, but once again, keep
reading.] Curiously, it makes no difference if there is one key
variable or two, which made me wonder what the preferred approach is
in this circumstance.

m <- matrix(rpois(10000000, 10), nrow = 10000)
m <- as.data.table(m)
m <- transform(m, x = rep(1:100, each = 100), y = rep(1:100, 100))
setkey(m, 'x')
dim(m)
# [1] 10000  1002

# Randomly select 150 variables from the 1000
vars <- paste('A', sample(1:1000, 150, replace = FALSE), sep = '')
length(vars)
# [1] 150
key(m)
# [1] "x"
> system.time(m[, lapply(subset(.SD, select = vars), sum), by = 'x'])
   user  system elapsed
   0.75    0.00    0.75
> system.time(m[, lapply(.SD[, vars, with = FALSE], sum), by = 'x'])
   user  system elapsed
   0.64    0.00    0.64
> system.time(m[, lapply(subset(.SD, select = vars), sum), by = 'x, y'])
   user  system elapsed
  53.65    0.00   53.85
> system.time(m[, lapply(.SD[, vars, with = FALSE], sum), by = 'x, y'])
   user  system elapsed
  44.21    0.01   44.35

m2 <- data.table(m, key = 'x, y')
rm(m)
key(m2)
# [1] "x" "y"
> system.time(m2[, lapply(subset(.SD, select = vars), sum), by = 'x, y'])
   user  system elapsed
  53.54    0.00   53.73
> system.time(m2[, lapply(.SD[, vars, with = FALSE], sum), by = 'x, y'])
   user  system elapsed
  44.30    0.04   44.60

The first question in the wiki
(http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table) says
to use the columns directly rather than to rely on .SD. I wanted to
know how to pass new names to the summaries instead of overwriting the
original variable names. For the fun of it, I tried the following:

select <- sample(1:1000, 150, replace = FALSE)
vars <- paste('A', select, sep = '')
outvars <- paste('S', select, sep = '')

# Create a long expression of the form 'list(..., Sn = sum(An), ...)',
# n a subscript from 1 to 150.
expr <- paste('list(', paste(outvars, paste('sum(', vars, ')', sep =
''), sep = '=', collapse = ','),
               ')', sep = '')
u <- m2[, eval(parse(text = expr)), by = 'x']
> dim(u)
# [1] 100 151     seems reasonable...

This seemed to run rather fast, so I decided to time it:

> system.time(m2[, eval(parse(text = expr)), by = 'x'])
   user  system elapsed
   0.03    0.00    0.03
> system.time(m2[, eval(parse(text = expr)), by = 'x, y'])
   user  system elapsed
   1.05    0.00    1.04

I've got to admit, this is not the approach I would have taken
normally, is certainly not intuitively obvious to me and flouts the
usual advice to avoid the eval(parse(text = )) mantra, but the data
don't lie :)  Please tell me there's a more code-efficient way to do
this (the new variable names included), because my 'solution' was a
complete kludge and accidental kewpie prize.

Cheers,
Dennis

> Message: 1
> Date: Thu, 14 Jul 2011 16:36:11 -0400
> From: Joseph Voelkel <[email protected]>
> Subject: [datatable-help] Skipping some Vi names
> To: "[email protected]"
>        <[email protected]>
> Message-ID:
>        <[email protected]>
> Content-Type: text/plain; charset="us-ascii"
>
> I don't use data.table too much (though I probably should use it more...).
>
> I was surprised at the results below. It appears that the name V1 gets 
> assigned to the first result, but then the keys ("in the background") are 
> assigned the next set of Vi names, creating a gap in the names depending on 
> the number of keys. I would like to see the Vi names appear in their natural, 
> sequential, order. Not a show stopper, but it's annoying. (I have over 40 
> Vi's and it'd be good to have them numbered more rationally.) Thanks.
>
>> dt<-data.table(x=c(1,2,3,1,2,3),y=c(1,1,2,2,3,3),A1=1:6,A2=7:12,A3=13:18,key="x")
>> dt[,list("sum(A1),sum(A2),sum(A3)"),by="x"]
>     x V1 V3 V4
> [1,] 1  5 17 29
> [2,] 2  7 19 31
> [3,] 3  9 21 33
>> key(dt)<-c("x","y")
>> dt[,list("sum(A1),sum(A2),sum(A3)"),by="x,y"]
>     x y V1 V4 V5
> [1,] 1 1  1  7 13
> [2,] 1 2  4 10 16
> [3,] 2 1  2  8 14
> [4,] 2 3  5 11 17
> [5,] 3 2  3  9 15
> [6,] 3 3  6 12 18
>
>
>
> Joseph G. Voelkel, Ph.D.
> Professor, Center for Quality and Applied Statistics
> Kate Gleason College of Engineering
> Rochester Institute of Technology
> V 585-475-2231
> F 585-475-5959
> [email protected]
>
_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

Re: [datatable-help] datatable-help Digest, Vol 17, Issue 10.

Reply via email to