Re: [datatable-help] Summing over many variables

Joseph Voelkel Thu, 30 Dec 2010 12:00:23 -0800

Thanks, Matthew.

1. Yes, you have the subsetting of j on your faq.
2. The double eval appears to handle this subsetting. In my smaller problem, 
which I am working on first, I have about 55K records and 250 variables. Using 
either A1+A2+A3+A4+A5 or eval(eval()) takes about 0.22 sec of user time. So, 
from this indirect measure, the two are equally efficient.
3. By the way, for this example, I used 6 keys, which happened for this problem 
to correspond to 1 record for each unique key. That is, my output also 
contained 55K records. I originally solved the problem by using apply with sum 
on the five columns that contained the A1 through A5 value, e.g. 
apply(DT1[,11:15,with=FALSE],1,sum). This operation took about 0.62 sec. So, 
even here, data.table is 3x faster than apply.
4. Of course, no key is really needed here, so if I just want to return the sum 
along with the key vars, I can just use 
DT1[,list(key1,key2,key3,key4,key5,key6,sum=A1+A2+A3+A4+A5)] which runs in 
under 0.01 seconds.
5. Finally, I tried your idea of removing the quote and just trying the one 
eval(). It worked with a simple contrived example, but not for my more complex 
one--I have no idea why not, because the two seem analogous...


 

-----Original Message-----
From: Matthew Dowle [mailto:[email protected]] On Behalf Of Matthew 
Dowle
Sent: Tuesday, December 28, 2010 12:22 PM
To: Joseph Voelkel
Cc: [email protected]
Subject: Re: [datatable-help] Summing over many variables

Glad that works. Thanks for posting back. One thintg with that approach
is that data.table inspects the j expression to see which columns it
uses. It only subsets the ones that are used, for efficiency. There's a
faq on that I think. If the expression is wrapped up inside an eval I
think it still inspects the j but I can't quite remember. I'd be
surprised if that works with the double eval like that. If A runs from 1
to 100 in your real data and you're taking many sub-sums of 5, then this
could make a big difference. Try timing sum(A1) vs sum(A2+A3+A4+A5) with
and without the eval(eval()). That should reveal whether the j is being
inspected ok.
Also looking at it again, you shouldn't need the quote() inside the text
passed to parse. Then it's just a single eval and j inspection should be
ok I think i.e. DT1[,eval(ASumExpr),by=grp] rather than
DT1[,eval(eval(ASumExpr)),by=grp]

Matthew


On Mon, 2010-12-27 at 13:23 -0500, Joseph Voelkel wrote:
> I like Matthew's idea of flattening tables. But, as usual, I did not tell the 
> whole story in my first post. I will probably want to look at many 
> expressions, for example,
> 
> sum(A1+A2+A3+A4+A5)
> sum(A2+A3+A4+A5+A6)
> sum(A3+A4+A5+A6+A7)
> sum((A1+A2)/2 - (A3+A4)/2)
> 
> To be able to investigate a sequence of these easily, I found (after some 
> trial and error, and then thinking about it a bit more to try to make my 
> problem look like one from the datatable-faq) that this will do the trick:
> 
> library(data.table)
> 
> # create data table
> DT1<-data.table(A1=1:1000000,A2=1:1000000,A3=1:1000000,A4=1:1000000,A5=1:1000000,grp=rep(1:50000,each=20))
> setkey(DT1,grp)
> 
> # Say I want DT1[,sum(A1+A2+A3+A4+A5),by=grp]
> 
> # First, create expression of interest, and convert it to data-table-useful 
> form
> ASumExpr<-parse(text=paste("quote(sum(",paste("A",1:5,sep="",collapse="+"),"))",sep=""))
> # (Next few lines: to help me and maybe you see what this looks like...)
> ASumExpr
> str(ASumExpr)
> eval(ASumExpr)
> str(eval(ASumExpr))
> str(quote(mean(x))) # from example in datatable-faq.pdf. So eval(ASumExpr) 
> looks good
> 
> # long-hand typing method. OK for one or two, but not in general
> system.time(dt2a<-DT1[,sum(A1+A2+A3+A4+A5),by=grp])
> # formula method. This will be useful.
> system.time(dt2b<-DT1[,eval(eval(ASumExpr)),by=grp])
> 
> identical(dt2a, dt2b)
> 
> # Fast and easy to write. Just what I wanted. Thanks again for the ideas that 
> lead to this useful solution.
> 
> Joe V.
> 
> -----Original Message-----
> From: Matthew Dowle [mailto:[email protected]] On Behalf Of 
> Matthew Dowle
> Sent: Thursday, December 23, 2010 4:33 PM
> To: Joseph Voelkel
> Cc: [email protected]
> Subject: Re: [datatable-help] Summing over many variables
> 
> 
> Yes that's one way. We aren't that happy with using lapply in j as it
> loses the benefit of data.table.
> 
> I tend to 'flatten' tables like this. Try to have few columns. In this
> case it would be either a 3 column table (grp,colname,value) or maybe a
> 4 column table if you ever want to group by "A" or
> "B" (grp,letter,number,value). The query would then be
> DT[,sum(value),by=list(grp,letter,number)]. You can then do pattern
> matches and filters etc in the i rather than in the j e.g.
> DT[letter=="A",sum(value),by=group] for just the "A"s. The answer comes
> out in 'flat' format but you can always 'unflatten' the result to make
> it look pretty or easier to read.  [Note that I sinned by using '==' in
> the i just then invoking vector scan, so to avoid that for speed you
> would setkey(letter,group) then DT["A",sum(value),by=group]], or getting
> fancy if you only wanted some groups (say 1 and 3) then 'by without by'
> e.g. DT[list("A",c(1,3)),sum(value)].
> 
> 'flat' is a common way to use data.table to store higher dimensional
> data, and especially sparse higher dimensional data.
> 
> The 'grp.1' repetition is a problem I'd like to remove. It's related to
> this feature request (but is almost a bug). At the moment you have to
> remove the grp.1 afterwards.
> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=978&group_id=240&atid=978
> 
> Matthew
> 
> _______________________________________________
> datatable-help mailing list
> [email protected]
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help


_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

Re: [datatable-help] Summing over many variables

Reply via email to