Hi:

This should give you some idea of what Steve is talking about:

library(data.table)
dt <- data.table(x = sample(100000, 10000000, replace = TRUE),
                  y = rnorm(10000000), key = "x")
dt[, .N, by = x]
system.time(dt[, .N, by = x])

...on my system, dual core 8Gb RAM running Win7 64-bit,
> system.time(dt[, .N, by = x])
   user  system elapsed
   0.12    0.02    0.14

.N is a special built-in symbol in data.table that gives the number of rows
in each group, and grouping on it is optimized internally. Much faster than
aggregate(). It might take a little longer on your data because you have
more columns that take up space, but you get the idea. Grouping is also
about 5-6 times faster if you set a key on the data.table than if you
don't.
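Applied to your case, a rough sketch of the data.table equivalent of your
aggregate() call might look like the following (I'm simulating a small
stand-in for your data.frame Z, with V1 and V2 as character columns the way
you described; adjust names and sizes to taste):

```r
library(data.table)

# Simulated stand-in for your data.frame Z: V1, V2 as characters (not factors)
set.seed(1)
Z <- data.frame(V1 = as.character(sample(letters, 1e6, replace = TRUE)),
                V2 = as.character(sample(1e5, 1e6, replace = TRUE)),
                stringsAsFactors = FALSE)

# Convert to a data.table and key on the grouping column for faster grouping
DT <- as.data.table(Z)
setkey(DT, V2)

# Equivalent of table(aggregate(Z$V1, FUN = length, by = list(id = Z$V2))$x):
# count rows per V2 group with .N, then tabulate the group sizes
res <- table(DT[, .N, by = V2]$N)
```

On a data set your size this should finish in seconds rather than the half
hour (and counting) that aggregate() is taking.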

Dennis

On Fri, Sep 14, 2012 at 12:26 PM, Sam Steingold <s...@gnu.org> wrote:

> I have a large data.frame Z (2,424,185,944 bytes, 10,256,441 rows, 17
> columns).
> I want to get the result of
> table(aggregate(Z$V1, FUN = length, by = list(id=Z$V2))$x)
> alas, aggregate has been running for ~30 minute, RSS is 14G, VIRT is
> 24.3G, and no end in sight.
> both V1 and V2 are characters (not factors).
> Is there anything I could do to speed this up?
> Thanks.
>
> --
> Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X
> 11.0.11103000
> http://www.childpsy.net/ http://www.PetitionOnline.com/tap12009/
> http://dhimmi.com http://think-israel.org http://iris.org.il
> WinWord 6.0 UNinstall: Not enough disk space to uninstall WinWord
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


