Dear Dennis, cast() is in this case much faster.
> system.time(bigtab <- ddply(big, .(study, subject, cycle, day), function(x) xtabs(obs ~ type, data = x))) user system elapsed 35.36 0.12 35.53 > system.time(bigtab2 <- cast(data = big, study + subject + cycle + day ~type, value = "obs", fun = mean)) user system elapsed 4.09 0.00 4.09 I have the feeling that ddply() has a lot of overhead when the number of levels is large. HTH, Thierry ------------------------------------------------------------------------ ---- ir. Thierry Onkelinx Instituut voor natuur- en bosonderzoek team Biometrie & Kwaliteitszorg Gaverstraat 4 9500 Geraardsbergen Belgium Research Institute for Nature and Forest team Biometrics & Quality Assurance Gaverstraat 4 9500 Geraardsbergen Belgium tel. + 32 54/436 185 thierry.onkel...@inbo.be www.inbo.be To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher The plural of anecdote is not data. ~ Roger Brinner The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. ~ John Tukey > -----Oorspronkelijk bericht----- > Van: r-help-boun...@r-project.org > [mailto:r-help-boun...@r-project.org] Namens Dennis Murphy > Verzonden: donderdag 2 september 2010 11:34 > Aan: Coen van Hasselt > CC: r-help@r-project.org > Onderwerp: Re: [R] reshape to wide format takes extremely long > > Hi: > > I did the following test using function ddply() in the plyr > package on a toy data frame with 50000 observations using > five studies, 20 subjects per study, 25 cycles per subject, > five days per cycle and four observations by type per day. No > date-time variable was included. > > # Test data frame > big <- data.frame(study = factor(rep(1:5, each = 10000)), > subject = factor(rep(1:100, each = 500)), > cycle = rep(rep(1:25, each = 20), 100), > day = rep(rep(1:5, each = 4), 500), > type = rep(c('ALB', 'ALP', 'ALT', 'AST'), 12500), > obs = rpois(50000, 70) ) > > dim(big) > [1] 50000 6 > > # 64-bit R on a Windows 7 box with 8Gb RAM and a 2.93GHz Core > Duo chip. > system.time(bigtab <- ddply(big, .(study, subject, cycle, > day), function(x) xtabs(obs ~ type, data = x))) > user system elapsed > 30.22 0.02 30.60 > > > dim(bigtab) > [1] 12500 8 > > head(bigtab) > study subject cycle day ALB ALP ALT AST > 1 1 1 1 1 77 80 67 70 > 2 1 1 1 2 60 54 70 70 > 3 1 1 1 3 71 77 69 65 > 4 1 1 1 4 62 71 73 68 > 5 1 1 1 5 78 67 69 78 > 6 1 1 2 1 71 69 74 69 > > tail(bigtab) > study subject cycle day ALB ALP ALT AST > 12495 5 100 24 5 75 83 72 70 > 12496 5 100 25 1 85 52 62 70 > 12497 5 100 25 2 79 64 84 68 > 12498 5 100 25 3 67 65 74 81 > 12499 5 100 25 4 62 86 66 80 > 12500 5 100 25 5 58 76 85 84 > > There may be an easier/more efficient way to do this with > melt() and cast() in the reshape package, but moved on when I > couldn't figure it out within ten minutes (probably because I > was thinking 'xtabs of obs by type for > study/subject/cycle/day combinations - that's the ticket!' :) > Packages sqldf and data.table are other viable options for > this sort of task, and now that there is a test data set to > play with, it would be interesting to see what else can be > done. I'd be surprised if this couldn't be done within a few > seconds because the data frame is not that large. > > Anyway, HTH, > Dennis > > > > On Thu, Sep 2, 2010 at 12:24 AM, Coen van Hasselt > <coenvanhass...@gmail.com>wrote: > > > Hello, > > > > I have a data.frame with the following format: > > > > > head(clin2) > > Study Subject Type Obs Cycle Day Date Time > > 1 A001101 10108 ALB 44.00000 98 1 2004-03-11 14:26 > > 2 A001101 10108 ALP 95.00000 98 1 2004-03-11 14:26 > > 3 A001101 10108 ALT 61.00000 98 1 2004-03-11 14:26 > > 5 A001101 10108 AST 33.00000 98 1 2004-03-11 14:26 > > > > I want to transform this data.frame so that I have "Obs" > columns for > > each "Type". The full dataset is 45000 rows long. For a > short subset > > of 100 rows, reshaping takes 0.2 seconds, and produces what I want. > > All columns are either numeric or character format (incl. > date/time). > > > > > reshape(clin2, v.names="Obs", timevar="Type", > > > direction="wide",idvar=c("Study","Subject","Cycle","Day","Date > ","Time"),) > > Study Subject Cycle Day Date Time Obs.ALB > Obs.ALP Obs.ALT > > Obs.AST > > 1 A001101 10108 98 1 2004-03-11 14:26 44 > 95 61 > > 33 > > 11 A001101 10108 1 1 2004-03-12 14:01 41 > 85 39 > > 33 > > 21 A001101 10108 1 8 2004-03-22 10:34 40 > 90 70 > > 34 > > 30 A001101 10108 1 15 2004-03-29 09:56 45 > 97 66 > > 48 [........] > > > > However, when using the same reshape command for the full > data.frame > > of 45000 rows, it still wasn't finished when run overnight > (8 GB RAM + > > 8 GB swap in use). > > > > The time to process this data.frame from a 100-row subset to a > > 1000-row subset increases from 0.2 sec to 60 sec. > > > > I would greatly appreciate a advice why the time for reshaping is > > increasing exponentially with the nr. of rows, and how I > can do this > > in an elegant way. > > > > Thanks! > > > > Coen. > > > > ______________________________________________ > > R-help@r-project.org mailing list > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide > > http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. > > > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > Druk dit bericht a.u.b. niet onnodig af. Please do not print this message unnecessarily. Dit bericht en eventuele bijlagen geven enkel de visie van de schrijver weer en binden het INBO onder geen enkel beding, zolang dit bericht niet bevestigd is door een geldig ondertekend document. The views expressed in this message and any annex are purely those of the writer and may not be regarded as stating an official position of INBO, as long as the message is not confirmed by a duly signed document. ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.