Hi: I did the following test using function ddply() in the plyr package on a toy data frame with 50000 observations using five studies, 20 subjects per study, 25 cycles per subject, five days per cycle and four observations by type per day. No date-time variable was included.
# Test data frame big <- data.frame(study = factor(rep(1:5, each = 10000)), subject = factor(rep(1:100, each = 500)), cycle = rep(rep(1:25, each = 20), 100), day = rep(rep(1:5, each = 4), 500), type = rep(c('ALB', 'ALP', 'ALT', 'AST'), 12500), obs = rpois(50000, 70) ) > dim(big) [1] 50000 6 # 64-bit R on a Windows 7 box with 8Gb RAM and a 2.93GHz Core Duo chip. system.time(bigtab <- ddply(big, .(study, subject, cycle, day), function(x) xtabs(obs ~ type, data = x))) user system elapsed 30.22 0.02 30.60 > dim(bigtab) [1] 12500 8 > head(bigtab) study subject cycle day ALB ALP ALT AST 1 1 1 1 1 77 80 67 70 2 1 1 1 2 60 54 70 70 3 1 1 1 3 71 77 69 65 4 1 1 1 4 62 71 73 68 5 1 1 1 5 78 67 69 78 6 1 1 2 1 71 69 74 69 > tail(bigtab) study subject cycle day ALB ALP ALT AST 12495 5 100 24 5 75 83 72 70 12496 5 100 25 1 85 52 62 70 12497 5 100 25 2 79 64 84 68 12498 5 100 25 3 67 65 74 81 12499 5 100 25 4 62 86 66 80 12500 5 100 25 5 58 76 85 84 There may be an easier/more efficient way to do this with melt() and cast() in the reshape package, but moved on when I couldn't figure it out within ten minutes (probably because I was thinking 'xtabs of obs by type for study/subject/cycle/day combinations - that's the ticket!' :) Packages sqldf and data.table are other viable options for this sort of task, and now that there is a test data set to play with, it would be interesting to see what else can be done. I'd be surprised if this couldn't be done within a few seconds because the data frame is not that large. Anyway, HTH, Dennis On Thu, Sep 2, 2010 at 12:24 AM, Coen van Hasselt <coenvanhass...@gmail.com>wrote: > Hello, > > I have a data.frame with the following format: > > > head(clin2) > Study Subject Type Obs Cycle Day Date Time > 1 A001101 10108 ALB 44.00000 98 1 2004-03-11 14:26 > 2 A001101 10108 ALP 95.00000 98 1 2004-03-11 14:26 > 3 A001101 10108 ALT 61.00000 98 1 2004-03-11 14:26 > 5 A001101 10108 AST 33.00000 98 1 2004-03-11 14:26 > > I want to transform this data.frame so that I have "Obs" columns for > each "Type". The full dataset is 45000 rows long. For a short subset > of 100 rows, reshaping takes 0.2 seconds, and produces what I want. > All columns are either numeric or character format (incl. date/time). > > > reshape(clin2, v.names="Obs", timevar="Type", > direction="wide",idvar=c("Study","Subject","Cycle","Day","Date","Time"),) > Study Subject Cycle Day Date Time Obs.ALB Obs.ALP Obs.ALT > Obs.AST > 1 A001101 10108 98 1 2004-03-11 14:26 44 95 61 > 33 > 11 A001101 10108 1 1 2004-03-12 14:01 41 85 39 > 33 > 21 A001101 10108 1 8 2004-03-22 10:34 40 90 70 > 34 > 30 A001101 10108 1 15 2004-03-29 09:56 45 97 66 > 48 [........] > > However, when using the same reshape command for the full data.frame > of 45000 rows, it still wasn't finished when run overnight (8 GB RAM + > 8 GB swap in use). > > The time to process this data.frame from a 100-row subset to a > 1000-row subset increases from 0.2 sec to 60 sec. > > I would greatly appreciate a advice why the time for reshaping is > increasing exponentially with the nr. of rows, and how I can do this > in an elegant way. > > Thanks! > > Coen. > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.