Thank you very much indeed Bogdan!

> a2[duplicated(a2$mdate),]
    value2      mdate
318      0 2006-05-10
322      0 2006-05-13
324      0 2006-05-14
326      0 2006-05-15
328      0 2006-05-16
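The doubling described in Bogdan's reply (quoted below) can be reproduced on made-up data. This is only a minimal sketch: the helpers make_df() and dedup() are hypothetical names, and the toy values have nothing to do with the real files. One date appears twice in every data frame, so the rows for that date double with each merge(); dropping duplicated dates first keeps one row per date.

```r
# Toy data only: "2006-05-10" appears twice in every data frame,
# mimicking a duplicated mdate row. make_df() is a hypothetical helper.
make_df <- function(i) {
  df <- data.frame(
    mdate = as.Date(c("2006-05-09", "2006-05-10", "2006-05-10")),
    value = c(1, 0, 0)
  )
  names(df)[2] <- paste("value", i, sep = "")
  df
}

# Repeated merging: the duplicated date contributes 2, 4, 8, 16, ... rows.
atot <- make_df(1)
rows <- nrow(atot)
for (i in 2:6) {
  atot <- merge(atot, make_df(i), by = "mdate", all = TRUE)
  rows <- c(rows, nrow(atot))
}
print(rows)  # 3 5 9 17 33 65

# Hypothetical helper: keep the first row for each date (an assumption;
# which duplicate to keep depends on the data).
dedup <- function(df) df[!duplicated(df$mdate), , drop = FALSE]

clean <- dedup(make_df(1))
for (i in 2:6) {
  clean <- merge(clean, dedup(make_df(i)), by = "mdate", all = TRUE)
}
print(nrow(clean))  # 2
```

Whether keeping the first duplicate is right depends on why the dates repeat in the source files, so the dedup() step is a stopgap until the root cause is found.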
What a relief to know what is causing this problem... now to sort out
the root cause!

cheers and thanks again!
Sean

On 22/05/06, bogdan romocea <[EMAIL PROTECTED]> wrote:
> Repeated merge()-ing does not always increase the space requirements
> linearly. Keep in mind that a join between two tables where the same
> value appears M and N times will produce M*N rows for that particular
> value. My guess is that the number of rows in atot explodes because
> you have some duplicate values in your files (having the same
> duplicate date in each data frame would cause atot to contain 4, then
> 8, 16, 32, 64... rows for that date).
>
>
> > -----Original Message-----
> > From: [EMAIL PROTECTED]
> > [mailto:[EMAIL PROTECTED] On Behalf Of Sean O'Riordain
> > Sent: Monday, May 22, 2006 10:12 AM
> > To: r-help
> > Subject: [R] win2k memory problem with merge()'ing repeatedly
> > (long email)
> >
> > Good afternoon,
> >
> > I have 63 small .csv files which I process daily, and until two
> > weeks ago they processed just fine, only took a matter of moments
> > and had no noticeable memory problem. Two weeks ago they reached
> > 318 lines and my script "broke". There are some missing values in
> > some of the files. I have tried hard many times over the last two
> > weeks to create a "small" repeatable example to give you but I've
> > failed - unless I use my data it works fine... :-(
> >
> > Am I missing something obvious? (again)
> >
> > A typical file has lines which look like:
> > 01/06/2005,1372
> >
> > Though there are three files which have two values (files 3,32,33) and
> > these have lines which look like...
> > 01/06/2005,1766,
> > or
> > 15/05/2006,289,114
> >
> > a1 <- read.csv("file1.csv",header=F)
> > etc...
> > a63 <- read.csv("file63.csv",header=F)
> > names(a1) <- c("mdate","file1.column.description")
> >
> > atot <- merge(a1,a2,all=T)
> >
> > followed by repeatedly doing...
> > atot <- merge(atot, a3,all=T)
> > atot <- merge(atot, a4,all=T)
> > etc...
> >
> > I normally start R with --vanilla.
> >
> > What appears to happen is that atot doubles in size each iteration and
> > just falls over due to lack of memory at about i=17... even though the
> > total memory required for all of these individual a1...a63 is only
> > 1001384 bytes (doing an object.size() on a1..a63).
> > At this point I've been trying to pin down this problem for two weeks
> > and I just gave up...
> >
> > The following works fine as I'd expect with minimal memory usage...
> >
> > for (i in 3:67) {
> >   datelist <- as.Date(start.date)+0:(count-1)
> >   #remove a couple of elements...
> >   datelist <- datelist[-(floor(runif(nacount)*count))]
> >   a2 <- as.data.frame(datelist)
> >   names(a2) <- "mdate"
> >   vname <- paste("value", i, sep="")
> >   a2[vname] <- runif(length(datelist))
> >   #a2[floor(runif(nacount)*count), vname] <- NA
> >
> >   # atot <- merge(atot,a2,all=T)
> >   i <- 2
> >   a.eval.text <- paste("merge(atot, a", i, ", all=T)", sep="")
> >   cat("a.eval.text is: -", a.eval.text, "-\n", sep="")
> >   atot <- eval(parse(text=a.eval.text))
> >
> >   cat("i:", i, " ", gc(), "\n")
> > }
> >
> > this works fine... but on my files (as per attached 'lastsave.txt'
> > file) it just gobbles memory.
> > Am I doing something wrong? I (wrongly?) expected that repeatedly
> > merge(atot,aN) would only increase the memory requirement linearly
> > (with jumps perhaps as we go through a 2^n boundary)... which is what
> > happens when merging simulated data.frames as above... no problem at
> > all and it's really fast...
> >
> > The attached text file shows a (slightly edited) session where the
> > memory required by the merge() operation just doubles with each use...
> > and I can only allow it to run until i=17!!!
> >
> > I've even run it with gctorture() set on... with similar, but
> > excruciatingly slow results...
> >
> > Is there any relevant info that I'm missing? Unfortunately I am not
> > able to post the contents of the files to a public list like this...
> >
> > As per a previous thread, I know that I can use a list to handle these
> > dataframes - but I had difficulty with the syntax of a list of
> > dataframes...
> >
> > I'd like to know why the memory requirements for this merge
> > just explode...
> >
> > cheers, (and thanks in advance!)
> > Sean O'Riordain
> >
> > ==============================
> > > version
> >                _
> > platform       i386-pc-mingw32
> > arch           i386
> > os             mingw32
> > system         i386, mingw32
> > status         Patched
> > major          2
> > minor          3.0
> > year           2006
> > month          05
> > day            09
> > svn rev        38014
> > language       R
> > version.string Version 2.3.0 Patched (2006-05-09 r38014)
> > >
> > Running on Win2k with 1Gb ram.
> >
> > I also tried it (with the same results) on 2.2.1 and 2.3.0.
> >
> > ========================================================
> >
> > R : Copyright 2006, The R Foundation for Statistical Computing
> > Version 2.3.0 Patched (2006-05-09 r38014)
> > ISBN 3-900051-07-0
> >
> > R is free software and comes with ABSOLUTELY NO WARRANTY.
> > You are welcome to redistribute it under certain conditions.
> > Type 'license()' or 'licence()' for distribution details.
> >
> > Natural language support but running in an English locale
> >
> > R is a collaborative project with many contributors.
> > Type 'contributors()' for more information and
> > 'citation()' on how to cite R or R packages in publications.
> >
> > Type 'demo()' for some demos, 'help()' for on-line help, or
> > 'help.start()' for an HTML browser interface to help.
> > Type 'q()' to quit R.
> >
> > > gc()
> >          used (Mb) gc trigger (Mb) max used (Mb)
> > Ncells 178186  4.8     407500 10.9   350000  9.4
> > Vcells  73112  0.6     786432  6.0   333585  2.6
> > > # take the information in the .csv files created from the emails
> > > setwd("C:/Documents and Settings/c_oriordain_s/My Documents/pasip/mms/mms_emails")
> > >
> > > # the input file from Amdocs (as supplied by revenue assurance)
> > > amdocs_csv_filename <- "amdocs_volumes_revised4.csv"
> > > # where shall we put the output plot file
> > > copypath <- "\\\\ient1dfs001\\general\\Process Improvement Projects\\Process Improvement Projects Repository\\Active Projects\\MMS\\01 Measure\\"
> > >
> > > # set to F (false) instead of T (true) if you're just tricking around and you don't
> > > # want to be copying over files to the network drive all the time!
> > > do.copy <- F
> > >
> > > # HOPEFULLY you shouldn't have to trick around with stuff below here!
> > > #
> >
> > # EDIT file names changed to protect the innocent... :-)
> >
> > > a1 <-read.csv("file1.csv",header=F)
> > #EDIT etc... all the way to
> > > a63 <-read.csv("file63.csv", header=F)
> > >
> > > # now delete the now irrelevant initial date column for all 63 of these temporary objects...
> > > for (i in 1:63) {
> > + # e.g. should look like a63$mdate <- as.Date(a63$V1,format="%d/%m/%Y")
> > + anum <- paste("a",i,sep="")
> > + eval(parse(text= paste(anum, "$mdate <- as.Date(" ,anum, "$V1,format=\"%d/%m/%Y\")",sep="") ))
> > + }
> > >
> > >
> > > # three files have three columns...
> >
> > #EDIT here again... to protect the innocent...
> >
> > > names(a3)[3] <- "2nd.column.name.in.file.3"
> > > names(a32)[3] <- "2nd.column.name.in.file.32"
> > > names(a33)[3] <- "2nd.column.name.in.file.33"
> > >
> > > # the rest only have two columns...
> > >
> > > names(a1)[2] <- "title.1"
> > #EDIT
> > > names(a63)[2] <- "title.63"
> > >
> > > for (i in 1:63) {
> > + # now delete the now irrelevant initial date column for all 63 of these temporary objects...
> > + # e.g. should look like a33[1] <- NULL
> > + eval(parse(text=paste("a",i,"[1] <- NULL",sep="")))
> > + }
> > >
> > > a.object.sizes <- vector()
> > > for (i in 1:63) {
> > + # now delete these 63 temporary objects...
> > + # e.g. should look like rm(a33)
> > + a.name <- paste("a", i, sep="")
> > + # a.object.sizes[i] <- object.size(a.name)
> > + a.object.sizes[i] <- eval(parse(text=paste("object.size(",a.name,")", sep="")))
> > + }
> > >
> > > a.object.sizes
> >  [1] 17988 17996 19524 17996 17996 18004 17996 18028 17988 17988 17996 17996 17996 18012 18012 17988 17980 18004 18004
> > [20] 18012 19348 19316 19340 17996 18004 18004 18012 18004 19228 19228 18012 19436 19436 19244 19220 17996 17900 17900
> > [39] 17884 17884 17884 17884 17884 17884 17876 17988 17900 17892  8808 17988  8792  8800  8800  8792  8800  8784 17980
> > [58] 17988 17980  9832  9728  9728  9728
> > >
> > > # merge these tables into one big dataframe...
> > > atot <- merge(a1, a2, all=T)
> > > for (i in 3:17) {
> > + # construct the text to be evaluated...
> > + #atot <- merge(atot, a3, all=T)
> > + cat("The size of object a", i, " is ", a.object.sizes[i], "\n", sep="")
> > + cat("The current size of atot is ", object.size(atot), "\n")
> > + a.eval.text <- paste("merge(atot, a", i, ", all=T)", sep="")
> > + cat("a.eval.text is: -", a.eval.text, "-\n", sep="")
> > + atot <- eval(parse(text=a.eval.text))
> > + cat("i is:", i, gc(), "\n\n")
> > + }
> > The size of object a3 is 19524
> > The current size of atot is 19988
> > a.eval.text is: -merge(atot, a3, all=T)-
> > i is: 3 206289 137020 5.6 1.1 407500 786432 10.9 6 362507 786425 9.7 6
> >
> > The size of object a4 is 17996
> > The current size of atot is 24300
> > a.eval.text is: -merge(atot, a4, all=T)-
> > i is: 4 206330 137402 5.6 1.1 407500 786432 10.9 6 362507 786425 9.7 6
> >
> > The size of object a5 is 17996
> > The current size of atot is 28564
> > a.eval.text is: -merge(atot, a5, all=T)-
> > i is: 5 206411 138044 5.6 1.1 407500 786432 10.9 6 362507 786425 9.7 6
> >
> > The size of object a6 is 18004
> > The current size of atot is 36044
> > a.eval.text is: -merge(atot, a6, all=T)-
> > i is: 6 206572 139246 5.6 1.1 407500 786432 10.9 6 362507 786425 9.7 6
> >
> > The size of object a7 is 17996
> > The current size of atot is 50236
> > a.eval.text is: -merge(atot, a7, all=T)-
> > i is: 7 206893 141652 5.6 1.1 407500 786432 10.9 6 362507 786425 9.7 6
> >
> > The size of object a8 is 18028
> > The current size of atot is 78516
> > a.eval.text is: -merge(atot, a8, all=T)-
> > i is: 8 207534 146614 5.6 1.2 407500 786432 10.9 6 362507 786425 9.7 6
> >
> > The size of object a9 is 17988
> > The current size of atot is 136252
> > a.eval.text is: -merge(atot, a9, all=T)-
> > i is: 9 208815 157016 5.6 1.2 407500 786432 10.9 6 362507 786425 9.7 6
> >
> > The size of object a10 is 17988
> > The current size of atot is 255404
> > a.eval.text is: -merge(atot, a10, all=T)-
> > i is: 10 211376 178938 5.7 1.4 407500 786432 10.9 6 362507 786425 9.7 6
> >
> > The size of object a11 is 17996
> > The current size of atot is 502540
> > a.eval.text is: -merge(atot, a11, all=T)-
> > i is: 11 216497 225184 5.8 1.8 467875 889825 12.5 6.8 362507 888747 9.7 6.8
> >
> > The size of object a12 is 17996
> > The current size of atot is 1015940
> > a.eval.text is: -merge(atot, a12, all=T)-
> > i is: 12 226738 322626 6.1 2.5 531268 1577138 14.2 12.1 362507 1569929 9.7 12
> >
> > The size of object a13 is 17996
> > The current size of atot is 2082284
> > a.eval.text is: -merge(atot, a13, all=T)-
> > i is: 13 247219 527588 6.7 4.1 597831 2209110 16 16.9 362507 2749247 9.7 21
> >
> > The size of object a14 is 18012
> > The current size of atot is 4295524
> > a.eval.text is: -merge(atot, a14, all=T)-
> > i is: 14 288180 957830 7.7 7.4 741108 4242831 19.8 32.4 494389 5296330 13.3 40.5
> >
> > The size of object a15 is 18012
> > The current size of atot is 8884444
> > a.eval.text is: -merge(atot, a15, all=T)-
> > i is: 15 370101 1859128 9.9 14.2 1073225 8314706 28.7 63.5 781279 10388430 20.9 79.3
> >
> > The size of object a16 is 17988
> > The current size of atot is 18388580
> > a.eval.text is: -merge(atot, a16, all=T)-
> > i is: 16 533942 3743450 14.3 28.6 1590760 17263040 42.5 131.8 1354559 21430459 36.2 163.6
> >
> > The size of object a17 is 17980
> > The current size of atot is 38050756
> > a.eval.text is: -merge(atot, a17, all=T)-
> > i is: 17 861623 7675772 23.1 58.6 3094291 35309607 82.7 269.4 2501382 44137010 66.8 336.8
> >
> > ______________________________________________
> > R-help@stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide!
> > http://www.R-project.org/posting-guide.html
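
On the list-of-data-frames syntax mentioned in the original post: a minimal sketch, assuming a current R where Reduce() is available in base, and using toy data in place of the real a1..a63. The names dfs and has_dups are hypothetical. It keeps the data frames in one list, checks each for duplicated dates before merging, and folds merge() over the list rather than building the call with eval(parse()).

```r
# Toy stand-ins for the real data frames: unique dates, one value column each.
dates <- as.Date("2006-05-01") + 0:4
dfs <- lapply(1:4, function(i) {
  df <- data.frame(mdate = dates, value = i * seq_along(dates))
  names(df)[2] <- paste("value", i, sep = "")
  df
})

# Guard: stop before merging if any data frame repeats a date,
# since repeated keys multiply rows in the merged result.
has_dups <- sapply(dfs, function(df) any(duplicated(df$mdate)))
stopifnot(!any(has_dups))

# Fold merge() over the list; all = TRUE keeps dates missing from some files.
atot <- Reduce(function(x, y) merge(x, y, by = "mdate", all = TRUE), dfs)
print(dim(atot))  # 5 5
```

With real files, the list could presumably be built with something like lapply() over a vector of file names and read.csv(), but that part is not shown here because the file layout is only partly described in the thread.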