Gopi Goswami wrote:
> Hi there,
>
>
> Problem ::
> When one tries to change one or some of the columns of a data.frame, R makes
> a copy of the whole data.frame using the '*tmp*' mechanism (this does not
> happen for components of a list, tracemem( ) on R-2.6.2 says so).
>
>
> Suggested solution ::
> Store the columns of the data.frame as a list inside of an environment slot
> of an S4 class, and define the '[', '[<-' etc. operators using setMethod( )
> and setReplaceMethod( ).
>
>
> Question ::
> This implementation will violate copy on modify principle of R (since
> environments are not copied), but will save a lot of memory. Do you see any
> other obvious problem(s) with the idea? Have you seen a related setup
> implemented / considered before (apart from the packages like filehash, ff,
> and database related ones for saving memory)?
>
>

A short --- although crass --- reply is that you should not meddle with
this until you know _exactly_ what you are doing....
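
To make the copy-on-modify concern raised in the question concrete, here
is a minimal sketch in R. The helper names make_env_frame and set_col are
hypothetical, invented for this illustration, and are not part of the
posted code. It contrasts an ordinary data frame with a column list kept
inside an environment:

```r
## 1. Ordinary data frame: pass-by-value semantics are preserved.
df1 <- data.frame(xx = c(1, 2, 3), yy = c(4, 5, 6))
df2 <- df1                 # plain assignment; no column data is modified yet
df2$xx[1] <- 100           # modifying df2 leaves df1 untouched
df1$xx[1]                  # still 1

## 2. Columns stored as a list inside an environment (hypothetical helpers).
make_env_frame <- function(df) {
  store <- new.env(hash = TRUE)
  assign("data", as.list(df), envir = store)
  store
}

set_col <- function(store, name, value) {
  cols <- get("data", envir = store)
  cols[[name]] <- value
  assign("data", cols, envir = store)
  invisible(store)
}

ee1 <- make_env_frame(df1)
ee2 <- ee1                 # copies only the reference to the environment
set_col(ee2, "xx", c(100, 2, 3))
get("data", envir = ee1)$xx[1]   # 100: the change is visible through ee1 too
```

The second half is where the unhappiness tends to come from: a function
that receives ee1 and "modifies its own copy" silently changes the
caller's data. On a build with memory profiling enabled, calling
tracemem(df1) before the assignments should show where duplication
actually happens for the ordinary data frame.
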
Two main points are that (a) copying a data frame in principle only
copies pointers to each variable, until the actual contents are modified,
and (b) breaking copy-on-modify semantics (and consequently, in effect,
also pass-by-value semantics) is a source of unhappiness.

R does duplicate rather more than it needs to, and the main reason
probably lies in its rudimentary reference tracking (the NAMED entry in
the object header structure). Some of us do wish we could try to fix this
at some point, but it would be a major undertaking. (There are a zillion
places where we'd need to do extra housekeeping rather than let the
garbage collector tidy up after us. Also, reference-counting solutions
from other computer languages do not apply because R can have circular
references.)

> Implementation code snippet ::
> ### The S4 class.
> setClass('DataFrame',
>           representation(data = 'data.frame', nrow = 'numeric',
>                          ncol = 'numeric', store = 'environment'),
>           prototype(data = data.frame( ), nrow = 0, ncol = 0))
>
> setMethod('initialize', 'DataFrame', function(.Object) {
>     .Object <- callNextMethod( )
>     .Object@store <- new.env(hash = TRUE)
>     assign('data', as.list(.Object@data), .Object@store)
>     .Object@nrow <- nrow(.Object@data)
>     .Object@ncol <- ncol(.Object@data)
>     .Object@data <- data.frame( )
>     .Object
> })
>
>
> ### Usage:
> nn <- 10
> ## dd1 below could possibly be created by read.table or scan and data.frame
> dd1 <- data.frame(xx = rnorm(nn), yy = rnorm(nn))
> dd2 <- new('DataFrame', data = dd1)
> rm(dd1)
> ## Now work with dd2
>
>
> Thanks a lot,
> Gopi Goswami.
> PhD, Statistics, 2005
> http://gopi-goswami.net/index.html
>
> 	[[alternative HTML version deleted]]
>
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - ([EMAIL PROTECTED])                  FAX: (+45) 35327907