On Mon, 14 Apr 2008, Gopi Goswami wrote:

> Dear All,
>
> Thanks a lot for your helpful comments (e.g., NAMED, ExpressionSet,
> DNAStringSet).
>
> Observations and questions ::
>
> ooo For a data.frame dd and a list ll with the same contents to begin with,
> the following operations show a significant difference in the maximum memory
> usage column of the gc( ) output on R-2.6.2 (the detailed code is in the PS
> section below).
>
> ll$xx <- zz
> dd$xx <- zz
>
> My understanding is that the '$<-.data.frame' S3 method above makes a copy
> of the whole dd first (using '*tmp*'). But for a list this is avoided due to
> the use of SET_VECTOR_ELT at the C-level. Is this a valid explanation, or
> is something deeper happening behind the scenes?

Something deeper -- see the 'R Internals' manual. '$<-' is primitive -- its
methods are not. For the list the copy *may* be avoided, if 'll' is the only
reference to that object and R has never thought there might be another.

> ooo I'll look into the read-only flag idea to avoid unhappy circumstances
> that might arise while bypassing the copy-on-modify principle. Any pointers
> or code snippets as to how to implement this idea?
>
> ooo The main reason I want to bypass copy-on-modify is that I want to
> emulate a Python-like behavior for lists (and data.frame), in the sense
> that I want to take the responsibility of making a deep copy if need be,
> but most of the time I want to knowingly change 'things in place' using the
> proposed S4 class DataFrame.
>
> Regards,
> Gopi Goswami.
> PhD, Statistics, 2005
> http://gopi-goswami.net/index.html
>
> PS:
>
> zz <- seq_len(1000000)
> gc( )
> dd <- data.frame(xx = zz)
> dd$yy <- zz
> gc( )
> object.size(dd)
>
> ######################################################################
>
> zz <- seq_len(1000000)
> gc( )
> ll <- list(xx = zz)
> ll$yy <- zz
> gc( )
> object.size(ll)
>
> On Mon, Apr 14, 2008 at 10:18 AM, Tony Plate <[EMAIL PROTECTED]> wrote:
>
>> Gopi Goswami wrote:
>>
>>> Hi there,
>>>
>>> Problem ::
>>> When one tries to change one or some of the columns of a data.frame, R
>>> makes a copy of the whole data.frame using the '*tmp*' mechanism (this
>>> does not happen for components of a list, tracemem( ) on R-2.6.2 says so).
>>>
>>> Suggested solution ::
>>> Store the columns of the data.frame as a list inside of an environment
>>> slot of an S4 class, and define the '[', '[<-' etc. operators using
>>> setMethod( ) and setReplaceMethod( ).
>>>
>>> Question ::
>>> This implementation will violate the copy-on-modify principle of R (since
>>> environments are not copied), but will save a lot of memory. Do you see
>>> any other obvious problem(s) with the idea?
>>>
>> Well, because it violates the copy-on-modify principle it can potentially
>> break code that depends on this principle. I don't know how much there is
>> -- did you try to see if R and recommended packages will pass checks with
>> this change in place?
>>
>>> Have you seen a related setup
>>> implemented / considered before (apart from the packages like filehash,
>>> ff, and database related ones for saving memory)?
>>>
>> I've frequently used a personal package that stores array data in a file
>> (like ff). It works fine, and I partially get around the problem of
>> violating the copy-on-modify principle by having a readonly flag in the
>> object -- when the flag is set to allow modification I have to be careful,
>> but after I set it to readonly I can use it more freely with the knowledge
>> that if some function does attempt to modify the object, it will stop with
>> an error.
>>
>> In this particular case, why not just track down why data frame
>> modification is copying the entire object and suggest a change so that it
>> just copies the column being changed? (should be possible if list
>> modification doesn't copy all components).
>>
>> -- Tony Plate
>>
>>>
>>> Implementation code snippet ::
>>> ### The S4 class.
>>> setClass('DataFrame',
>>>          representation(data = 'data.frame', nrow = 'numeric',
>>>                         ncol = 'numeric', store = 'environment'),
>>>          prototype(data = data.frame( ), nrow = 0, ncol = 0))
>>>
>>> setMethod('initialize', 'DataFrame', function(.Object) {
>>>     .Object <- callNextMethod( )
>>>     .Object@store <- new.env(hash = TRUE)
>>>     assign('data', as.list(.Object@data), .Object@store)
>>>     .Object@nrow <- nrow(.Object@data)
>>>     .Object@ncol <- ncol(.Object@data)
>>>     .Object@data <- data.frame( )
>>>     .Object
>>> })
>>>
>>> ### Usage:
>>> nn <- 10
>>> ## dd1 below could possibly be created by read.table or scan and
>>> ## data.frame
>>> dd1 <- data.frame(xx = rnorm(nn), yy = rnorm(nn))
>>> dd2 <- new('DataFrame', data = dd1)
>>> rm(dd1)
>>> ## Now work with dd2
>>>
>>> Thanks a lot,
>>> Gopi Goswami.
>>> PhD, Statistics, 2005
>>> http://gopi-goswami.net/index.html

--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
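
The read-only flag idea raised in the thread could be sketched roughly as below. This is only an illustration under stated assumptions: the helper names make_store, set_column and lock_store are hypothetical and do not come from any package or from the posters' own code. The columns live as a list inside an environment, so every reference sees the same data, and a flag stored alongside them makes later modification attempts stop with an error.

```r
## Hypothetical helpers (not from any package): an environment-backed
## column store with a read-only flag.
make_store <- function(df) {
    store <- new.env(hash = TRUE)
    store$data <- as.list(df)    # columns kept as a plain list in the environment
    store$readonly <- FALSE      # modification allowed until locked
    store
}

set_column <- function(store, name, value) {
    if (store$readonly) stop("store is read-only")
    store$data[[name]] <- value  # changes the shared environment in place
    invisible(store)
}

lock_store <- function(store) {
    store$readonly <- TRUE
    invisible(store)
}

s <- make_store(data.frame(xx = 1:3))
alias <- s                       # environments are not copied on assignment
set_column(s, "yy", c(10, 20, 30))
names(alias$data)                # the change is visible through 'alias' too

lock_store(s)
## Any further modification attempt now stops with an error.
res <- tryCatch(set_column(alias, "zz", 1:3),
                error = function(e) conditionMessage(e))
res
```

Whether R still copies the list of columns internally on each update depends on the R version and on how many references exist; tracemem( ) on the individual columns is one way to check.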