I'm having trouble using split() on a very large data set where the factor to split on has ~1400 levels. Unfortunately, I can't reproduce the problem with the simple self-contained example below. As you can see, splitting the artificial data frame of ~13 MB produces a split object of ~144 MB, a roughly 11-fold increase in memory. If split() scales linearly, my actual 52 MB data frame should be easily handled by my 12 GB of RAM, but it is not. Instead, when I try to split selectSubAct.df on one of its factors with 1473 levels, my memory is slowly gobbled up (plus 3 GB of swap) until I cancel the operation.
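To make the linear-scaling estimate concrete, here is a rough sketch that uses only the object sizes reported in the example below. The variable names are purely illustrative, and the calculation assumes that split() really does scale linearly with input size, which is exactly the assumption in question.

```r
# Sizes in bytes, copied from the object.size() output in this post
size_df     <- 12860880    # object.size(myDataFrame)
size_split  <- 144524992   # object.size(myDataFrame.split)
size_actual <- 52348272    # object.size(selectSubAct.df)

# Growth factor observed on the artificial data frame
growth <- size_split / size_df

# Expected size of the split result for the real data frame,
# ASSUMING the same growth factor applies (linear scaling)
est_split_bytes <- size_actual * growth
est_split_mb    <- est_split_bytes / 1e6

cat(sprintf("growth factor: %.1f\n", growth))
cat(sprintf("estimated split size: %.0f MB\n", est_split_mb))
```

Under that assumption the split result should be on the order of 600 MB, far below 12 GB, so the real data set evidently does not follow the same growth factor.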
Any ideas on what might be happening?

Thanks,
Mark

myDataFrame <- data.frame(matrix(LETTERS, ncol = 7, nrow = 399000))
mySplitVar <- factor(as.character(1:1400))
myDataFrame <- cbind(myDataFrame, mySplitVar)
object.size(myDataFrame)
## 12860880 bytes # ~ 13MB
myDataFrame.split <- split(myDataFrame, myDataFrame$mySplitVar)
object.size(myDataFrame.split)
## 144524992 bytes # ~ 144MB
object.size(selectSubAct.df)
## 52,348,272 bytes # ~ 52MB

> sessionInfo()
R version 2.10.0 Patched (2009-10-27 r50222)
x86_64-unknown-linux-gnu

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

loaded via a namespace (and not attached):
[1] tools_2.10.0

Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine