Re: [R] problem with split eating giga-bytes of memory

Mark Kimpel Tue, 08 Dec 2009 19:55:24 -0800

Hadley, just as you were apparently writing, I had the same thought and did
exactly what you suggested: I converted every column except the one I want
to split on to character. The split then ran almost instantaneously, without
any problem. Thanks! Mark
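
For anyone finding this thread later, here is a minimal sketch of that
conversion. The toy data frame and the names toy, toy_chr and to_convert are
made up for illustration and are much smaller than the data discussed in this
thread; the exact sizes reported will vary with R version and platform.

```r
# Hypothetical toy data (not the data from this thread).
set.seed(1)
n <- 20000
toy <- data.frame(
  a = factor(paste("The rain in Spain", sample(1000, n, replace = TRUE), sep = ".")),
  b = factor(paste("Rainy days and Mondays", sample(1000, n, replace = TRUE), sep = ".")),
  g = factor(sample(200, n, replace = TRUE))   # the column to split on
)

# Convert every factor column except the split column to character.
toy_chr <- toy
is_fac <- vapply(toy_chr, is.factor, logical(1))
to_convert <- setdiff(names(toy_chr)[is_fac], "g")
toy_chr[to_convert] <- lapply(toy_chr[to_convert], as.character)

# Split both versions on the same factor.
split_fac <- split(toy, toy$g)
split_chr <- split(toy_chr, toy_chr$g)

# Compare the reported sizes of the two split objects.
print(object.size(split_fac), units = "MB")
print(object.size(split_chr), units = "MB")
```

Only the split column stays a factor, so each piece no longer carries the
full set of levels of every other column.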


Mark W. Kimpel MD  ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine

15032 Hunter Court, Westfield, IN  46074

(317) 490-5129 Work, & Mobile & VoiceMail
(317) 399-1219 Skype No Voicemail please


On Tue, Dec 8, 2009 at 10:48 PM, hadley wickham <h.wick...@gmail.com> wrote:

> Hi Mark,
>
> Why are you using factors?  I think for this case you might find
> characters are faster and more space efficient.
>
> Alternatively, you can have a look at the plyr package which uses some
> tricks to keep memory usage down.
>
> Hadley
>
> On Tue, Dec 8, 2009 at 9:46 PM, Mark Kimpel <mwkim...@gmail.com> wrote:
> > Charles, I suspect you are correct regarding copying of the attributes.
> > First off, selectSubAct.df is my "real" data, which turns out to be of the
> > same dim() as myDataFrame below, but each column is made up of strings, not
> > simple letters, and there are many levels in each column, which I did not
> > properly duplicate in my first example. I have amended that below, and with
> > the split the new object size is now not 10X the size of the original but
> > 100X. My "real" data is even more complex than this, so I suspect that is
> > where the problem lies. I need to search for a better solution to my problem
> > than split, for which I will start a separate thread if I can't figure
> > something out.
> >
> > Thanks for pointing me in the right direction,
> >
> > Mark
> >
> > myDataFrame <- data.frame(matrix(paste("The rain in Spain",
> >     as.character(1:1400), sep = "."), ncol = 7, nrow = 399000))
> > mySplitVar <- factor(paste("Rainy days and Mondays",
> >     as.character(1:1400), sep = "."))
> > myDataFrame <- cbind(myDataFrame, mySplitVar)
> > object.size(myDataFrame)
> > ## 12860880 bytes # ~ 13MB
> > myDataFrame.split <- split(myDataFrame, myDataFrame$mySplitVar)
> > object.size(myDataFrame.split)
> > ## 1,274,929,792 bytes ~ 1.2GB
> > object.size(selectSubAct.df)
> > ## 52,348,272 bytes # ~ 52MB
> >
> >
> > On Tue, Dec 8, 2009 at 10:22 PM, Charles C. Berry <cbe...@tajo.ucsd.edu> wrote:
> >
> >> On Tue, 8 Dec 2009, Mark Kimpel wrote:
> >>
> >>> I'm having trouble using split on a very large data-set with ~1400 levels
> >>> of the factor to be split. Unfortunately, I can't reproduce it with the
> >>> simple self-contained example below. As you can see, splitting the
> >>> artificial dataframe of size ~13MB results in a split dataframe of ~144MB,
> >>> an increase in memory allocation of ~10 fold for the split object. If split
> >>> scales linearly, then my actual 52MB dataframe should be easily handled by
> >>> my 12GB of RAM, but it is not. Instead, when I try to split selectSubAct.df
> >>> on one of its factors with 1473 levels, my memory is slowly gobbled up
> >>> (plus 3 GB of swap) until I cancel the operation.
> >>>
> >>> Any ideas on what might be happening? Thanks, Mark
> >>>
> >>
> >> Each element of myDataFrame.split contains a copy of the attributes of the
> >> parent data.frame.
> >>
> >> And probably it does scale linearly. But the scaling factor depends on the
> >> size of the attributes that get copied, I guess.
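
A small sketch of the effect described above, using a made-up toy data frame
(the names toy, pieces and trimmed are hypothetical). It shows that each piece
returned by split() keeps every level of its factor columns, however few rows
it holds. Note that droplevels() is not available in the R 2.10.0 used in this
thread; it was added to base R in a later release, so this part assumes a
newer R.

```r
# Hypothetical toy data: one factor column with many levels, one split column.
toy <- data.frame(
  x = factor(paste("level", 1:1000, sep = ".")),
  g = factor(rep(1:10, each = 100))
)
pieces <- split(toy, toy$g)

# Each piece holds only 100 rows, yet its factor column keeps all 1000 levels.
nrow(pieces[[1]])        # 100
nlevels(pieces[[1]]$x)   # 1000

# Dropping unused levels afterwards shrinks the levels attribute of each piece.
trimmed <- lapply(pieces, droplevels)
nlevels(trimmed[[1]]$x)  # 100
```

The copied levels attribute is what multiplies with the number of pieces, so
the overhead grows with both the number of split groups and the number of
levels per column.
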
> >>
> >>
> >>
> >>
> >>> myDataFrame <- data.frame(matrix(LETTERS, ncol = 7, nrow = 399000))
> >>> mySplitVar <- factor(as.character(1:1400))
> >>> myDataFrame <- cbind(myDataFrame, mySplitVar)
> >>> object.size(myDataFrame)
> >>> ## 12860880 bytes # ~ 13MB
> >>> myDataFrame.split <- split(myDataFrame, myDataFrame$mySplitVar)
> >>> object.size(myDataFrame.split)
> >>> ## 144524992 bytes # ~ 144MB
> >>>
> >>
> >> Note:
> >>
> >> only.attr <- lapply(myDataFrame.split, function(x) sapply(x, attributes))
> >> (object.size(myDataFrame.split) - object.size(myDataFrame)) / object.size(only.attr)
> >> ## 1.03726179240978 bytes
> >>
> >>
> >>>
> >>
> >>> object.size(selectSubAct.df)
> >>> ## 52,348,272 bytes # ~ 52MB
> >>>
> >>
> >> What was this??
> >>
> >>
> >> Chuck
> >>
> >>
> >>> sessionInfo()
> >>> R version 2.10.0 Patched (2009-10-27 r50222)
> >>> x86_64-unknown-linux-gnu
> >>>
> >>> locale:
> >>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> >>> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> >>> [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
> >>> [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
> >>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
> >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> >>>
> >>> attached base packages:
> >>> [1] stats     graphics  grDevices datasets  utils     methods   base
> >>>
> >>> loaded via a namespace (and not attached):
> >>> [1] tools_2.10.0
> >>>
> >>>
> >>>        [[alternative HTML version deleted]]
> >>>
> >>>
> >>> ______________________________________________
> >>> R-help@r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide
> >>> http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
> >>>
> >>>
> >> Charles C. Berry                            (858) 534-2098
> >>                                             Dept of Family/Preventive Medicine
> >> E mailto:cbe...@tajo.ucsd.edu               UC San Diego
> >> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
> >>
> >>
> >>
> >
> >
>
>
>
> --
> http://had.co.nz/
>
