I'm sure others with more experience will answer this, but for what it is worth, 
my experience suggests that memory problems are more often caused by how the 
data are handled than by the machine. I don't use Linux, so I can't comment 
specifically on the capacity of your machine. However, R often seems to need a 
copy of an object in memory while it is creating a new version of it. If a 
single data.frame can reach 1.4Gb, that would not leave much headroom if an 
original and a copy had to exist at the same time. (I am speculating that this 
is the case rather than asserting it.)
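
One way to check whether that is what is happening is to look at how big the 
scanned list actually is and what R has allocated. Something along these lines 
would do it (untested; the file name and the what= template are only 
placeholders):

x <- scan("mydata.txt", what = list(id = 0, value = 0, label = ""))
object.size(x) / 1024^2   # approximate size of the scanned list in megabytes
gc()                      # garbage collect and report memory currently in use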

From a practical point of view, I assume that when you say you have 600 
features you are not going to use every one of them in the models you generate. 
So would it be practical to limit the features to those you actually need 
before creating the data.frame?
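
If it is, something like the following (untested; the column names are made up) 
keeps only the wanted components of the scanned list before the conversion, so 
as.data.frame() has far less to build and copy:

# 'full' stands for the list returned by scan(); the column names are
# purely illustrative.
wanted <- c("age", "income", "response")   # the features you actually model with
small  <- full[wanted]                     # still an ordinary list
dat    <- as.data.frame(small)             # a much smaller object to convert
rm(full, small); gc()                      # release the intermediate copies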

In short, if you really do need to work this way, I suggest reading the 
frequent posts on memory issues until you are either fully conversant with how 
memory works on the machine you have or you have found one of the many 
suggested workarounds, such as keeping the data in a database and pulling in 
only what you need with SQL. A query for "large dataset" on Jonathan Baron's 
search site gave over 400 hits: http://finzi.psych.upenn.edu/nmz.html
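
As a rough sketch of the database route (RSQLite is used here purely as an 
example; the table and column names are invented), you keep the full data 
outside R and select only the columns you need:

# Assumes the data have already been loaded into an SQLite table called
# "features"; any DBI-compliant database would work the same way.
library(DBI)
library(RSQLite)
con <- dbConnect(RSQLite::SQLite(), "mydata.sqlite")
dat <- dbGetQuery(con, "SELECT age, income, response FROM features")
dbDisconnect(con)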

Tom

> -----Original Message-----
> From: Nawaaz Ahmed [mailto:[EMAIL PROTECTED]
> Sent: Friday, 4 February 2005 2:40 PM
> To: R-help@stat.math.ethz.ch
> Cc: [EMAIL PROTECTED]
> Subject: [R] Handling large data sets via scan()
> 
> 
> I'm trying to read in datasets with roughly 150,000 rows and 600
> features. I wrote a function using scan() to read it in (I have a 4GB
> Linux machine) and it works like a charm. Unfortunately, converting the
> scanned list into a data.frame using as.data.frame() causes the memory
> usage to explode (it can go from 300MB for the scanned list to 1.4GB for
> a data.frame of 30000 rows) and it fails claiming it cannot allocate
> memory (though it is still not close to the 3GB limit per process on my
> Linux box - the message is "unable to allocate vector of size 522K").
> 
> So I have three questions --
> 
> 1) Why is it failing even though there seems to be enough memory
> available?
> 
> 2) Why is converting it into a data.frame causing the memory usage to
> explode? Am I using as.data.frame() wrongly? Should I be using some
> other command?
> 
> 3) All the model fitting packages seem to want to use data.frames as
> their input. If I cannot convert my list into a data.frame what can I
> do? Is there any way of getting around this?
> 
> Much thanks!
> Nawaaz
> 

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
