Since some of the conversation has treated the 30-second mark as an arbitrary benchmark, I would also chime in that there is an assumption that any non-R issues which affect one's ability to use R productively should be ignored. In the real world we can't always control everything about our environment, so if there are improvements that help mitigate that reality, I would welcome them.
As a little test I broke the rules of my organisation and actually put a dataset on my C: drive. Not unexpectedly, performance improved vastly: what would normally (at home) be a 10-second load becomes a 40-second load in the corporate environment. I have found the conversation helpful, and it appears there are opportunities for improvement that would help in my production environment.

The other aside is that I have no Unix-like tools, not because they don't exist, but because the environment I work in does not allow me to use them. That is not sufficient reason for me to bleat about it; it just is, and by and large I just get on with it. My point is that while I accept these issues are peripheral to R, they do affect the usability of R.

I'm sure there are people working with large datasets in R (the SPSS datasets I regularly interact with vary between 97MB and 200MB). It could be finger trouble on my part, but I find I have to subset them before I can read them into R. If I thought I could usefully convert these datasets into something R could pick and choose from without hitting the out-of-memory problem, I would be very happy. In the meantime my lack of expertise has left me with a workable, albeit clumsy, process. I will continue to champion R in my organisation, but the present score is SPSS-50, SAS-149, R-1. All the really creative charts, however, come from only one engine in this place.

> system.time(load("P:/.../0203Mapdata.rdata"))
[1]  9.79  0.97 37.45    NA    NA
> system.time(load("C:/TEMP/0203Mapdata.rdata"))
[1] 10.07  0.18 10.49    NA    NA
> version
         _
platform i386-pc-mingw32
arch     i386
os       mingw32
system   i386, mingw32
status
major    1
minor    7.1
year     2003
month    06
day      16
language R

_________________________________________________
Tom Mulholland
Senior Policy Officer
WA Country Health Service
Tel: (08) 9222 4062

-----Original Message-----
From: Murray Jorgensen [mailto:[EMAIL PROTECTED]]
Sent: Monday, 25 August 2003 5:16 PM
To: Prof Brian Ripley
Cc: R-help
Subject: Re: [R] R tools for large files

At 08:12 25/08/2003 +0100, Prof Brian Ripley wrote:
>I think that is only a medium-sized file.

"Large" for my purposes means "more than I really want to read into memory", which in turn means "takes more than 30s". I'm at home now and the file isn't, so I'm not sure whether the file is large or not. More responses interspersed below.

BTW, I forgot to mention that I'm using Windows and so do not have nice Unix tools readily available.

>On Mon, 25 Aug 2003, Murray Jorgensen wrote:
>
>> I'm wondering if anyone has written some functions or code for handling
>> very large files in R. I am working with a data file that is 41
>> variables times who knows how many observations making up 27MB altogether.
>>
>> The sort of thing that I am thinking of having R do is
>>
>> - count the number of lines in a file
>
>You can do that without reading the file into memory: use
>system(paste("wc -l", filename))

Don't think that I can do that in Windows XP.

>or read in blocks of lines via a connection

But that does sound promising!
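For what it's worth, a rough, untested sketch of counting lines block by block via a connection, so the whole file never sits in memory (the function name count_lines and the block size of 10000 are only illustrative, not something from the thread):

count_lines <- function(filename, block = 10000) {
    con <- file(filename, open = "r")   # open a read-only text connection
    on.exit(close(con))
    n <- 0
    repeat {
        chunk <- readLines(con, n = block)  # read at most `block` lines at a time
        if (length(chunk) == 0) break       # character(0) signals end of file
        n <- n + length(chunk)
    }
    n
}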
>
>> - form a data frame by selecting all cases whose line numbers are in a
>> supplied vector (which could be used to extract random subfiles of
>> particular sizes)
>
>R should handle that easily in today's memory sizes. Buy some more RAM if
>you don't already have 1/2Gb. As others have said, for a really large file,
>use an RDBMS to do the selection for you.

It's just that R is so good at reading in initial segments of a file that I can't believe it can't be effective at reading more general (pre-specified) subsets.

Murray

>
>--
>Brian D. Ripley,                  [EMAIL PROTECTED]
>Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
>University of Oxford,             Tel: +44 1865 272861 (self)
>1 South Parks Road,                    +44 1865 272866 (PA)
>Oxford OX1 3TG, UK                Fax: +44 1865 272595
>

Dr Murray Jorgensen      http://www.stats.waikato.ac.nz/Staff/maj.html
Department of Statistics, University of Waikato, Hamilton, New Zealand
Email: [EMAIL PROTECTED]                           Fax 7 838 4155
Phone +64 7 838 4773 wk   +64 7 849 6486 home   Mobile 021 1395 862
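For the pre-specified-subset case, one rough, untested sketch along the same connection-based lines: scan past the file in blocks, keep only the wanted line numbers, and hand the result to read.table. The helper read_selected, its arguments, and the example file name are illustrative only, not anything proposed in the thread.

read_selected <- function(filename, rows, header = TRUE, sep = "", block = 10000) {
    ## `rows` are data-line numbers, counted after any header line
    con <- file(filename, open = "r")
    on.exit(close(con))
    hdr <- if (header) readLines(con, n = 1) else character(0)
    rows <- sort(rows)
    kept <- character(0)
    seen <- 0
    repeat {
        chunk <- readLines(con, n = block)
        if (length(chunk) == 0) break
        hit <- rows[rows > seen & rows <= seen + length(chunk)]  # wanted lines in this block
        if (length(hit)) kept <- c(kept, chunk[hit - seen])
        seen <- seen + length(chunk)
    }
    read.table(textConnection(c(hdr, kept)), header = header, sep = sep)
}

## e.g. a random 50-row subfile from the first 250,000 data lines (file name made up):
## sub <- read_selected("bigfile.dat", sample(250000, 50))

The whole file still gets read once, but only the selected lines are ever held in memory, which is the trade-off an RDBMS index would avoid.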