Re: [R] reading very large files

2007-02-04 Thread juli g. pausas
Hi all, The small modification was replacing Write.Rows <- Chunk[Chunk.Sel - Cuts[i], ] # (2nd line from the end) by Write.Rows <- Chunk[Chunk.Sel - Cuts[i]] # Chunk has one dimension only. Running times: - For the Jim Holtman solution (reading once, using diff and skipping from one record to
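For context, a minimal sketch of why that comma matters (illustrative data, not the thread's actual file): readLines() returns a plain character vector, which takes a single subscript, whereas the two-subscript form only works on objects with dimensions such as matrices or data frames.

    # Illustrative sketch, not code from the thread: a character vector
    # (as returned by readLines) must be indexed with one subscript.
    Chunk <- c("row 1", "row 2", "row 3", "row 4")   # stand-in for a chunk of lines
    idx   <- c(2, 4)                                  # rows to keep within the chunk
    Chunk[idx]        # works: one-dimensional indexing
    # Chunk[idx, ]    # error: "incorrect number of dimensions" on a vector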

Re: [R] reading very large files

2007-02-04 Thread juli g. pausas
Thanks so much for your help and comments. The approach proposed by Jim Holtman was the simplest and fastest. The approach by Marc Schwartz also worked (after a very small modification). It is clear that a good knowledge of R saves a lot of time!! I've been able to do in a few minutes a process that

Re: [R] reading very large files

2007-02-03 Thread Marc Schwartz
On Sat, 2007-02-03 at 19:06 +0100, juli g. pausas wrote: Thanks so much for your help and comments. The approach proposed by Jim Holtman was the simplest and fastest. The approach by Marc Schwartz also worked (after a very small modification). It is clear that a good knowledge of R saves a

Re: [R] reading very large files

2007-02-02 Thread Henrik Bengtsson
Hi. General idea: 1. Open your file as a connection, i.e. con <- file(pathname, open="r") 2. Generate a row to (file offset, row length) map of your text file, i.e. numeric vectors 'fileOffsets' and 'rowLengths'. Use readBin() for this. You build this up as you go by reading the file in chunks
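A rough sketch of how such an offset map could be built with readBin() (the file name, chunk size, and bookkeeping are assumptions, not Henrik's actual code):

    # Sketch only: scan the file in raw chunks and record where each row starts.
    pathname <- "myfile.txt"                 # assumed path
    con <- file(pathname, open = "rb")
    chunkSize   <- 1e6                       # bytes per readBin() call
    fileOffsets <- 0                         # byte offset of row 1
    pos <- 0
    repeat {
      bfr <- readBin(con, what = "raw", n = chunkSize)
      if (length(bfr) == 0) break
      nl <- which(bfr == as.raw(10))           # newline bytes in this chunk
      fileOffsets <- c(fileOffsets, pos + nl)  # each newline starts the next row
      pos <- pos + length(bfr)
    }
    close(con)
    fileOffsets <- fileOffsets[fileOffsets < pos]   # drop trailing entry if file ends in '\n'
    rowLengths  <- diff(c(fileOffsets, pos))        # length of each row in bytes

Selected rows could then be fetched by seek()-ing to fileOffsets[i] on a reopened connection and reading rowLengths[i] bytes.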

Re: [R] reading very large files

2007-02-02 Thread Henrik Bengtsson
Forgot to say, in your script you're reading the rows unordered, meaning you're jumping around in the file and there is no way the hardware or the file caching system can optimize that. I'm pretty sure you would see a substantial speedup if you did: sel <- sort(sel); /H On 2/2/07, Henrik
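In code, the suggestion amounts to something like this (the sizes are placeholders taken from the problem description):

    # Sort the sampled row numbers so the file is read front to back,
    # letting the OS read-ahead and file cache help.
    sel <- sample(900000, 3000)   # e.g. 3000 rows out of 900,000
    sel <- sort(sel)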

Re: [R] reading very large files

2007-02-02 Thread Marc Schwartz
On Fri, 2007-02-02 at 18:40 +0100, juli g. pausas wrote: Hi all, I have a large file (1.8 GB) with 900,000 lines that I would like to read. Each line is a character string. Specifically I would like to randomly select 3000 lines. For smaller files, what I'm doing is: trs <- scan(myfile,
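The in-memory approach described for smaller files looks roughly like this (the scan() arguments and file name are assumptions; the quoted call is cut off in the preview):

    # Read the whole file into memory, then sample 3000 lines -- fine for
    # small files, impractical for a 1.8 GB file.
    trs <- scan("myfile.txt", what = "character", sep = "\n")
    sel <- sample(length(trs), 3000)
    trs3000 <- trs[sel]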

Re: [R] reading very large files

2007-02-02 Thread jim holtman
I had a file with 200,000 lines in it and it took 1 second to select 3000 sample lines out of it. One of the things is to use a connection so that the file stays open and then just 'skip' to the next record to read: input <- file("/tempxx.txt", "r") sel <- 3000 remaining <- 20 # get the
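The quoted code is truncated, but the idea is roughly the following (a sketch, not Jim's exact script; the line count comes from the 200,000-line test file mentioned above):

    # Keep one connection open and only move forward, skipping the lines
    # that were not sampled.
    input     <- file("/tempxx.txt", "r")
    remaining <- 200000                      # lines in the test file
    sel       <- 3000                        # sample size
    rows  <- sort(sample(remaining, sel))    # rows to keep, in file order
    skips <- diff(c(0, rows)) - 1            # lines to discard before each kept row
    sampled <- character(sel)
    for (i in seq_len(sel)) {
      if (skips[i] > 0) readLines(input, n = skips[i])   # skip unwanted lines
      sampled[i] <- readLines(input, n = 1)              # keep this one
    }
    close(input)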

Re: [R] reading very large files

2007-02-02 Thread Prof Brian Ripley
I suspect that reading from a connection in chunks of, say, 10,000 rows and discarding those you do not want would be simpler and at least as quick. Not least because seek() on Windows is so unreliable. On Fri, 2 Feb 2007, Henrik Bengtsson wrote: Hi. General idea: 1. Open your file as a
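A sketch of that chunked variant (the chunk size comes from the message; the file name and line counts are placeholders):

    # Read 10,000 lines at a time and keep only the sampled ones from each chunk.
    con       <- file("myfile.txt", open = "r")
    chunkSize <- 10000
    want      <- sort(sample(900000, 3000))   # sampled line numbers
    kept   <- character(0)
    lineNo <- 0
    repeat {
      chunk <- readLines(con, n = chunkSize)
      if (length(chunk) == 0) break
      hits <- want[want > lineNo & want <= lineNo + length(chunk)]
      kept <- c(kept, chunk[hits - lineNo])
      lineNo <- lineNo + length(chunk)
    }
    close(con)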

Re: [R] reading very large files

2007-02-02 Thread Marc Schwartz
On Fri, 2007-02-02 at 12:32 -0600, Marc Schwartz wrote: On Fri, 2007-02-02 at 18:40 +0100, juli g. pausas wrote: Hi all, I have a large file (1.8 GB) with 900,000 lines that I would like to read. Each line is a character string. Specifically I would like to randomly select 3000 lines.

Re: [R] reading very large files

2007-02-02 Thread Marc Schwartz
On Fri, 2007-02-02 at 12:42 -0600, Marc Schwartz wrote: On Fri, 2007-02-02 at 12:32 -0600, Marc Schwartz wrote: Juli, I don't have a file to test this on, so caveat emptor. The problem with the approach above is that you are re-reading the source file, once per line, or 3000
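The full script is not reproduced in this excerpt; what follows is a rough reconstruction of the chunked pattern under discussion, reusing the variable names quoted earlier in the thread (a sketch, not Marc's actual code):

    # Read the file in fixed-size chunks and pull the sampled rows out of each
    # chunk, so the file is traversed only once instead of once per sampled line.
    con    <- file("myfile.txt", open = "r")        # assumed file name
    nLines <- 900000
    sel    <- sort(sample(nLines, 3000))            # sampled line numbers
    Cuts   <- seq(0, nLines, by = 10000)            # chunk boundaries
    out    <- character(0)
    for (i in seq_len(length(Cuts) - 1)) {
      Chunk     <- readLines(con, n = Cuts[i + 1] - Cuts[i])
      Chunk.Sel <- sel[sel > Cuts[i] & sel <= Cuts[i + 1]]
      if (length(Chunk.Sel) > 0) {
        Write.Rows <- Chunk[Chunk.Sel - Cuts[i]]    # one subscript: Chunk is a vector (juli's fix)
        out <- c(out, Write.Rows)
      }
    }
    close(con)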