Hi all,
The small modification was replacing

Write.Rows <- Chunk[Chunk.Sel - Cuts[i], ]  # (2nd line from the end)

by

Write.Rows <- Chunk[Chunk.Sel - Cuts[i]]  # Chunk has one dimension only
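For anyone hitting the same error: readLines() and scan() return a plain character vector, so matrix-style indexing with a trailing comma fails with "incorrect number of dimensions". A minimal illustration (hypothetical data, not the original script):

Chunk <- c("line 1", "line 2", "line 3")  # what readLines() returns: a vector
# Chunk[2, ]  # error: incorrect number of dimensions
Chunk[2]      # "line 2" -- correct indexing for a one-dimensional object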
Running times:
- For the Jim Holtman solution (reading once, using diff and skipping from one record to the next): ...
Thanks so much for your help and comments.
The approach proposed by Jim Holtman was the simplest and fastest. The
approach by Marc Schwartz also worked (after a very small
modification).
It is clear that a good knowledge of R saves a lot of time!! I've been
able to do in a few minutes a process that ...
Hi.
General idea:
1. Open your file as a connection, i.e. con <- file(pathname, open = "r")
2. Generate a row-to-(file offset, row length) map of your text file,
i.e. numeric vectors 'fileOffsets' and 'rowLengths'. Use readBin()
for this. You build this up as you go by reading the file in chunks
(a sketch follows below).
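The rest of the message is cut off in the archive. A minimal sketch of the offset-map idea under my own assumptions (the chunk size, the variable names beyond those Henrik gives, and "\n"-terminated lines are all mine, not the original code):

con <- file(pathname, open = "rb")        # pathname assumed defined
chunkSize <- 1e6
fileOffsets <- 0                          # offset of row 1
pos <- 0
repeat {
  bytes <- readBin(con, what = "raw", n = chunkSize)
  if (length(bytes) == 0) break
  nl <- which(bytes == charToRaw("\n"))   # newline positions in this chunk
  fileOffsets <- c(fileOffsets, pos + nl) # each newline starts the next row
  pos <- pos + length(bytes)
}
fileOffsets <- fileOffsets[fileOffsets < pos]  # drop phantom row after last "\n"
rowLengths <- diff(c(fileOffsets, pos))        # lengths include the "\n"
# Random access: seek() to each sampled row and read it.
for (i in sort(sample(length(rowLengths), 3000))) {
  seek(con, where = fileOffsets[i], origin = "start")
  row <- readChar(con, nchars = rowLengths[i])
}
close(con)

Sorting the sample before seeking keeps the reads sequential, which is also the point of Henrik's follow-up below.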
Forgot to say: in your script you're reading the rows unordered,
meaning you're jumping around in the file, and there is no way the
hardware or the file caching system can optimize that. I'm pretty
sure you would see a substantial speedup if you did:
sel <- sort(sel)
/H
On 2/2/07, Henrik Bengtsson wrote:
On Fri, 2007-02-02 at 18:40 +0100, juli g. pausas wrote:
Hi all,
I have a large file (1.8 GB) with 900,000 lines that I would like to read.
Each line is a string of characters. Specifically, I would like to randomly
select 3000 lines. For smaller files, what I'm doing is:
trs <- scan(myfile, ...
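The scan() call is truncated in the archive. A plausible reconstruction of the small-file approach (the exact arguments are my guess):

trs <- scan(myfile, what = "character", sep = "\n")  # read the whole file
trs3000 <- trs[sample(length(trs), 3000)]            # then sample 3000 lines

This reads everything into memory first, which is exactly what stops working at 1.8 GB.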
I had a file with 200,000 lines in it and it took 1 second to select
3000 sample lines out of it. One trick is to use a connection
so that the file stays open, and then just 'skip' to the next record
to read:
input <- file("/tempxx.txt", "r")
sel <- 3000
remaining <- 200000  # lines in the file
# get the ...
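The rest of Jim's code is cut off. A sketch of the read-once, diff-and-skip idea it describes (everything beyond the three lines above is my own reconstruction):

takes <- sort(sample(remaining, sel))     # line numbers to keep, in order
skips <- diff(c(0, takes)) - 1            # lines to discard before each read
result <- character(sel)
for (i in seq_len(sel)) {
  if (skips[i] > 0) readLines(input, n = skips[i])  # skip unwanted records
  result[i] <- readLines(input, n = 1)              # read the wanted one
}
close(input)

Because the connection stays open, each readLines() continues from where the last one stopped, so the file is traversed exactly once.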
I suspect that reading from a connection in chunks of, say, 10,000 rows and
discarding those you do not want would be simpler and at least as quick.
Not least because seek() on Windows is so unreliable.
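A sketch of that chunked alternative (the chunk size and all names are my own assumptions):

con <- file(myfile, open = "r")
want <- sort(sample(900000, 3000))        # line numbers to keep
keep <- character(0)
done <- 0                                 # lines consumed so far
repeat {
  chunk <- readLines(con, n = 10000)
  if (length(chunk) == 0) break
  hit <- want[want > done & want <= done + length(chunk)]
  keep <- c(keep, chunk[hit - done])      # positions within this chunk
  done <- done + length(chunk)
}
close(con)

No seek() involved, so it sidesteps the Windows problem entirely.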
On Fri, 2007-02-02 at 12:42 -0600, Marc Schwartz wrote:
On Fri, 2007-02-02 at 12:32 -0600, Marc Schwartz wrote:
Juli,
I don't have a file to test this on, so caveat emptor.
The problem with the approach above is that you are re-reading the
source file once per line, or 3000 times ...
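The message is cut off, but the pattern being criticized presumably looked something like this (my reconstruction): one scan() per sampled line, so the whole file is traversed from the top 3000 times.

sel <- sample(900000, 3000)
trs <- character(length(sel))
for (i in seq_along(sel)) {               # each scan() restarts at byte 0
  trs[i] <- scan(myfile, what = "character", sep = "\n",
                 skip = sel[i] - 1, nlines = 1, quiet = TRUE)
}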