Re: [R] reading very large files

2007-02-04 Thread juli g. pausas
Hi all,
The small modification was replacing
Write.Rows <- Chunk[Chunk.Sel - Cuts[i], ]   # (2nd line from the end)
by
Write.Rows <- Chunk[Chunk.Sel - Cuts[i]]     # Chunk has one dimension only

Running times:
- For the Jim Holtman solution (reading once, using diff and skipping
from one record to the other):
[1]   49.80    0.27   50.96      NA      NA

- For the Marc Schwartz solution (reading in chunks of 100,000):
[1] 1203.94    9.12 1279.04      NA      NA

Both in R 2.4.1, under Windows:
> R.version
               _
platform       i386-pc-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status
major          2
minor          4.1
year           2006
month          12
day            18
svn rev        40228
language       R
version.string R version 2.4.1 (2006-12-18)


Juli




On 03/02/07, Marc Schwartz [EMAIL PROTECTED] wrote:
 On Sat, 2007-02-03 at 19:06 +0100, juli g. pausas wrote:
  Thanks so much for your help and comments.
  The approach proposed by Jim Holtman was the simplest and fastest. The
  approach by Marc Schwartz also worked (after a very small
  modification).
 
  It is clear that a good knowledge of R saves a lot of time! I've been
  able to do in a few minutes a process that was only a quarter done
  after 25 hours!
 
  Many thanks
 
  Juli

 Juli,

 Just out of curiosity, what change did you make?

 Also, what were the running times for the solutions?

 Regards,

 Marc





-- 
http://www.ceam.es/pausas

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] reading very large files

2007-02-04 Thread juli g. pausas
Thanks so much for your help and comments.
The approach proposed by Jim Holtman was the simplest and fastest. The
approach by Marc Schwartz also worked (after a very small
modification).

It is clear that a good knowledge of R saves a lot of time! I've been
able to do in a few minutes a process that was only a quarter done
after 25 hours!

Many thanks

Juli


On 02/02/07, juli g. pausas [EMAIL PROTECTED] wrote:
 Hi all,
 I have a large file (1.8 GB) with 900,000 lines that I would like to read.
 Each line is a character string. Specifically, I would like to randomly
 select 3000 lines. For smaller files, what I'm doing is:

 trs <- scan("myfile", what = character(), sep = "\n")
 trs <- trs[sample(length(trs), 3000)]

 And this works OK; however, my computer seems unable to handle the 1.8 GB
 file.
 I thought of an alternative way that does not require reading the whole file:

 sel <- sample(1:900000, 3000)
 for (i in 1:3000) {
   un <- scan("myfile", what = character(), sep = "\n", skip = sel[i], nlines = 1)
   write(un, "myfile_short", append = TRUE)
 }

 This works on my computer; however, it is extremely slow, since it reads
 one line at a time. It has been running for 25 hours and I think it has
 done less than half of the file (yes, I probably do not have a very good
 computer, and I'm working under Windows ...).
 So my question is: do you know any faster way to do this?
 Thanks in advance

 Juli

 --
  http://www.ceam.es/pausas



-- 
http://www.ceam.es/pausas



Re: [R] reading very large files

2007-02-03 Thread Marc Schwartz
On Sat, 2007-02-03 at 19:06 +0100, juli g. pausas wrote:
 Thanks so much for your help and comments.
 The approach proposed by Jim Holtman was the simplest and fastest. The
 approach by Marc Schwartz also worked (after a very small
 modification).
 
 It is clear that a good knowledge of R saves a lot of time! I've been
 able to do in a few minutes a process that was only a quarter done
 after 25 hours!
 
 Many thanks
 
 Juli

Juli,

Just out of curiosity, what change did you make?

Also, what were the running times for the solutions?

Regards,

Marc



Re: [R] reading very large files

2007-02-02 Thread Henrik Bengtsson
Hi.

General idea:

1. Open your file as a connection, i.e. con <- file(pathname, open="r")

2. Generate a row-to-(file offset, row length) map of your text file,
i.e. numeric vectors 'fileOffsets' and 'rowLengths'.  Use readBin()
for this. You build this up as you go by reading the file in chunks,
meaning you can handle files of any size.  You can store this lookup
map to a file for future R sessions.

3. Sample a set of rows r = (r1, r2, ..., rR), i.e. rows =
sample(length(fileOffsets)).

4. Look up the file offsets and row lengths for these rows, i.e.
offsets = fileOffsets[rows].  lengths = rowLengths[rows].

5. In case your subset of rows is not ordered, it is wise to order
them first to speed things up.  If order is important, keep track of
the ordering and re-order them at the end.

6. For each row r, use seek(con=con, where=offsets[r]) to jump to the
start of the row.  Use readBin(..., n=lengths[r]) to read the data.

7. Repeat from (3).
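
As an illustration only (untested), steps 1-6 might look roughly like
the sketch below; the file name, chunk size, and sample size are
assumptions, and each row is assumed to end in a single "\n":

## Step 2: build the (file offset, row length) map with readBin()
con <- file("myfile", open = "rb")
fileOffsets <- numeric(0)   # byte offset of the start of each row
rowLengths  <- numeric(0)   # row length in bytes, including the "\n"
chunkOffset <- 0            # file offset of the current chunk
rowStart    <- 0            # file offset of the current row
repeat {
  bytes <- readBin(con, what = "raw", n = 1e6)   # ~1 MB per chunk
  if (length(bytes) == 0) break
  for (p in which(bytes == as.raw(10L))) {       # positions of "\n"
    rowEnd <- chunkOffset + p
    fileOffsets <- c(fileOffsets, rowStart)      # preallocate in real use
    rowLengths  <- c(rowLengths, rowEnd - rowStart)
    rowStart <- rowEnd
  }
  chunkOffset <- chunkOffset + length(bytes)
}

## Steps 3-6: sample rows, order them, then seek() and read each one
rows <- sort(sample(length(fileOffsets), 3000))
out <- character(length(rows))
for (j in seq(along = rows)) {
  seek(con, where = fileOffsets[rows[j]])
  out[j] <- rawToChar(readBin(con, what = "raw", n = rowLengths[rows[j]]))
}
close(con)
## each element of 'out' still ends in "\n"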

/Henrik

On 2/2/07, juli g. pausas [EMAIL PROTECTED] wrote:
 Hi all,
 I have a large file (1.8 GB) with 900,000 lines that I would like to read.
 Each line is a character string. Specifically, I would like to randomly
 select 3000 lines. For smaller files, what I'm doing is:

 trs <- scan("myfile", what = character(), sep = "\n")
 trs <- trs[sample(length(trs), 3000)]

 And this works OK; however, my computer seems unable to handle the 1.8 GB
 file.
 I thought of an alternative way that does not require reading the whole file:

 sel <- sample(1:900000, 3000)
 for (i in 1:3000) {
   un <- scan("myfile", what = character(), sep = "\n", skip = sel[i], nlines = 1)
   write(un, "myfile_short", append = TRUE)
 }

 This works on my computer; however, it is extremely slow, since it reads
 one line at a time. It has been running for 25 hours and I think it has
 done less than half of the file (yes, I probably do not have a very good
 computer, and I'm working under Windows ...).
 So my question is: do you know any faster way to do this?
 Thanks in advance

 Juli

 --
 http://www.ceam.es/pausas






Re: [R] reading very large files

2007-02-02 Thread Henrik Bengtsson
Forgot to say: in your script you're reading the rows unordered,
meaning you're jumping around in the file and there is no way the
hardware or the file caching system can optimize that.  I'm pretty
sure you would see a substantial speedup if you did:

sel <- sort(sel)

/H

On 2/2/07, Henrik Bengtsson [EMAIL PROTECTED] wrote:
 Hi.

 General idea:

 1. Open your file as a connection, i.e. con <- file(pathname, open="r")

 2. Generate a row-to-(file offset, row length) map of your text file,
 i.e. numeric vectors 'fileOffsets' and 'rowLengths'.  Use readBin()
 for this. You build this up as you go by reading the file in chunks,
 meaning you can handle files of any size.  You can store this lookup
 map to a file for future R sessions.

 3. Sample a set of rows r = (r1, r2, ..., rR), i.e. rows =
 sample(length(fileOffsets)).

 4. Look up the file offsets and row lengths for these rows, i.e.
 offsets = fileOffsets[rows].  lengths = rowLengths[rows].

 5. In case your subset of rows is not ordered, it is wise to order
 them first to speed things up.  If order is important, keep track of
 the ordering and re-order them at the end.

 6. For each row r, use seek(con=con, where=offsets[r]) to jump to the
 start of the row.  Use readBin(..., n=lengths[r]) to read the data.

 7. Repeat from (3).

 /Henrik

 On 2/2/07, juli g. pausas [EMAIL PROTECTED] wrote:
  Hi all,
  I have a large file (1.8 GB) with 900,000 lines that I would like to read.
  Each line is a character string. Specifically, I would like to randomly
  select 3000 lines. For smaller files, what I'm doing is:

  trs <- scan("myfile", what = character(), sep = "\n")
  trs <- trs[sample(length(trs), 3000)]

  And this works OK; however, my computer seems unable to handle the 1.8 GB
  file.
  I thought of an alternative way that does not require reading the whole file:

  sel <- sample(1:900000, 3000)
  for (i in 1:3000) {
    un <- scan("myfile", what = character(), sep = "\n", skip = sel[i], nlines = 1)
    write(un, "myfile_short", append = TRUE)
  }

  This works on my computer; however, it is extremely slow, since it reads
  one line at a time. It has been running for 25 hours and I think it has
  done less than half of the file (yes, I probably do not have a very good
  computer, and I'm working under Windows ...).
  So my question is: do you know any faster way to do this?
  Thanks in advance
 
  Juli
 
  --
  http://www.ceam.es/pausas
 
 
 




Re: [R] reading very large files

2007-02-02 Thread Marc Schwartz
On Fri, 2007-02-02 at 18:40 +0100, juli g. pausas wrote:
 Hi all,
 I have a large file (1.8 GB) with 900,000 lines that I would like to read.
 Each line is a character string. Specifically, I would like to randomly
 select 3000 lines. For smaller files, what I'm doing is:

 trs <- scan("myfile", what = character(), sep = "\n")
 trs <- trs[sample(length(trs), 3000)]

 And this works OK; however, my computer seems unable to handle the 1.8 GB
 file.
 I thought of an alternative way that does not require reading the whole file:

 sel <- sample(1:900000, 3000)
 for (i in 1:3000) {
   un <- scan("myfile", what = character(), sep = "\n", skip = sel[i], nlines = 1)
   write(un, "myfile_short", append = TRUE)
 }

 This works on my computer; however, it is extremely slow, since it reads
 one line at a time. It has been running for 25 hours and I think it has
 done less than half of the file (yes, I probably do not have a very good
 computer, and I'm working under Windows ...).
 So my question is: do you know any faster way to do this?
 Thanks in advance
 
 Juli


Juli,

I don't have a file to test this on, so caveat emptor.

The problem with the approach above is that you are re-reading the
source file once per line, or 3000 times.  In addition, each read is
likely going through half the file on average to locate the randomly
selected line. Thus, the reality is that you are probably reading on the
order of:

> 3000 * 450000
[1] 1.35e+09

lines in the file, which of course is going to be quite slow.

In addition, you are also writing to the target file 3000 times.

The basic premise of the approach below is that you are in effect
creating a sequential file cache in an R object: read large chunks of
the source file into the cache, randomly select rows within the
cache, and then write out the selected rows.

Thus, if you can read 100,000 rows at once, you would have 9 reads of
the source file, and 9 writes of the target file.

The key thing here is to ensure that the offsets within the cache and
the corresponding random row values are properly set.

Here's the code:

# Generate the random values
sel <- sample(1:900000, 3000)

# Set up a sequence for the cache chunks
# Presume you can read 100,000 rows at once
Cuts <- seq(0, 900000, 100000)

# Loop over the length of Cuts, less 1
for (i in seq(along = Cuts[-1]))
{
  # Get a 100,000 row chunk, skipping rows
  # as appropriate for each subsequent chunk
  Chunk <- scan("myfile", what = character(), sep = "\n",
                skip = Cuts[i], nlines = 100000)

  # set up a row sequence for the current
  # chunk
  Rows <- (Cuts[i] + 1):(Cuts[i + 1])

  # Are any of the random values in the
  # current chunk?
  Chunk.Sel <- sel[which(sel %in% Rows)]

  # If so, get them
  if (length(Chunk.Sel) > 0)
  {
    Write.Rows <- Chunk[sel - Cuts[i]]

    # Now write them out
    write(Write.Rows, "myfile_short", append = TRUE)
  }
}


As noted, I have not tested this, so there may yet be additional ways to
save time with file seeks, etc.

HTH,

Marc Schwartz



Re: [R] reading very large files

2007-02-02 Thread jim holtman
I had a file with 200,000 lines in it and it took 1 second to select
3000 sample lines out of it.  One of the keys is to use a connection,
so that the file stays open and you just 'skip' ahead to the next
record to read:



> input <- file("/tempxx.txt", "r")
> sel <- 3000
> remaining <- 200000
> # get the record numbers to select
> recs <- sort(sample(1:remaining, sel))
> # compute number to skip on each read; account for the record just read
> skip <- diff(c(1, recs)) - 1
> # allocate my data
> mysel <- vector('character', sel)
> system.time({
+ for (i in 1:sel){
+     mysel[i] <- scan(input, what = "", sep = "\n", skip = skip[i], n = 1, quiet = TRUE)
+ }
+ })
[1] 0.97 0.02 1.00   NA   NA




On 2/2/07, juli g. pausas [EMAIL PROTECTED] wrote:
 Hi all,
 I have a large file (1.8 GB) with 900,000 lines that I would like to read.
 Each line is a character string. Specifically, I would like to randomly
 select 3000 lines. For smaller files, what I'm doing is:

 trs <- scan("myfile", what = character(), sep = "\n")
 trs <- trs[sample(length(trs), 3000)]

 And this works OK; however, my computer seems unable to handle the 1.8 GB
 file.
 I thought of an alternative way that does not require reading the whole file:

 sel <- sample(1:900000, 3000)
 for (i in 1:3000) {
   un <- scan("myfile", what = character(), sep = "\n", skip = sel[i], nlines = 1)
   write(un, "myfile_short", append = TRUE)
 }

 This works on my computer; however, it is extremely slow, since it reads
 one line at a time. It has been running for 25 hours and I think it has
 done less than half of the file (yes, I probably do not have a very good
 computer, and I'm working under Windows ...).
 So my question is: do you know any faster way to do this?
 Thanks in advance

 Juli

 --
 http://www.ceam.es/pausas





-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?



Re: [R] reading very large files

2007-02-02 Thread Prof Brian Ripley
I suspect that reading from a connection in chunks of say 10,000 rows and 
discarding those you do not want would be simpler and at least as quick.
Not least because seek() on Windows is so unreliable.
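
As a rough, untested sketch of that idea (the file name, total number
of lines, and chunk size here are assumptions for illustration):

con <- file("myfile", open = "r")
sel <- sort(sample(900000, 3000))    # rows to keep, in increasing order
kept <- character(0)
nread <- 0
repeat {
  chunk <- readLines(con, n = 10000)        # read 10,000 rows at a time
  if (length(chunk) == 0) break
  want <- sel[sel > nread & sel <= nread + length(chunk)]
  kept <- c(kept, chunk[want - nread])      # keep only the sampled rows
  nread <- nread + length(chunk)
}
close(con)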

On Fri, 2 Feb 2007, Henrik Bengtsson wrote:

 Hi.

 General idea:

 1. Open your file as a connection, i.e. con <- file(pathname, open="r")

 2. Generate a row-to-(file offset, row length) map of your text file,
 i.e. numeric vectors 'fileOffsets' and 'rowLengths'.  Use readBin()
 for this. You build this up as you go by reading the file in chunks,
 meaning you can handle files of any size.  You can store this lookup
 map to a file for future R sessions.

 3. Sample a set of rows r = (r1, r2, ..., rR), i.e. rows =
 sample(length(fileOffsets)).

 4. Look up the file offsets and row lengths for these rows, i.e.
 offsets = fileOffsets[rows].  lengths = rowLengths[rows].

 5. In case your subset of rows is not ordered, it is wise to order
 them first to speed things up.  If order is important, keep track of
 the ordering and re-order them at the end.

 6. For each row r, use seek(con=con, where=offsets[r]) to jump to the
 start of the row.  Use readBin(..., n=lengths[r]) to read the data.

 7. Repeat from (3).

 /Henrik

 On 2/2/07, juli g. pausas [EMAIL PROTECTED] wrote:
 Hi all,
 I have a large file (1.8 GB) with 900,000 lines that I would like to read.
 Each line is a character string. Specifically, I would like to randomly
 select 3000 lines. For smaller files, what I'm doing is:

 trs <- scan("myfile", what = character(), sep = "\n")
 trs <- trs[sample(length(trs), 3000)]

 And this works OK; however, my computer seems unable to handle the 1.8 GB
 file.
 I thought of an alternative way that does not require reading the whole file:

 sel <- sample(1:900000, 3000)
 for (i in 1:3000) {
   un <- scan("myfile", what = character(), sep = "\n", skip = sel[i], nlines = 1)
   write(un, "myfile_short", append = TRUE)
 }

 This works on my computer; however, it is extremely slow, since it reads
 one line at a time. It has been running for 25 hours and I think it has
 done less than half of the file (yes, I probably do not have a very good
 computer, and I'm working under Windows ...).
 So my question is: do you know any faster way to do this?
 Thanks in advance

 Juli

 --
 http://www.ceam.es/pausas






-- 
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel: +44 1865 272861 (self)
1 South Parks Road,                    +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax: +44 1865 272595



Re: [R] reading very large files

2007-02-02 Thread Marc Schwartz
On Fri, 2007-02-02 at 12:32 -0600, Marc Schwartz wrote:
 On Fri, 2007-02-02 at 18:40 +0100, juli g. pausas wrote:
  Hi all,
  I have a large file (1.8 GB) with 900,000 lines that I would like to read.
  Each line is a character string. Specifically, I would like to randomly
  select 3000 lines. For smaller files, what I'm doing is:

  trs <- scan("myfile", what = character(), sep = "\n")
  trs <- trs[sample(length(trs), 3000)]

  And this works OK; however, my computer seems unable to handle the 1.8 GB
  file.
  I thought of an alternative way that does not require reading the whole file:

  sel <- sample(1:900000, 3000)
  for (i in 1:3000) {
    un <- scan("myfile", what = character(), sep = "\n", skip = sel[i], nlines = 1)
    write(un, "myfile_short", append = TRUE)
  }

  This works on my computer; however, it is extremely slow, since it reads
  one line at a time. It has been running for 25 hours and I think it has
  done less than half of the file (yes, I probably do not have a very good
  computer, and I'm working under Windows ...).
  So my question is: do you know any faster way to do this?
  Thanks in advance
  
  Juli
 
 
 Juli,
 
 I don't have a file to test this on, so caveat emptor.
 
 The problem with the approach above is that you are re-reading the
 source file once per line, or 3000 times.  In addition, each read is
 likely going through half the file on average to locate the randomly
 selected line. Thus, the reality is that you are probably reading on the
 order of:

 > 3000 * 450000
 [1] 1.35e+09

 lines in the file, which of course is going to be quite slow.
 
 In addition, you are also writing to the target file 3000 times.
 
 The basic premise of the approach below is that you are in effect
 creating a sequential file cache in an R object: read large chunks of
 the source file into the cache, randomly select rows within the
 cache, and then write out the selected rows.
 
 Thus, if you can read 100,000 rows at once, you would have 9 reads of
 the source file, and 9 writes of the target file.
 
 The key thing here is to ensure that the offsets within the cache and
 the corresponding random row values are properly set.
 
 Here's the code:
 
 # Generate the random values
 sel <- sample(1:900000, 3000)

 # Set up a sequence for the cache chunks
 # Presume you can read 100,000 rows at once
 Cuts <- seq(0, 900000, 100000)

 # Loop over the length of Cuts, less 1
 for (i in seq(along = Cuts[-1]))
 {
   # Get a 100,000 row chunk, skipping rows
   # as appropriate for each subsequent chunk
   Chunk <- scan("myfile", what = character(), sep = "\n",
                 skip = Cuts[i], nlines = 100000)

   # set up a row sequence for the current
   # chunk
   Rows <- (Cuts[i] + 1):(Cuts[i + 1])

   # Are any of the random values in the
   # current chunk?
   Chunk.Sel <- sel[which(sel %in% Rows)]

   # If so, get them
   if (length(Chunk.Sel) > 0)
   {
     Write.Rows <- Chunk[sel - Cuts[i]]


Quick typo correction:

The last line above should be:

  Write.Rows <- Chunk[sel - Cuts[i], ]


     # Now write them out
     write(Write.Rows, "myfile_short", append = TRUE)
   }
 }
 
 
 As noted, I have not tested this, so there may yet be additional ways to
 save time with file seeks, etc.

If that's the only error in the code...   :-)

Marc



Re: [R] reading very large files

2007-02-02 Thread Marc Schwartz
On Fri, 2007-02-02 at 12:42 -0600, Marc Schwartz wrote:
 On Fri, 2007-02-02 at 12:32 -0600, Marc Schwartz wrote:

  Juli,
  
  I don't have a file to test this on, so caveat emptor.
  
  The problem with the approach above is that you are re-reading the
  source file once per line, or 3000 times.  In addition, each read is
  likely going through half the file on average to locate the randomly
  selected line. Thus, the reality is that you are probably reading on the
  order of:

  > 3000 * 450000
  [1] 1.35e+09

  lines in the file, which of course is going to be quite slow.
  
  In addition, you are also writing to the target file 3000 times.
  
  The basic premise of the approach below is that you are in effect
  creating a sequential file cache in an R object: read large chunks of
  the source file into the cache, randomly select rows within the
  cache, and then write out the selected rows.
  
  Thus, if you can read 100,000 rows at once, you would have 9 reads of
  the source file, and 9 writes of the target file.
  
  The key thing here is to ensure that the offsets within the cache and
  the corresponding random row values are properly set.
  
  Here's the code:
  
  # Generate the random values
  sel <- sample(1:900000, 3000)

  # Set up a sequence for the cache chunks
  # Presume you can read 100,000 rows at once
  Cuts <- seq(0, 900000, 100000)

  # Loop over the length of Cuts, less 1
  for (i in seq(along = Cuts[-1]))
  {
    # Get a 100,000 row chunk, skipping rows
    # as appropriate for each subsequent chunk
    Chunk <- scan("myfile", what = character(), sep = "\n",
                  skip = Cuts[i], nlines = 100000)

    # set up a row sequence for the current
    # chunk
    Rows <- (Cuts[i] + 1):(Cuts[i + 1])

    # Are any of the random values in the
    # current chunk?
    Chunk.Sel <- sel[which(sel %in% Rows)]

    # If so, get them
    if (length(Chunk.Sel) > 0)
    {
      Write.Rows <- Chunk[sel - Cuts[i]]
 
 
 Quick typo correction:
 
 The last line above should be:
 
    Write.Rows <- Chunk[sel - Cuts[i], ]
 
 
      # Now write them out
      write(Write.Rows, "myfile_short", append = TRUE)
}
  }
  

OK, I knew it was too good to be true...

One more correction on that same line:

   Write.Rows <- Chunk[Chunk.Sel - Cuts[i], ]


For clarity, here is the full set of code:

# Generate the random values
sel <- sample(900000, 3000)

# Set up a sequence for the cache chunks
# Presume you can read 100,000 rows at once
Cuts <- seq(0, 900000, 100000)

# Loop over the length of Cuts, less 1
for (i in seq(along = Cuts[-1]))
{
  # Get a 100,000 row chunk, skipping rows
  # as appropriate for each subsequent chunk
  Chunk <- scan("myfile", what = character(), sep = "\n",
                skip = Cuts[i], nlines = 100000)

  # set up a row sequence for the current
  # chunk
  Rows <- (Cuts[i] + 1):(Cuts[i + 1])

  # Are any of the random values in the
  # current chunk?
  Chunk.Sel <- sel[which(sel %in% Rows)]

  # If so, get them
  if (length(Chunk.Sel) > 0)
  {
    Write.Rows <- Chunk[Chunk.Sel - Cuts[i], ]

    # Now write them out
    write(Write.Rows, "myfile_short", append = TRUE)
  }
}


Regards,

Marc
