Re: [R] Reading a csv file row by row

2007-04-06 Thread Yuchen Luo
Hi, my friends.
When a data file is large, loading the whole file into memory all at
once is not feasible. A feasible approach is to read one row, process
it, store the result, and then read the next row.


In Fortran, by default, the 'read' statement reads one line of a file,
which is convenient: when the same 'read' is executed again, it reads
the next line of the file.

I tried to replicate such row-by-row reading in R. I used scan() to do
so with the skip=xxx option. It takes only seconds when the number of
rows is within 1000, but it takes hours when the file has tens of
thousands of rows. I think this is because every time R reads, it has
to start from the first row of the file and count xxx rows to find the
one it needs, so locating the target row takes longer and longer as
xxx grows.

Is there a solution to this problem?

Your help will be highly appreciated!



Re: [R] Reading a csv file row by row

2007-04-06 Thread Martin Becker
readLines() (mentioned in the 'See Also' section of ?scan, with the
hint that it reads a file a line at a time) should work.
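
For example (a minimal sketch; the file name and the per-row
processing are placeholders):

con <- file("mydata.csv", open = "r")   # placeholder file name
header <- readLines(con, n = 1)         # consume the header line, if any
while (length(line <- readLines(con, n = 1)) > 0) {
  fields <- strsplit(line, ",")[[1]]    # split the csv row into fields
  ## ... process 'fields' and store the result ...
}
close(con)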

Regards,
  Martin



Re: [R] Reading a csv file row by row

2007-04-06 Thread ronggui
And _file()_ is helpful in such situations.

R/S-PLUS Fundamentals and Programming Techniques by Thomas Lumley has
something relevant on page 185 (208 pages in total).

I believe you can find it by googling.
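
For example, a connection opened with file() keeps its position
between calls, so each scan() picks up where the previous one stopped
(a sketch; the file name, chunk size, and all-numeric fields are
assumptions):

con <- file("mydata.csv", open = "r")   # placeholder file name
repeat {
  ## scan() resumes at the connection's current position;
  ## read up to 500 lines per chunk
  chunk <- scan(con, what = numeric(), sep = ",",
                nlines = 500, quiet = TRUE)
  if (length(chunk) == 0) break         # end of file
  ## ... process 'chunk' and store the results ...
}
close(con)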



On 4/6/07, Martin Becker [EMAIL PROTECTED] wrote:
 readLines() (mentioned in the 'See Also' section of ?scan, with the
 hint that it reads a file a line at a time) should work.




-- 
Ronggui Huang
Department of Sociology
Fudan University, Shanghai, China



Re: [R] Reading a csv file row by row

2007-04-06 Thread Henrik Bengtsson
Hi.

On 4/6/07, Yuchen Luo [EMAIL PROTECTED] wrote:
 I tried to replicate such row-by-row reading in R. I used scan() to do
 so with the skip=xxx option. It takes only seconds when the number of
 rows is within 1000, but it takes hours when the file has tens of
 thousands of rows. I think this is because every time R reads, it has
 to start from the first row of the file and count xxx rows to find the
 one it needs.

Yes, to skip rows, scan() still needs to locate every single row (each
line feed/carriage return).  The only gain is that it does not have to
parse and store the contents of the skipped lines.

One solution is to first go through the file and register the file
position of the first character in every line, and then make use of
this in subsequent reads.  In order to do this, you have to work with
an opened connection and pass that to scan instead.  Rough sketch:

con <- file(pathname, open = "r")

# Scan the file for the first position of every line
rowStarts <- scanForRowStarts(con)

# Skip to a certain row and read a chunk of lines
# ('row' is the index of the first row wanted; seek() takes one offset):
seek(con, where = rowStarts[row], origin = "start", rw = "r")
data <- scan(con, ..., skip = 0, nlines = rowsPerChunk)

close(con)

That's the idea.  The tricky part is to get scanForRowStarts()
correct.  After reading a line you can always query the connection for
the current file position using:

  pos <- seek(con, rw = "r")

so you could always iterate between readLines(con, n = 1) and
pos <- c(pos, seek(con, rw = "r")), but there might be a faster way.
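
A minimal sketch of such a helper (scanForRowStarts() is just this
sketch's name for it, not an existing function):

scanForRowStarts <- function(con) {
  ## record the file offset of the first character of every line;
  ## offset 0 is the start of line 1
  pos <- 0
  repeat {
    line <- readLines(con, n = 1)
    if (length(line) == 0) break           # end of file
    pos <- c(pos, seek(con, rw = "r"))     # start of the next line
  }
  seek(con, where = 0, origin = "start")   # rewind for later reads
  pos[-length(pos)]                        # drop the trailing EOF offset
}

(Growing 'pos' with c() is itself slow for very long files;
preallocating or reading in blocks would speed this up, but it shows
the idea.)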

Cheers

/Henrik



