Apologies for the blank post. Too little caffeine at 5:30 AM.

On Oct 6, 2010, at 3:15 AM, Earl F. Glynn wrote:


I am trying to read a tab-delimited 1.25 GB file of 4,115,119 records, each
with 52 fields.

I am using R 2.11.0 on a 64-bit Windows 7 machine with 8 GB memory.

I have tried the following two statements, with the same results:

d <- read.delim(filename, as.is=TRUE)

d <- read.delim(filename, as.is=TRUE, nrows=4200000)

I have tried starting R with this command-line option, but that changed nothing:
--max-mem-size=6GB

Everything appeared to work fine until I studied frequency counts of the
fields and realized data were missing.

dim(d)
[1] 3388444      52

R read 3,388,444 records and missed 726,754 records, with no error
messages or exceptions. I plotted a chart using the data and only later
discovered that not all of the data were represented in it.

R didn't just read the first 3,388,444 records and quit.

Here's what I believe happened (based on frequency counts of the first field,
computed both from the data.frame in R and independently from another source):
* R read the first 1,866,296 records and then skipped 419,340 records.
* Next, R read 1,325,552 records and skipped 307,414 records.
* R read the last 196,596 records without any problems.

Questions:

Is there some memory-related parameter I should adjust that might explain
the behavior observed above?

Can't think of any.
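If you want to rule memory out anyway, you can look at the session's ceiling and current usage directly; a rough check only (memory.limit() is Windows-specific):

memory.limit()                        # current limit, in MB (Windows only)
gc()                                  # usage after a garbage collection
print(object.size(d), units = "Mb")   # size of what was actually read

A 64-bit session with 8 GB of RAM should not silently drop rows for lack of memory; the read.table family fails with an allocation error instead.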

Shouldn't read.delim catch this failure instead of silently dropping data?

More likely there are mismatched quote characters in your file, so some fields are accumulating large amounts of text. Once a quote is left open, read.delim keeps reading the following lines as part of that single quoted field, which collapses whole runs of records into one value without any error or warning. You should do some tabulations on your text fields with nchar-based functions; a rough sketch of that sort of check is below.
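Something along these lines, assuming filename and d are still defined as in your code above (the expected counts of 4,115,119 and 52 come from your description):

## 1. Count the fields per line the way read.delim would split them.
##    With an unterminated quote, count.fields() reports NA for the lines
##    the open quote spans, and the number of lines it sees falls short
##    of the true record count.
fc <- count.fields(filename, sep = "\t", comment.char = "")
length(fc)                           # compare with the expected 4,115,119
table(fc, useNA = "ifany")           # lines with other than 52 fields, or NA

## 2. Tabulate field widths in what was read; a runaway quoted field shows
##    up as a character column with an absurdly large maximum width.
char.cols <- sapply(d, is.character)
sapply(d[char.cols], function(x) max(nchar(x), na.rm = TRUE))

## 3. If quotes turn out to be the culprit, re-read with quote processing
##    turned off and compare the dimensions.
d2 <- read.delim(filename, as.is = TRUE, quote = "")
dim(d2)

If the per-line field counts show NAs or values other than 52, that points straight at a stray quote character; quote = "" simply tells read.delim not to treat quotes specially.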


Thanks for any help with this.

Earl F Glynn
Overland Park, KS

--

David Winsemius, MD
West Hartford, CT
