Hello, all.

I am working on a project with a large insurance claims dataset (~350Mb, about
5800 rows). It was supplied in a tilde (~) delimited format. I imported it into
a data frame in R using read.table, after setting memory.limit to the maximum
for my computer (4Gb).
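
For reference, the import was along these lines. This is a minimal sketch
rather than my exact script: header = TRUE is an assumption about the layout,
the memory.limit value is illustrative, and a tiny temporary file with made-up
contents stands in for the real data so the snippet runs on its own.

```r
# Hypothetical stand-in for the real ~350Mb tilde-delimited claims file.
path <- tempfile(fileext = ".txt")
writeLines(c("id~name~amount",
             "1~Smith~100.50",
             "2~Jones~75.00"), path)

# memory.limit(size = 4095)  # Windows-only; illustrative value, set before the real import

# header = TRUE is an assumption about the file layout.
claims <- read.table(path, sep = "~", header = TRUE,
                     stringsAsFactors = FALSE)
str(claims)
```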

The resulting data frame had 10 bad rows. The errors appear to be due to
read.table missing delimiter characters: several values are imported into a
single cell, and the remainder of that row and the next row are then run
together and garbled because of the resulting frame shift. For example, a
single cell might contain <datum>~ ~ <datum> ~<datum>, after which all the
cells of that row and the next are wrong.

To replicate, I tried the same import procedure on a smaller demographics data
set from the same supplier (only about 1Mb) and got the same kinds of errors (5
bad rows in about 3500). I also imported as much of the file as Excel would
hold and cross-checked. Excel did not produce the same errors, but it can't
handle the entire file. I have used read.table on a number of other formats
(mainly csv and tab-delimited) without such problems. So far it appears there's
something different about these files that produces the errors, but I can't see
what it would be.

Does anyone have any thoughts about what is going wrong? And is there a way,
short of manual correction, to fix it?

Thanks for all help,
~Pat.


Pat Carroll. 
what matters most is how well you walk through the fire. 
bukowski.