I solved the mystery, but not the problem. The problem is that there's an unclosed quote somewhere in those 5 additional records I'm trying to access. So read.csv is reading million-character fields. It's slow at that. That mystery solved.
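For anyone hitting the same thing, here is a minimal sketch of one way to locate a record with an unclosed quote: count the double-quote characters on each line and flag the lines where the count is odd. It assumes every record sits on one physical line (no legitimate embedded newlines inside quoted fields). The names count_quotes and sample_lines are made up for illustration, and the three sample records stand in for the real file.

```r
# Count the double-quote characters on each line of a character vector.
count_quotes <- function(x) {
  lengths(regmatches(x, gregexpr('"', x, fixed = TRUE)))
}

# Hypothetical sample data: the second record is missing its closing quote.
sample_lines <- c(
  '1,"Alice","ok"',
  '2,"Bob,missing close quote',
  '3,"Carol","ok"'
)

quote_counts <- count_quotes(sample_lines)

# A line with an odd number of quotes has an unmatched quote, under the
# one-record-per-line assumption. Doubled quotes ("") inside a field do
# not change the parity.
suspect <- which(quote_counts %% 2 == 1)
print(suspect)
```

On the real file, the same check could be run on the output of readLines(file_name, n = 2459470), looking only at the last few elements.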
However, the problem persists: how to fix what is obvious to the naked eye - a quote not adjacent to a comma - but that read.csv can't handle. readLines followed by read.csv(text= ) works great because, in that case, read.csv knows where the record terminates. Meaning, read.csv throws an error that I can catch and handle with a quick and clean regex.

Thanks, I'll take a look at vroom.

-dave

On 4/8/24 09:18, Stevie Pederson wrote:
> Hi Dave,
>
> That's rather frustrating. I've found vroom (from the package vroom)
> to be helpful with large files like this.
>
> Does the following give you any better luck?
>
> vroom(file_name, delim = ",", skip = 2459465, n_max = 5)
>
> Of course, when you know you've got errors & the files are big like
> that it can take a bit of work resolving things. The command line
> tools awk & sed might even be a good plan for finding lines that have
> errors & figuring out a fix, but I certainly don't envy you.
>
> All the best
>
> Stevie
>
> On Tue, 9 Apr 2024 at 00:36, Dave Dixon <ddi...@swcp.com> wrote:
> > Greetings,
> >
> > I have a csv file of 76 fields and about 4 million records. I know that
> > some of the records have errors - unmatched quotes, specifically.
> > Reading the file with readLines and parsing the lines with
> > read.csv(text = ...) is really slow. I know that the first 2459465
> > records are good. So I try this:
> >
> > > startTime <- Sys.time()
> > > first_records <- read.csv(file_name, nrows = 2459465)
> > > endTime <- Sys.time()
> > > cat("elapsed time = ", endTime - startTime, "\n")
> > elapsed time = 24.12598
> >
> > > startTime <- Sys.time()
> > > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
> > > endTime <- Sys.time()
> > > cat("elapsed time = ", endTime - startTime, "\n")
> >
> > This appears to never finish. I have been waiting over 20 minutes.
> >
> > So why would (skip = 2459465, nrows = 5) take orders of magnitude
> > longer than (nrows = 2459465) ?
> >
> > Thanks!
> >
> > -dave
> >
> > PS: readLines(n=2459470) takes 10.42731 seconds.
> >
> > ______________________________________________
> > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.