Try reading the lines in with readLines(), then count the number of each type
of quote in each line. Find the lines where either count is odd and
investigate those.
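
Something along these lines - an untested sketch in base R, with file_name
standing in for your path - should flag the suspect lines:

lines <- readLines(file_name)
# quote count per line: length before minus length after stripping the quote
n_dquote <- nchar(lines) - nchar(gsub("\"", "", lines, fixed = TRUE))
n_squote <- nchar(lines) - nchar(gsub("'", "", lines, fixed = TRUE))
# rows with an odd count of either quote character are the ones to inspect
suspects <- which(n_dquote %% 2 == 1 | n_squote %% 2 == 1)
lines[suspects]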

On Mon, Apr 8, 2024, 15:24 Dave Dixon <ddi...@swcp.com> wrote:

> I solved the mystery, but not the problem. The problem is that there's
> an unclosed quote somewhere in those 5 additional records I'm trying to
> access. So read.csv is reading million-character fields. It's slow at
> that. That mystery solved.
>
> However, the problem persists: how to fix what is obvious to the
> naked eye - a quote not adjacent to a comma - but that read.csv can't
> handle. readLines followed by read.csv(text = ...) works great because,
> in that case, read.csv knows where the record terminates. Meaning,
> read.csv throws an error that I can catch and handle with a quick and
> clean regular expression.
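>
> Roughly what I have in mind - a sketch only, since read.csv may surface
> the bad quote as a warning rather than an error, and the gsub pattern
> below is just a placeholder for the real fix:
>
> parse_line <- function(x) {
>   fix <- function(cond) {
>     # placeholder fix: drop any quote not adjacent to a comma, re-parse
>     read.csv(text = gsub('([^,])"([^,])', "\\1\\2", x), header = FALSE)
>   }
>   tryCatch(read.csv(text = x, header = FALSE), error = fix, warning = fix)
> }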
>
> Thanks, I'll take a look at vroom.
>
> -dave
>
> On 4/8/24 09:18, Stevie Pederson wrote:
> > Hi Dave,
> >
> > That's rather frustrating. I've found vroom (from the package vroom)
> > to be helpful with large files like this.
> >
> > Does the following give you any better luck?
> >
> > vroom(file_name, delim = ",", skip = 2459465, n_max = 5)
> >
> > Of course, when you know you've got errors & the files are big like
> > that it can take a bit of work resolving things. The command line
> > tools awk & sed might even be a good plan for finding lines that have
> > errors & figuring out a fix, but I certainly don't envy you.
> >
> > All the best
> >
> > Stevie
> >
> > On Tue, 9 Apr 2024 at 00:36, Dave Dixon <ddi...@swcp.com> wrote:
> >
> >     Greetings,
> >
> >     I have a csv file of 76 fields and about 4 million records. I know
> >     that some of the records have errors - unmatched quotes,
> >     specifically. Reading the file with readLines and parsing the lines
> >     with read.csv(text = ...) is really slow. I know that the first
> >     2459465 records are good. So I try this:
> >
> >      > startTime <- Sys.time()
> >      > first_records <- read.csv(file_name, nrows = 2459465)
> >      > endTime <- Sys.time()
> >      > cat("elapsed time = ", endTime - startTime, "\n")
> >
> >     elapsed time =   24.12598
> >
> >      > startTime <- Sys.time()
> >      > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
> >      > endTime <- Sys.time()
> >      > cat("elapsed time = ", endTime - startTime, "\n")
> >
> >     This appears to never finish. I have been waiting over 20 minutes.
> >
> >     So why would (skip = 2459465, nrows = 5) take orders of magnitude
> >     longer than (nrows = 2459465)?
> >
> >     Thanks!
> >
> >     -dave
> >
> >     PS: readLines(n=2459470) takes 10.42731 seconds.
> >

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
