That's basically what I did:
1. Get the text lines using readLines.
2. Use tryCatch to parse each line using read.csv(text = ...).
3. In the catch, use gregexpr to find any quotes not adjacent to a
comma: gregexpr("[^,]\"[^,]", ...).
4. Escape any quotes found by adding a second quote (using str_sub from
stringr).
5. Parse the patched text using read.csv(text = ...).
6. Write out the parsed fields as I go along using write.table(...,
append = TRUE) so I'm not keeping too much in memory.
I went directly to tryCatch because there were 3.5 million records, and
I only expected a few to have errors.
I found only 6 bad records, but it had to be done to make the data file
usable with read.csv(), for the benefit of other researchers using these
data.
On 4/10/24 07:46, Rui Barradas wrote:
At 06:47 on 08/04/2024, Dave Dixon wrote:
Greetings,
I have a csv file of 76 fields and about 4 million records. I know
that some of the records have errors - unmatched quotes,
specifically. Reading the file with readLines and parsing the lines
with read.csv(text = ...) is really slow. I know that the first
2459465 records are good. So I try this:
> startTime <- Sys.time()
> first_records <- read.csv(file_name, nrows = 2459465)
> endTime <- Sys.time()
> cat("elapsed time = ", endTime - startTime, "\n")
elapsed time = 24.12598
> startTime <- Sys.time()
> second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
> endTime <- Sys.time()
> cat("elapsed time = ", endTime - startTime, "\n")
This appears to never finish. I have been waiting over 20 minutes.
So why would (skip = 2459465, nrows = 5) take orders of magnitude
longer than (nrows = 2459465) ?
Thanks!
-dave
PS: readLines(n=2459470) takes 10.42731 seconds.
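(One pattern that avoids re-scanning from the top is to consume the
skipped lines once through an open connection and let read.csv continue
from there; a sketch, untested on these data, with the line counts
taken from above:

con <- file(file_name, "r")
invisible(readLines(con, n = 2459466))  # header + the 2459465 good records
second_records <- read.csv(con, nrows = 5, header = FALSE)
close(con)

Column names are lost here; they can be copied over from the earlier
read if needed.)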
Hello,
Can the following function be of help?
After reading the data with argument quote = "" (so quotes are not
treated specially), call a function that applies gregexpr to the
character columns and turns the output into a two-column data.frame
with columns
Col - the column processed;
Unbalanced - the rows with unbalanced double quotes.
I am assuming the quotes are double quotes. It shouldn't be difficult
to adapt it to other cases: single quotes, or both.
unbalanced_dquotes <- function(x) {
  # indices of the character columns
  char_cols <- sapply(x, is.character) |> which()
  lapply(char_cols, \(i) {
    y <- x[[i]]
    # count the double quotes in each string; gregexpr returns -1 when
    # there is no match, so map that case to a count of zero, then keep
    # the rows where the count is odd
    Unbalanced <- gregexpr('"', y) |>
      sapply(\(m) if (m[1] == -1L) 0L else length(m)) |>
      {\(n) (n %% 2L) == 1L}() |>
      which()
    data.frame(Col = i, Unbalanced = Unbalanced)
  }) |>
    do.call(rbind, args = _)
}
# read the data, disregarding quoted strings
df1 <- read.csv(fl, quote = "")
# determine which strings have unbalanced quotes, and where
unbalanced_dquotes(df1)
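For illustration, a quick self-contained check with made-up data (not
the original file):

df0 <- data.frame(
  a = c('ok', 'stray " quote', 'also ok'),
  b = c('"balanced"', 'fine', 'one " two " three "')
)
unbalanced_dquotes(df0)
# flags row 2 of column 1 and row 3 of column 2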
Hope this helps,
Rui Barradas
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.