Dave,

Your method works for you, and it seems to be a one-time fix of a corrupted data 
file, so please take what I write not as criticism but as an explanation of my 
alternative reasoning, which I suspect may run faster in some situations.

Here is my understanding of what you are doing:

You have a file in CSV format containing N rows, with commas separating M columns. 
A few rows have a glitch: a double-quote character at the beginning or end of a 
field (meaning immediately adjacent to a comma, or perhaps at the very beginning 
or end of the line of text) that messes things up. This may occur in one specific, 
known column or in several.

So your algorithm is to read the entire file in (or, alternatively, one line at a 
time). Note that the types of the columns may not be apparent to you when you 
start, since you are not letting read.csv() see what it needs to see or perform 
all its usual processing, such as dealing with comments.
You then call functions such as read.csv() millions of times (N times). Argh!

You do that by setting up an error-handling context N times to catch errors. Of 
course, most lines are fine and raise no error.

Only on the lines that error out do you apply a regular expression that checks for 
quotes not immediately adjacent to a comma. I am not sure exactly what pattern you 
used, although I imagine spaces could sometimes intervene. You fix any such lines 
and re-parse them.

It seems your goal was to write out a corrected file, so you do that by appending 
to it one row/line at a time.
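
If I follow correctly, a rough sketch of that per-line loop might look like the 
following (the file names, the regular expression, and the fix are my guesses at 
what you described, not your actual code):

lines <- readLines("bigfile.csv")        # hypothetical input file
header <- lines[1]
out <- "bigfile_fixed.csv"               # hypothetical output file
for (line in lines[-1]) {
  row <- tryCatch(
    read.csv(text = paste(header, line, sep = "\n")),
    error = function(e) {
      # double any quote not adjacent to a comma, then re-parse
      fixed <- gsub('([^,])"([^,])', '\\1""\\2', line)
      read.csv(text = paste(header, fixed, sep = "\n"))
    }
  )
  # append the parsed row to the output as we go
  write.table(row, out, sep = ",", append = TRUE,
              col.names = FALSE, row.names = FALSE)
}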

My strategy was a bit different.

- Call read.csv() just once, with no error checking but with the option to not 
treat a quote specially (quote = ""). Note that if the quoted regions may contain 
commas, this is a bad strategy. If all they contain is spaces or other non-comma 
text, it may be fine.

If that works, there is now a data.frame (or similar data structure) in memory 
with N rows and M columns.

- Pick only the columns that may have this issue, meaning the ones containing 
text as opposed to numbers or logical values.
- Taking those columns, perhaps one at a time, evaluate all their entries at once 
against a regular expression that tests each entry for the presence of exactly one 
quote, at either the start or the end (the commas you used as anchors are not 
present in this version). So you are looking for something like the examples below 
(see the code sketch that follows them):

"words perhaps including, commas
or
words perhaps including, commas"

but not for:

words perhaps including, commas
"words perhaps including, commas"

You can save the result of that query as a Boolean vector of TRUE/FALSE, as one 
method, to mark which rows need fixing. Or you might use ifelse() (or the 
equivalent) to selectively apply a fix to those rows. One method is to use 
something like sub() to match all the text except the initial or terminal quote 
and replace it with a quote, followed by the match, followed by a quote, whenever 
a stray quote was found.
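
Continuing the sketch above (text_col and the bad vector come from the previous 
example and are assumptions about your data), one way to apply the fix only where 
it is needed:

# remove the single stray leading or trailing quote, then wrap the whole
# entry in a balanced pair of quotes; other rows pass through unchanged
df$text_col <- ifelse(bad,
                      paste0('"', sub('^"|"$', '', df$text_col), '"'),
                      df$text_col)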

Whatever you choose can be done in a vectorized manner, which may be more 
efficient. You do not need to check for failures, let alone N times, and you only 
need to process the columns that need it.

When done, you may want to make sure all the columns are of the type you want, 
since who knows whether read.csv() made a bad choice on those columns, or on 
others.
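
For example, one hedged way to do that final check is to let R re-guess the 
column types after the repairs (assuming df is the data.frame from the sketches 
above):

# re-derive column types now that the stray quotes are fixed;
# as.is = TRUE keeps character columns as character rather than factor
df <- type.convert(df, as.is = TRUE)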

Note again, this is only a suggestion, and it fails if commas can appear inside 
the quoted parts, or even the misquoted parts.

-----Original Message-----
From: R-help <r-help-boun...@r-project.org> On Behalf Of Dave Dixon
Sent: Wednesday, April 10, 2024 12:20 PM
To: Rui Barradas <ruipbarra...@sapo.pt>; r-help@r-project.org
Subject: Re: [R] Exceptional slowness with read.csv

That's basically what I did:

1. Get text lines using readLines
2. use tryCatch to parse each line using read.csv(text=...)
3. in the catch, use gregexpr to find any quotes not adjacent to a comma 
(gregexpr("[^,]\"[^,]", ...))
4. escape any quotes found by adding a second quote (using str_sub from 
stringr)
5. parse the patched text using read.csv(text=...)
6. write out the parsed fields as I go along using write.table(..., 
append=TRUE) so I'm not keeping too much in memory.

I went directly to tryCatch because there were 3.5 million records, and 
I only expected a few to have errors.

I found only 6 bad records, but it had to be done to make the datafile 
usable with read.csv(), for the benefit of other researchers using these 
data.


On 4/10/24 07:46, Rui Barradas wrote:
> At 06:47 on 08/04/2024, Dave Dixon wrote:
>> Greetings,
>>
>> I have a csv file of 76 fields and about 4 million records. I know 
>> that some of the records have errors - unmatched quotes, 
>> specifically. Reading the file with readLines and parsing the lines 
>> with read.csv(text = ...) is really slow. I know that the first 
>> 2459465 records are good. So I try this:
>>
>>  > startTime <- Sys.time()
>>  > first_records <- read.csv(file_name, nrows = 2459465)
>>  > endTime <- Sys.time()
>>  > cat("elapsed time = ", endTime - startTime, "\n")
>>
>> elapsed time =   24.12598
>>
>>  > startTime <- Sys.time()
>>  > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
>>  > endTime <- Sys.time()
>>  > cat("elapsed time = ", endTime - startTime, "\n")
>>
>> This appears to never finish. I have been waiting over 20 minutes.
>>
>> So why would (skip = 2459465, nrows = 5) take orders of magnitude 
>> longer than (nrows = 2459465) ?
>>
>> Thanks!
>>
>> -dave
>>
>> PS: readLines(n=2459470) takes 10.42731 seconds.
>>
> Hello,
>
> Can the following function be of help?
> After reading the data with the argument quote = "", call a function 
> applying gregexpr to its character columns, then transforming the 
> output into a two-column data.frame with columns
>
>  Col - the column processed;
>  Unbalanced - the rows with unbalanced double quotes.
>
> I am assuming the quotes are double quotes. It shouldn't be difficult 
> to adapt it to other cases: single quotes, or both.
>
>
>
>
> unbalanced_dquotes <- function(x) {
>   char_cols <- sapply(x, is.character) |> which()
>   lapply(char_cols, \(i) {
>     y <- x[[i]]
>     Unbalanced <- gregexpr('"', y) |>
>       sapply(\(x) sum(x > 0)) |>  # count quotes; gregexpr gives -1 when none
>       {\(x) (x %% 2L) == 1L}() |>
>       which()
>     data.frame(Col = i, Unbalanced = Unbalanced)
>   }) |>
>   do.call(rbind, args = _)
> }
>
> # read the data disregarding quoted strings
> df1 <- read.csv(fl, quote = "")
> # determine which strings have unbalanced quotes and
> # where
> unbalanced_dquotes(df1)
>
>
> Hope this helps,
>
> Rui Barradas
>
>

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
