Re: [I] [R] open_dataset() behavior with incorrectly quoted input data [arrow]

via GitHub Tue, 10 Oct 2023 14:03:11 -0700


amoeba commented on issue #37908:
URL: https://github.com/apache/arrow/issues/37908#issuecomment-1756253850


   Hi @angela-li, when I've run into situations like yours in the past, I've 
resorted to adding a cleanup step in between the raw data and the less flexible 
system (in this case, arrow) in order to get the raw data in a form that can be 
read without issues. I can imagine this might not be practical for your use 
case.
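   
   As a rough illustration of such a cleanup step, here is a minimal sketch in base R. The helper name `strip_quotes` and the sample data are hypothetical, and the sketch assumes double quotes carry no meaning in the problem file, which may not hold for your data.
   
   ```r
   # Hypothetical helper: copy a delimited text file, dropping every double quote.
   # Assumes quotes carry no meaning in the problem file.
   strip_quotes <- function(infile, outfile) {
     lines <- readLines(infile, warn = FALSE)
     writeLines(gsub('"', "", lines, fixed = TRUE), outfile)
     invisible(outfile)
   }
   
   # Try it on a throwaway pipe-delimited file with an unbalanced quote
   raw <- tempfile(fileext = ".txt")
   clean <- tempfile(fileext = ".txt")
   writeLines(c("id|name", "1|\"alpha", "2|beta"), raw)
   strip_quotes(raw, clean)
   cleaned_lines <- readLines(clean)
   ```
   
   The cleaned copy can then be opened with the default parse options, leaving the raw file untouched.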
   
   This comment got me thinking,
   
   > Changing this option is also not good for the rest of the data, where I do 
want the quote_char to be ".
   
   One other thing you might try that arrow can do right now would be to make 
use of arrow's UnionDataset functionality. As described above, you essentially 
need to parse some files with one set of rules and other files with another. 
`open_dataset` can actually open other Datasets so you could do something like,
   
   ```r
   my_ds <- open_dataset(
     list(
       open_dataset("good_file.txt", format = "text"),
       open_dataset(
         "bad_file.txt",
         format = "text",
         parse_options = CsvParseOptions$create(...)
       )
     )
   ) # <- this is a UnionDataset
   ```
   
   From here you can work with `my_ds` normally.
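   
   To make that concrete, here is a small self-contained sketch. The file contents and the `id` and `name` columns are made up for illustration, and `quoting = FALSE` is just one assumed way to relax the quote handling for the problem file; it also assumes dplyr is installed.
   
   ```r
   library(arrow)
   library(dplyr)
   
   # Two throwaway pipe-delimited files; the second has an unbalanced quote
   good <- tempfile(fileext = ".txt")
   bad <- tempfile(fileext = ".txt")
   writeLines(c("id|name", "1|alpha", "2|beta"), good)
   writeLines(c("id|name", "3|\"gamma", "4|delta"), bad)
   
   # Each file gets its own parse options, then both are combined
   my_ds <- open_dataset(
     list(
       open_dataset(
         good,
         format = "text",
         parse_options = CsvParseOptions$create(delimiter = "|")
       ),
       open_dataset(
         bad,
         format = "text",
         parse_options = CsvParseOptions$create(delimiter = "|", quoting = FALSE)
       )
     )
   )
   
   # Query the combined dataset like any other Dataset
   result <- my_ds |>
     filter(id > 1) |>
     collect()
   ```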
   
   This problem also reminds me of lubridate and the `orders` argument of 
[`lubridate::parse_date_time`](https://lubridate.tidyverse.org/reference/parse_date_time.html), 
which takes several candidate formats to try. One limitation of the above 
approach is that it requires you to know which files are problematic and which 
are not. So one idea is to create a list of `CsvParseOptions` objects and open 
each file inside a `tryCatch`, trying each set of options in turn until one 
works. I've included a hacky example below.
   
   <details>
   <summary>flexible_open_dataset.R</summary>
   
   ```r
   library(arrow)
   
   # First create a set of CsvParseOptions to try. Order matters.
   default_parse_options <- CsvParseOptions$create(delimiter = "|")
   quirk_parse_options <- CsvParseOptions$create(delimiter = "|", quote_char = '')
   my_parse_options <- list(default_parse_options, quirk_parse_options)
   
   # Then we define two helper functions that attempt to call open_dataset
   # until one succeeds
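   flexible_open_dataset_single <- function(file, parse_options) {
     for (parse_option in parse_options) {
       ds <- tryCatch(
         open_dataset(file, format = "text", parse_options = parse_option),
         error = function(e) {
           warning(
             "Failed to parse ", file,
             " with the provided CsvParseOptions. Trying any remaining options...",
             call. = FALSE
           )
           NULL
         }
       )
   
       # Return the first Dataset that opens successfully
       if (!is.null(ds)) {
         return(ds)
       }
     }
   
     # Fail loudly instead of returning NULL when no option worked
     stop("Could not open ", file, " with any of the provided parse options.", call. = FALSE)
   }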
   
   flexible_open_dataset <- function(files, parse_options) {
     datasets <- lapply(files, flexible_open_dataset_single, parse_options = parse_options)
     open_dataset(datasets)
   }
   
   # Then, finally, we use our new helper. This should print a warning but
   # otherwise work.
   my_ds <- flexible_open_dataset(
     c("test_data.txt", "test_data_good.txt"),
     my_parse_options
   )
   ```
   </details>
   
   If we wanted to provide something like this in arrow, one way would be to 
allow `parse_options` to take multiple values and use a similar mechanism 
internally to try each.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
