amoeba commented on issue #37908:
URL: https://github.com/apache/arrow/issues/37908#issuecomment-1756253850
Hi @angela-li, when I've run into situations like yours in the past, I've
resorted to adding a cleanup step in between the raw data and the less flexible
system (in this case, arrow) in order to get the raw data in a form that can be
read without issues. I can imagine this might not be practical for your use
case.
This comment got me thinking,
> Changing this option is also not good for the rest of the data, where I do
> want the quote_char to be ".
One other thing you might try that arrow can do right now would be to make
use of arrow's UnionDataset functionality. As described above, you essentially
need to parse some files with one set of rules and other files with another.
`open_dataset` can actually open other Datasets so you could do something like,
```r
my_ds <- open_dataset(
  list(
    open_dataset("good_file.txt", format = "text"),
    open_dataset(
      "bad_file.txt",
      format = "text",
      parse_options = CsvParseOptions$create(...)
    )
  )
) # <- this is a UnionDataset
```
From here you can work with `my_ds` normally.
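To make that concrete, here is a minimal, self-contained sketch of the idea. The scratch files, their contents, the column names, and the choice of `quoting = FALSE` for the quirky file are all assumptions made up for illustration; they are not the data from this issue, and dplyr is assumed to be installed.

```r
library(arrow)
library(dplyr)

# Hypothetical scratch files standing in for the real data.
tmp <- tempfile("union_demo_")
dir.create(tmp)
good_file <- file.path(tmp, "good_file.txt")
bad_file <- file.path(tmp, "bad_file.txt")
writeLines(c("id|name", "1|alpha", "2|beta"), good_file)
# The first data row of this file carries a stray double quote.
writeLines(c("id|name", "3|5\" pipe", "4|delta"), bad_file)

# Each file gets its own parse rules.
good_ds <- open_dataset(
  good_file,
  format = "text",
  parse_options = CsvParseOptions$create(delimiter = "|")
)
bad_ds <- open_dataset(
  bad_file,
  format = "text",
  # Assumption: turning quoting off is the right fix for this made-up file.
  parse_options = CsvParseOptions$create(delimiter = "|", quoting = FALSE)
)

# Passing a list of Datasets to open_dataset() combines them.
union_ds <- open_dataset(list(good_ds, bad_ds))

# Query the combined dataset with dplyr verbs, then pull it into R.
result <- union_ds %>%
  arrange(id) %>%
  collect()
```

Here `result` holds the rows from both files, with the stray quote kept as a literal character in the `name` column.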
This problem also reminds me of lubridate and its `orders` argument in
[`lubridate::parse_date_time`](https://lubridate.tidyverse.org/reference/parse_date_time.html).
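For anyone unfamiliar with it, a small sketch of how `orders` lets one call cope with inputs written in different layouts. The sample strings are made up for illustration.

```r
library(lubridate)

# Two dates written in different layouts (made-up sample values).
raw_dates <- c("2023-10-10", "12/31/2023")

# `orders` lists the candidate layouts; lubridate works out which
# candidate applies to each element.
parsed <- parse_date_time(raw_dates, orders = c("ymd", "mdy"))
format(parsed, "%Y-%m-%d")
```

The appeal is that the caller supplies several candidate rules up front instead of sorting the inputs by layout beforehand.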
One limitation of the UnionDataset approach is that it requires you to know which
files are problematic and which are not. So an idea would be to create a list
of `CsvParseOptions` objects and, for each file, try each option in turn inside
a `tryCatch` until one succeeds. I've included a hacky example below.
<details>
<summary>flexible_open_dataset.R</summary>
```r
library(arrow)

# First create a set of CsvParseOptions to try. Order matters.
default_parse_options <- CsvParseOptions$create(delimiter = "|")
quirk_parse_options <- CsvParseOptions$create(delimiter = "|", quote_char = '')
my_parse_options <- list(default_parse_options, quirk_parse_options)

# Then we define two helper functions that attempt to call open_dataset
# until one succeeds
flexible_open_dataset_single <- function(file, parse_options) {
  ds <- NULL
  for (parse_option in parse_options) {
    ds <- tryCatch(
      open_dataset(file, format = "text", parse_options = parse_option),
      error = function(e) {
        warning(
          "Failed to parse ", file,
          " with provided ParseOption. Trying any remaining options..."
        )
        NULL
      }
    )
    if (!is.null(ds)) {
      break
    }
  }
  # Fail loudly if none of the options worked
  if (is.null(ds)) {
    stop("Could not open ", file, " with any of the provided parse options")
  }
  ds
}

flexible_open_dataset <- function(files, parse_options) {
  open_dataset(lapply(files, function(f) {
    flexible_open_dataset_single(f, parse_options)
  }))
}

# Then, finally, we use our new helper and this should print a warning but
# otherwise work
my_ds <- flexible_open_dataset(
  c("test_data.txt", "test_data_good.txt"),
  my_parse_options
)
```
</details>
If we wanted to provide something like this in arrow, one way would be to
allow `parse_options` to take multiple values and use a similar mechanism
internally to try each.