etiennebacher commented on issue #34965:
URL: https://github.com/apache/arrow/issues/34965#issuecomment-1503243275
Thank you for your answer, @thisisnic. The workaround you provided works in
this very simple case because there are only 2 columns, but I have tens or
hundreds of them in my scenario. I improved it a bit to detect the duplicated
names, repair them by adding a random suffix, and plug them back in:
``` r
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#>     timestamp
packageVersion("arrow")
#> [1] '11.0.0.3'
file_location <- tempfile(fileext = ".csv")
test <- data.frame(x = 1, x = 2, check.names = FALSE)
write.csv(test, file_location)
file <- read_csv_arrow(file_location, as_data_frame = FALSE)
# extract the schema from the table
my_schema <- file$schema
# we can see the duplicated names here
dupes <- which(duplicated(names(my_schema)))
for (i in dupes) {
  # get original variable name and add a random suffix (so that the new name
  # is not a duplicate of another one)
  orig <- names(my_schema)[i]
  set.seed(i)
  suffix <- paste(sample(letters, 8), collapse = "")
  new_var <- paste0(orig, "_", suffix)
  # get the variable type
  orig_field <- my_schema$fields[[i]]$type$code()
  # update the variable
  my_schema[[i]] <- field(new_var, eval(orig_field))
  cat(paste("Old variable name:", orig, "\nNew variable name:", new_var,
            "\n\n"))
}
#> Old variable name: x
#> New variable name: x_elgdhkvj
# open the dataset, specifying the new schema
# we have to include "skip" to skip the first row of the file
ds <- arrow::open_csv_dataset(file_location, schema = my_schema, skip = 1)
dplyr::collect(ds)
#> # A tibble: 1 × 3
#>      ``     x x_elgdhkvj
#>   <int> <int>      <int>
#> 1     1     1          2
```
(Note that I didn't check whether this works with more than 2 duplicated
names.)
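
For the case with more than 2 duplicates, a minimal sketch using base R's
`make.unique()` instead of random suffixes might look like this. The column
names below are made up for illustration, and I am assuming numeric suffixes
are acceptable in place of random ones:

``` r
# Hypothetical column names containing three copies of "x"
col_names <- c("id", "x", "x", "y", "x")

# make.unique() keeps the first occurrence as-is and appends a numeric
# suffix to each later occurrence of the same name
fixed <- make.unique(col_names, sep = "_")
fixed
#> [1] "id"  "x"   "x_1" "y"   "x_2"

# anyDuplicated() returns 0 when no duplicates are left
anyDuplicated(fixed)
#> [1] 0
```

The result is deterministic, so there is no need for `set.seed()`.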
Also, while this workaround is fast for small files, the initial
`read_csv_arrow()` call takes some time on larger ones. Nothing crazy, but
across dozens of files this can add up to a significant delay. Maybe there's a
faster way to do this?
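
One idea I have not benchmarked: the duplicated names can be found from the
header line alone, without reading the whole file. The sketch below uses only
base R and assumes a simple comma-separated header with no embedded newlines
in the column names:

``` r
# Same toy file as in the reprex above
file_location <- tempfile(fileext = ".csv")
test <- data.frame(x = 1, x = 2, check.names = FALSE)
write.csv(test, file_location)

# Read only the first line of the file and split it into column names;
# scan() strips the double quotes that write.csv() puts around each name
header_line <- readLines(file_location, n = 1)
raw_names <- scan(text = header_line, what = character(), sep = ",",
                  quiet = TRUE)

# Repair the duplicates without touching the data rows
new_names <- make.unique(raw_names, sep = "_")
```

The repaired names could then, in principle, be passed to the reader (for
example through `col_names` together with `skip = 1` in `read_csv_arrow()`),
though I have not checked that this avoids the problem or is actually faster.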