etiennebacher commented on issue #34965:
URL: https://github.com/apache/arrow/issues/34965#issuecomment-1505196369
Yes, the column names are useful. To be more specific, this is census data
that appears to have one or two duplicated columns. These duplicated columns
are not always in the same position, so a hardcoded index won't help here.
Using `autogenerate_column_names = TRUE` is not an option since I would lose
all information about column names.
I tried to make a small example with 200 variables and 1,000,000 rows, but I
can't reproduce the time that `read_csv_arrow()` took in my "real" example.
``` r
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
library(tictoc)
packageVersion("arrow")
#> [1] '11.0.0.3'
file_location <- tempfile(fileext = ".csv")
# make a fake "big" dataset
tmp <- list()
for (i in 1:200) {
  set.seed(i)
  tmp[[paste0("var_", i)]] <- sample(1:100, 1e6, TRUE)
}
test <- list2DF(tmp)
# make a duplicated column name
names(test)[62] <- "var_1"
readr::write_csv(test, file_location)
tictoc::tic()
file <- read_csv_arrow(file_location, as_data_frame = FALSE)
tictoc::toc()
#> 11.82 sec elapsed
# extract the schema from the table
my_schema <- file$schema
# we can see the duplicated names here
dupes <- which(duplicated(names(my_schema)))
for (i in dupes) {
  # get original variable name and add a random suffix (so that the new name
  # is not a duplicate of another one)
  orig <- names(my_schema)[i]
  set.seed(i)
  suffix <- paste(sample(letters, 8), collapse = "")
  new_var <- paste0(orig, "_", suffix)
  # get the variable type
  orig_field <- my_schema$fields[[i]]$type$code()
  # update the variable
  my_schema[[i]] <- field(new_var, eval(orig_field))
  cat(paste("Old variable name:", orig, "\nNew variable name:", new_var,
            "\n\n"))
}
#> Old variable name: var_1
#> New variable name: var_1_vkeoxuwd
# open the dataset, specifying the new schema
# we have to pass `skip = 1` to skip the header row, since the schema now
# supplies the column names
ds <- arrow::open_csv_dataset(file_location, schema = my_schema, skip = 1)
out <- dplyr::collect(ds)
```
This is not an urgent issue for me, but I think having a way to automatically
repair duplicated column names would be useful. Is it feasible to implement?
(You proposed a better error message and a workaround, so I'd just like to
clarify whether it could be a feature later.)
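
As a rough illustration of what such automatic repair could look like, here is
a minimal sketch in base R. It only operates on a character vector of names and
does not touch arrow at all; `repair_names()` and the `"_dup"` separator are
hypothetical choices made for this example, not an existing arrow feature.

``` r
# Hypothetical helper (not part of arrow): keep the first occurrence of each
# name as-is and append a numbered suffix to later duplicates.
repair_names <- function(nms, sep = "_dup") {
  make.unique(nms, sep = sep)
}

col_names <- c("var_1", "var_2", "var_1", "var_3", "var_2")
repaired <- repair_names(col_names)
repaired
#> [1] "var_1"      "var_2"      "var_1_dup1" "var_3"      "var_2_dup1"
```

Unlike the random suffix used in the workaround above, this gives deterministic
names that do not depend on the position of the duplicated column.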