etiennebacher commented on issue #34965:
URL: https://github.com/apache/arrow/issues/34965#issuecomment-1503243275

   Thank you for your answer @thisisnic. The workaround you provided works in 
this very simple case because there are only 2 columns, but in my scenario I 
have tens or hundreds of them. I improved it a bit to detect the duplicated 
names, repair them by adding a random suffix, and plug them back in:
   
   ``` r
   library(arrow)
   #> 
   #> Attaching package: 'arrow'
   #> The following object is masked from 'package:utils':
   #> 
   #>     timestamp
   packageVersion("arrow")
   #> [1] '11.0.0.3'
   
   file_location <- tempfile(fileext = ".csv")
   
   test <- data.frame(x = 1, x = 2, check.names = FALSE)
   write.csv(test, file_location)
   
   file <- read_csv_arrow(file_location, as_data_frame = FALSE)
   
   # extract the schema from the table
   my_schema <- file$schema
   
   # find the positions of the duplicated names
   dupes <- which(duplicated(names(my_schema)))
   
   for (i in dupes) {
     
     # get original variable name and add a random suffix (so that the new name
     # is not a duplicate of another one)
     orig <- names(my_schema)[i]
     set.seed(i)
     suffix <- paste(sample(letters, 8), collapse = "")
     
     new_var <- paste0(orig, "_", suffix)
     
     # get the variable type
     orig_field <- my_schema$fields[[i]]$type$code()
     
     # update the variable
     my_schema[[i]] <- field(new_var, eval(orig_field))
     
     cat(paste("Old variable name:", orig, "\nNew variable name:", new_var, "\n\n"))
     
   }
   #> Old variable name: x 
   #> New variable name: x_elgdhkvj
   
   # open the dataset, specifying the new schema
   # we need skip = 1 to skip the header row of the file
   ds <- arrow::open_csv_dataset(file_location, schema = my_schema, skip = 1)
   dplyr::collect(ds)
   #> # A tibble: 1 × 3
   #>      ``     x x_elgdhkvj
   #>   <int> <int>      <int>
   #> 1     1     1          2
   ```
   
   (Note that I didn't check that this works with more than 2 duplicated 
names; a quick sketch of such a check is below.)
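   
   For what it's worth, here is a minimal sketch of how one could check that, 
using a three-way duplicate (the `test3`/`file3`/`sch3` names are just 
placeholders). It also suggests that passing the `DataType` object from 
`$fields[[i]]$type` directly to `field()` may avoid the `$code()`/`eval()` 
round trip, though I haven't verified that on older versions:
   
   ``` r
   # same pipeline as above, just with three columns sharing a name
   test3 <- data.frame(x = 1, x = 2, x = 3, check.names = FALSE)
   file3 <- tempfile(fileext = ".csv")
   write.csv(test3, file3)
   
   sch3 <- read_csv_arrow(file3, as_data_frame = FALSE)$schema
   
   for (i in which(duplicated(names(sch3)))) {
     set.seed(i)
     suffix <- paste(sample(letters, 8), collapse = "")
     # pass the DataType object directly instead of eval()-ing its code
     sch3[[i]] <- field(paste0(names(sch3)[i], "_", suffix), sch3$fields[[i]]$type)
   }
   
   # all names should now be distinct
   names(sch3)
   ```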
   
   Also, while this workaround is fast for small files, the initial 
`read_csv_arrow()` call takes some time. Nothing crazy, but across dozens of 
files this can pile up and lead to a significant delay. Maybe there's a faster 
way to do this?
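   
   One idea that might avoid the full `read_csv_arrow()` pass (just a sketch, 
assuming the header fits on a single line and that `open_csv_dataset()` 
accepts a readr-style `col_names` argument the same way it accepts `skip`): 
read only the header row, repair the names deterministically with 
`make.unique()`, and let the dataset scanner infer the types lazily:
   
   ``` r
   # read just the header row; scan() splits on commas and strips quotes
   raw_names <- scan(file_location, what = character(), sep = ",",
                     nlines = 1, quiet = TRUE)
   
   # deterministic repair of duplicated names (x, x -> x, x_1)
   new_names <- make.unique(raw_names, sep = "_")
   
   # skip the original header and supply the repaired names;
   # column types are only inferred when the dataset is scanned
   ds2 <- arrow::open_csv_dataset(file_location, col_names = new_names, skip = 1)
   dplyr::collect(ds2)
   ```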
   

