[GitHub] [arrow] etiennebacher commented on issue #34965: [R] Add an argument to `open_csv_dataset()` to repair duplicated column names or ignore them?

via GitHub Wed, 12 Apr 2023 05:33:23 -0700


etiennebacher commented on issue #34965:
URL: https://github.com/apache/arrow/issues/34965#issuecomment-1505196369


   Yes, the column names are useful. To be a bit more specific, this is 
census data that appears to have one or two duplicated columns. These 
duplicated columns are not always in the same position, so a hardcoded 
index won't help here. Using `autogenerate_column_names = TRUE` is not an 
option, since I would lose all information about the column names.
   
   I tried to make a small example with 200 variables and 1,000,000 rows, but 
I can't reproduce the time that `read_csv_arrow()` took in my "real" example.
   
   ``` r
   library(arrow)
   #> 
   #> Attaching package: 'arrow'
   #> The following object is masked from 'package:utils':
   #> 
   #>     timestamp
   library(tictoc)
   packageVersion("arrow")
   #> [1] '11.0.0.3'
   
   file_location <- tempfile(fileext = ".csv")
   
   # make a fake "big" dataset
   tmp <- list()
   for (i in 1:200) {
     set.seed(i)
     tmp[[paste0("var_", i)]] <- sample(1:100, 1e6, TRUE)
   }
   test <- list2DF(tmp)
   
   # make a duplicated column name
   names(test)[62] <- "var_1"
   
   readr::write_csv(test, file_location)
   
   tictoc::tic()
   file <- read_csv_arrow(file_location, as_data_frame = FALSE)
   tictoc::toc()
   #> 11.82 sec elapsed
   
   # extract the schema from the table
   my_schema <- file$schema
   
   # we can see the duplicated names here
   dupes <- which(duplicated(names(my_schema)))
   
   for (i in dupes) {
     
     # get original variable name and add a random suffix (so that the new name
     # is not a duplicate of another one)
     orig <- names(my_schema)[i]
     set.seed(i)
     suffix <- paste(sample(letters, 8), collapse = "")
     
     new_var <- paste0(orig, "_", suffix)
     
     # get the variable type
     orig_field <- my_schema$fields[[i]]$type$code()
     
     # update the variable
     my_schema[[i]] <- field(new_var, eval(orig_field))
     
     cat(paste("Old variable name:", orig, "\nNew variable name:", new_var, "\n\n"))
     
   }
   #> Old variable name: var_1 
   #> New variable name: var_1_vkeoxuwd
   
   # open the dataset, specifying the new schema
   # we have to include "skip" to skip the first row of the file
   ds <- arrow::open_csv_dataset(file_location, schema = my_schema, skip = 1)
   out <- dplyr::collect(ds)
   ```
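   
   As a possible alternative to the random suffixes above, here is a minimal 
sketch that repairs the names using only the header line, so the full file 
does not have to be read just to get the names. `repair_csv_header()` is a 
hypothetical helper (not part of arrow), and it assumes a simple header with 
no quoted fields that contain the separator. It relies on base R's 
`make.unique()`.
   
   ``` r
   # Hypothetical helper (not an arrow function): read only the first line
   # of the file and return the column names with duplicates made unique.
   repair_csv_header <- function(path, sep = ",") {
     header <- readLines(path, n = 1L)
     # Assumption: no quoted field in the header contains the separator
     nms <- strsplit(header, sep, fixed = TRUE)[[1]]
     # Later occurrences of a repeated name get "_1", "_2", ... appended
     make.unique(nms, sep = "_")
   }
   
   # tiny demonstration file with a duplicated "var_1" column
   demo_file <- tempfile(fileext = ".csv")
   writeLines(c("var_1,var_2,var_1", "1,2,3"), demo_file)
   
   new_names <- repair_csv_header(demo_file)
   print(new_names)
   ```
   
   The repaired names would still have to be combined with the column types 
(for example by building the fields of a schema, as in the loop above) before 
being passed to `open_csv_dataset()` together with `skip = 1`. I have not 
timed this on the real data.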
   
   This is not an urgent issue for me, but I think a way to automatically 
repair duplicated column names would be useful. Is it feasible to implement? 
(You proposed a better error message and a workaround, so I'd just like to 
clarify whether it could become a feature later.)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

