[GitHub] [arrow] thisisnic commented on issue #34965: [R] Add an argument to `open_csv_dataset()` to repair duplicated column names or ignore them?

via GitHub Tue, 11 Apr 2023 03:44:25 -0700


thisisnic commented on issue #34965:
URL: https://github.com/apache/arrow/issues/34965#issuecomment-1503095194


   Thanks for reporting this @etiennebacher!  I can confirm that this is 
reproducible on the dev version of Arrow.
   You're not missing anything obvious; as far as I know, Arrow Dataset objects 
don't allow duplicated column names.  The error message isn't very helpful, so 
we should probably improve it and/or add code that repairs the names.
   
   As a temporary workaround, you could manually supply a schema with corrected 
column names when opening the dataset.  I've added a brief example below; let 
me know if this works for your specific case.  If it's still tricky, there are 
other workarounds we can try.
   
   ``` r
   library(arrow)
   
   file_location <- tempfile(fileext = ".csv")
   
   test <- data.frame(x = 1, x = 2, check.names = FALSE)
   
   write.csv(test, file_location, row.names = FALSE)
   
   # works fine with readr
   readr::read_csv(file_location)
   #> New names:
   #> • `x` -> `x...1`
   #> • `x` -> `x...2`
   #> Rows: 1 Columns: 2
   #> ── Column specification ────────────────────────────────────────────────────────
   #> Delimiter: ","
   #> dbl (2): x...1, x...2
   #> 
   #> ℹ Use `spec()` to retrieve the full column specification for this data.
   #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
   #> # A tibble: 1 × 2
   #>   x...1 x...2
   #>   <dbl> <dbl>
   #> 1     1     2
   
   # read in the file as an Arrow Table
   # (named `tbl` rather than `file` to avoid masking base::file())
   tbl <- read_csv_arrow(file_location, as_data_frame = FALSE)
   
   # extract the schema from the table
   my_schema <- tbl$schema
   
   # we can see the duplicated names here
   my_schema
   #> Schema
   #> x: int64
   #> x: int64
   
   # update the second field in the schema to be called "y" instead
   my_schema[[2]] <- field("y", int64())
   
   # open the dataset, specifying the new schema
   # because the schema now supplies the column names, the header row would be
   # read as data, so we have to include "skip" to skip the first row of the file
   ds <- arrow::open_csv_dataset(file_location, schema = my_schema, skip = 1)
   dplyr::collect(ds)
   #> # A tibble: 1 × 2
   #>       x     y
   #>   <int> <int>
   #> 1     1     2
   ```
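
   If the file has more than a couple of duplicated names, renaming fields one 
by one gets tedious.  The same idea can be generalised; this is only a rough 
sketch, not something arrow does for you.  It assumes that base R's 
`make.unique()` naming (`x`, `x_1`, ...) is acceptable, and the names `tbl`, 
`old_schema`, `new_names`, `types` and `new_schema` are just for illustration.

   ```r
   library(arrow)

   # same toy file as above: two columns that are both called "x"
   file_location <- tempfile(fileext = ".csv")
   test <- data.frame(x = 1, x = 2, check.names = FALSE)
   write.csv(test, file_location, row.names = FALSE)

   # read once as an Arrow Table just to get the inferred schema
   tbl <- read_csv_arrow(file_location, as_data_frame = FALSE)
   old_schema <- tbl$schema

   # de-duplicate the names with base R; "x", "x" becomes "x", "x_1"
   new_names <- make.unique(old_schema$names, sep = "_")

   # keep each column's inferred type, but pair it with its repaired name
   types <- lapply(old_schema$fields, function(f) f$type)
   new_schema <- do.call(arrow::schema, setNames(types, new_names))

   # as before, skip the header row because the schema supplies the names
   ds <- open_csv_dataset(file_location, schema = new_schema, skip = 1)
   dplyr::collect(ds)
   ```

   Note that this reads the whole file into memory once just to infer the 
schema, so for large or multi-file datasets you would probably want to infer 
it from a single small file instead.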


