thisisnic commented on issue #34965:
URL: https://github.com/apache/arrow/issues/34965#issuecomment-1503095194
Thanks for reporting this @etiennebacher! I can confirm that this is
reproducible on the dev version of Arrow.
You're not missing anything obvious; as far as I know, Arrow Dataset objects
don't allow duplicated column names. The error message isn't very helpful, so
we should probably improve it and/or add code that handles this case.
As a temporary workaround, you could manually supply a schema with corrected
column names when opening the dataset. I've added a brief example below; let me
know if this works for your specific case. If it's still tricky, there are other
workarounds we can try.
``` r
library(arrow)
file_location <- tempfile(fileext = ".csv")
test <- data.frame(x = 1, x = 2, check.names = FALSE)
write.csv(test, file_location, row.names = FALSE)
# works fine with readr
readr::read_csv(file_location)
#> New names:
#> • `x` -> `x...1`
#> • `x` -> `x...2`
#> Rows: 1 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (2): x...1, x...2
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 × 2
#>   x...1 x...2
#>   <dbl> <dbl>
#> 1     1     2
# read in the file as an Arrow Table
file <- read_csv_arrow(file_location, as_data_frame = FALSE)
# extract the schema from the table
my_schema <- file$schema
# we can see the duplicated names here
my_schema
#> Schema
#> x: int64
#> x: int64
# update the second field in the schema to be called "y" instead
my_schema[[2]] <- field("y", int64())
# open the dataset, specifying the new schema
# skip = 1 skips the header row, since the column names now come from the schema
ds <- arrow::open_csv_dataset(file_location, schema = my_schema, skip = 1)
dplyr::collect(ds)
#> # A tibble: 1 × 2
#>       x     y
#>   <int> <int>
#> 1     1     2
```
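
If a file has many duplicated names, renaming fields one by one gets tedious. Here is a minimal sketch of generating unique names with base R only. The helper `dedupe_names()` and the `_` separator are hypothetical choices for illustration, not part of arrow, and the sketch stops at producing the names rather than building the schema.

``` r
# Hypothetical helper (not part of arrow): make column names unique by
# appending a numeric suffix to repeats, using base R's make.unique()
dedupe_names <- function(nms, sep = "_") {
  make.unique(nms, sep = sep)
}

# Build a small CSV with a duplicated header, as in the example above
file_location <- tempfile(fileext = ".csv")
test <- data.frame(x = 1, x = 2, check.names = FALSE)
write.csv(test, file_location, row.names = FALSE)

# Read just enough of the file to get the header;
# check.names = FALSE keeps the duplicated names exactly as written
original_names <- names(read.csv(file_location, nrows = 1, check.names = FALSE))

# Generate replacement names with no duplicates
new_names <- dedupe_names(original_names)
new_names
#> [1] "x"   "x_1"
```

Each entry of `new_names` could then be assigned to the matching schema field, in the same way the second field is renamed in the example above.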