orgadish commented on issue #38031:
URL: https://github.com/apache/arrow/issues/38031#issuecomment-1752197296
@thisisnic I don't know if this was updated in a recent Arrow version, but
it looks like what I want works now!
Below is a reprex for it. `readr::read_csv(col_select = ...)` actually does
_not_ work when the files have different columns, so I'm glad
`open_csv_dataset()` does!
Closing this issue.
``` r
suppressPackageStartupMessages(library(arrow))
suppressPackageStartupMessages(library(dplyr))
mtcars_part_1 <- mtcars |>
  filter(am == 0) |>
  select(mpg, cyl, disp)
mtcars_part_2 <- mtcars |>
  filter(am == 1) |>
  select(mpg, cyl, hp)
tf <- tempfile()
dir.create(tf)
tf1 <- tempfile(tmpdir = tf)
dir.create(tf1)
tf2 <- tempfile(tmpdir = tf)
dir.create(tf2)
write_csv_arrow(mtcars_part_1, file.path(tf1, "mtcars_subset.csv"))
write_csv_arrow(mtcars_part_2, file.path(tf2, "mtcars_subset.csv"))
csv_files <- list.files(tf, full.names = TRUE, recursive = TRUE)
basename(csv_files)
#> [1] "mtcars_subset.csv" "mtcars_subset.csv"
columns_i_care_about <- c("mpg", "cyl")
# This used to fail, but it seems to be working now...
open_csv_dataset(csv_files, unify_schemas = TRUE) |>
  collect()
#> # A tibble: 32 × 4
#>      mpg   cyl  disp    hp
#>    <dbl> <int> <dbl> <int>
#>  1  21.4     6  258     NA
#>  2  18.7     8  360     NA
#>  3  18.1     6  225     NA
#>  4  14.3     8  360     NA
#>  5  24.4     4  147.    NA
#>  6  22.8     4  141.    NA
#>  7  19.2     6  168.    NA
#>  8  17.8     6  168.    NA
#>  9  16.4     8  276.    NA
#> 10  17.3     8  276.    NA
#> # ℹ 22 more rows
open_csv_dataset(csv_files, unify_schemas = TRUE) |>
  select(!!columns_i_care_about) |>
  collect()
#> # A tibble: 32 × 2
#>      mpg   cyl
#>    <dbl> <int>
#>  1  21       6
#>  2  21       6
#>  3  22.8     4
#>  4  32.4     4
#>  5  30.4     4
#>  6  33.9     4
#>  7  27.3     4
#>  8  26       4
#>  9  30.4     4
#> 10  15.8     8
#> # ℹ 22 more rows
# readr::read_csv() fails on these files, with or without col_select...
readr::read_csv(csv_files)
#> Error: Files must have consistent column names:
#> * File 1 column 3 is: disp
#> * File 2 column 3 is: hp
readr::read_csv(csv_files, col_select = !!columns_i_care_about)
#> Error: Files must have consistent column names:
#> * File 1 column 3 is: disp
#> * File 2 column 3 is: hp
```
<sup>Created on 2023-10-08 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
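
For anyone who needs to stay with `readr`, here is a minimal sketch of one possible workaround, not something `readr` or `arrow` provides: read each file on its own with `col_select`, then stack the results with `dplyr::bind_rows()`. The helper name `read_csv_subset` and the file names `part_1.csv` / `part_2.csv` are made up for illustration, and it assumes every file contains all of the requested columns.

``` r
library(readr)
library(dplyr)

# Two CSV files that share only the mpg and cyl columns.
dir_path <- tempfile()
dir.create(dir_path)
mtcars |>
  filter(am == 0) |>
  select(mpg, cyl, disp) |>
  write_csv(file.path(dir_path, "part_1.csv"))
mtcars |>
  filter(am == 1) |>
  select(mpg, cyl, hp) |>
  write_csv(file.path(dir_path, "part_2.csv"))

csv_files <- list.files(dir_path, full.names = TRUE)
columns_i_care_about <- c("mpg", "cyl")

# Hypothetical helper: read every file separately, keeping only `cols`,
# so the per-file column mismatch never reaches read_csv().
read_csv_subset <- function(files, cols) {
  files |>
    lapply(\(f) read_csv(f, col_select = all_of(cols), show_col_types = FALSE)) |>
    bind_rows()
}

result <- read_csv_subset(csv_files, columns_i_care_about)
dim(result)
```

Because each file is read in full before the columns are stacked, this is probably only sensible for data that fits in memory; `open_csv_dataset()` remains the better fit for larger data.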