Re: [I] [R] Enable `col_select` or similar in `open_csv_dataset` to read files with a shared subset of columns [arrow]

via GitHub Sun, 08 Oct 2023 16:56:11 -0700


orgadish commented on issue #38031:
URL: https://github.com/apache/arrow/issues/38031#issuecomment-1752197296


   @thisisnic I don't know if this was updated in a recent Arrow version, but 
it looks like what I want works now!
   
   Below is a reprex. `readr::read_csv(col_select = ...)` actually does _not_ 
work on these files, so I'm glad `open_csv_dataset()` does!
   
   Closing this issue.
   
   ``` r
   suppressPackageStartupMessages(library(arrow))
   suppressPackageStartupMessages(library(dplyr))
   
   mtcars_part_1 <- mtcars |> 
     filter(am == 0) |> 
     select(mpg, cyl, disp)
   
   mtcars_part_2 <- mtcars |> 
     filter(am == 1) |> 
     select(mpg, cyl, hp)
   
   tf <- tempfile()
   dir.create(tf)
   tf1 <- tempfile(tmpdir=tf)
   dir.create(tf1)
   tf2 <- tempfile(tmpdir=tf)
   dir.create(tf2)
   write_csv_arrow(mtcars_part_1, file.path(tf1, "mtcars_subset.csv"))
   write_csv_arrow(mtcars_part_2, file.path(tf2, "mtcars_subset.csv"))
   csv_files <- list.files(tf, full.names = TRUE, recursive=TRUE)
   basename(csv_files)
   #> [1] "mtcars_subset.csv" "mtcars_subset.csv"
   
   columns_i_care_about <- c("mpg", "cyl")
   
   # This used to fail, but it seems to be working now...
   open_csv_dataset(csv_files, unify_schemas = TRUE) |> 
     collect()
   #> # A tibble: 32 × 4
   #>      mpg   cyl  disp    hp
   #>    <dbl> <int> <dbl> <int>
   #>  1  21.4     6  258     NA
   #>  2  18.7     8  360     NA
   #>  3  18.1     6  225     NA
   #>  4  14.3     8  360     NA
   #>  5  24.4     4  147.    NA
   #>  6  22.8     4  141.    NA
   #>  7  19.2     6  168.    NA
   #>  8  17.8     6  168.    NA
   #>  9  16.4     8  276.    NA
   #> 10  17.3     8  276.    NA
   #> # ℹ 22 more rows
   open_csv_dataset(csv_files, unify_schemas = TRUE) |> 
     select(!!columns_i_care_about) |> 
     collect()
   #> # A tibble: 32 × 2
   #>      mpg   cyl
   #>    <dbl> <int>
   #>  1  21       6
   #>  2  21       6
   #>  3  22.8     4
   #>  4  32.4     4
   #>  5  30.4     4
   #>  6  33.9     4
   #>  7  27.3     4
   #>  8  26       4
   #>  9  30.4     4
   #> 10  15.8     8
   #> # ℹ 22 more rows
   
   # read_csv(col_select = ) actually doesn't work...
   readr::read_csv(csv_files)
   #> Error: Files must have consistent column names:
   #> * File 1 column 3 is: disp
   #> * File 2 column 3 is: hp
   readr::read_csv(csv_files, col_select = !!columns_i_care_about)
   #> Error: Files must have consistent column names:
   #> * File 1 column 3 is: disp
   #> * File 2 column 3 is: hp
   ```
   
   <sup>Created on 2023-10-08 with [reprex 
v2.0.2](https://reprex.tidyverse.org)</sup>
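
For anyone who needs to stay with `readr`, a possible workaround (a minimal sketch, not something from the reprex above) is to read each file on its own, so the cross-file column-name check never runs, and then stack the results. `read_shared_columns()` is a hypothetical helper name, not a `readr` or `arrow` function, and the sketch assumes R >= 4.1 with `readr` and `dplyr` installed.

``` r
suppressPackageStartupMessages(library(dplyr))

# Hypothetical helper (not part of readr or arrow): read every file separately,
# keeping only the requested columns, then row-bind the pieces.
read_shared_columns <- function(files, columns) {
  files |>
    lapply(\(f) readr::read_csv(f, col_select = all_of(columns), show_col_types = FALSE)) |>
    bind_rows()
}

# Two CSV files that share mpg and cyl but differ in their third column
tf <- tempfile()
dir.create(tf)
readr::write_csv(mtcars |> filter(am == 0) |> select(mpg, cyl, disp), file.path(tf, "part_1.csv"))
readr::write_csv(mtcars |> filter(am == 1) |> select(mpg, cyl, hp), file.path(tf, "part_2.csv"))
csv_files <- list.files(tf, full.names = TRUE)

shared <- read_shared_columns(csv_files, c("mpg", "cyl"))
dim(shared)
```

Unlike `open_csv_dataset()`, this reads every file eagerly into memory, so it is probably only reasonable when the files are small.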


