[ 
https://issues.apache.org/jira/browse/ARROW-15926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17506219#comment-17506219
 ] 

Dewey Dunnington commented on ARROW-15926:
------------------------------------------

Thank you for reporting! It is definitely a confusing error message that we 
need to fix. You're correct that the preferred idiom is 
{{csv_dataset %>% select(key)}} in this case, which should automatically read 
only the necessary columns.
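
As a minimal sketch of that idiom (the file contents and the variable names {{from_dataset}} and {{from_file}} are illustrative assumptions, not part of the original report), selecting on the dataset before {{collect()}} should give the same single column as the {{col_select}} argument of {{read_csv_arrow()}}:

```r
library(arrow)
library(dplyr)

# Hypothetical two-column CSV that mirrors the data in the reprex
tmpf <- tempfile(fileext = ".csv")
write.csv(
  data.frame(key = c("A", "B"), val1 = c("1", "2")),
  tmpf,
  row.names = FALSE
)

# Dataset route: select the column before collect()
from_dataset <- open_dataset(tmpf, format = "csv") %>%
  select(key) %>%
  collect()

# Single-file route: col_select on read_csv_arrow()
from_file <- read_csv_arrow(tmpf, col_select = key)

# Both results should contain only the `key` column
names(from_dataset)
names(from_file)
```

Neither call passes {{CsvConvertOptions}} directly, so this sidesteps the error in the reprex rather than fixing it.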

Pinging [~thisisnic], since they've done the most work getting 
{{read_csv_arrow()}} and {{open_dataset(format = "csv")}} to agree with 
each other.

Re-rendering your excellent reprex below to be Jira friendly:

{code:R}
library(tidyverse)
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

tmpf <- tempfile()

dat <- tribble(
  ~key, ~val1,
  "A", "1",
  "B", "2",
)

write_csv(dat, tmpf)


read_csv_arrow(
  tmpf,
  convert_options = CsvConvertOptions$create(
    include_columns = "key"
  ))
#> # A tibble: 2 × 1
#>   key  
#>   <chr>
#> 1 A    
#> 2 B

open_dataset(
  tmpf, format = "csv",
  convert_options = CsvConvertOptions$create(
    include_columns = "key"
  )) %>% collect()
#> Error in `handle_csv_read_error()` at r/R/dplyr-collect.R:33:6:
#> ! Invalid: Multiple matches for FieldRef.Name(key) in key:   [
#>     "A",
#>     "B"
#>   ]
#> key:   [
#>     "A",
#>     "B"
#>   ]
#> 
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/type.h:1727  CheckNonMultiple(matches, root)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/type.h:1759  FindOneOrNone(root)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/expression.cc:438  FieldRef(field->name()).GetOneOrNone(partial_batch)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/dataset/scanner.cc:865  compute::MakeExecBatch(*scan_options->dataset_schema, partial.record_batch.value)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:484  iterator_.Next()
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:336  ReadNext(&batch)
#> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:347  ReadAll(&batches)

open_dataset(tmpf, format = "csv") %>%
  select(key) %>%
  collect()
#> # A tibble: 2 × 1
#>   key  
#>   <chr>
#> 1 A    
#> 2 B
{code}


> [R] CsvConvertOptions include_columns bug in open_dataset vs. read_csv_arrow
> ----------------------------------------------------------------------------
>
>                 Key: ARROW-15926
>                 URL: https://issues.apache.org/jira/browse/ARROW-15926
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 7.0.0
>         Environment: Windows 10
>            Reporter: Jameel Alsalam
>            Priority: Minor
>
> I think there is a bug when reading a csv dataset where you don't want to 
> read in all columns. As shown below, the identical code works in 
> read_csv_arrow but errors in open_dataset. This can be worked around by 
> reading in all columns and then selecting afterwards, but I am not sure if 
> there is any performance advantage to omitting columns at the reading step.
>  
> ``` r
> library(tidyverse)
> library(arrow)
> #> 
> #> Attaching package: 'arrow'
> tmpf <- tempfile()
> dat <- tribble(
>   ~key, ~val1,
>   "A", "1",
>   "B", "2",
> )
> write_csv(dat, tmpf)
> # works in read_csv_arrow, errors in open_dataset:
> read_csv_arrow(
>   tmpf,
>   convert_options = CsvConvertOptions$create(
>     include_columns = "key"
>   ))
> #> # A tibble: 2 x 1
> #>   key  
> #>   <chr>
> #> 1 A    
> #> 2 B
> open_dataset(
>   tmpf, format = "csv",
>   convert_options = CsvConvertOptions$create(
>     include_columns = "key"
>   )) %>% collect()
> #> Error in `handle_csv_read_error()`:
> #> ! Invalid: Multiple matches for FieldRef.Name(key) in key:   [
> #>     "A",
> #>     "B"
> #>   ]
> #> key:   [
> #>     "A",
> #>     "B"
> #>   ]
> # Note that it does work to select after open_dataset, thus not a blocking issue:
> open_dataset(tmpf, format = "csv") %>%
>   select(key) %>%
>   collect()
> #> # A tibble: 2 x 1
> #>   key  
> #>   <chr>
> #> 1 A    
> #> 2 B
> ```
> <sup>Created on 2022-03-12 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>
>  
> I have tried this both with CRAN version 7 and the nightly version.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
