[jira] [Updated] (ARROW-15088) [R] Support for csv options on open_dataset

Nicola Crane (Jira) Wed, 26 Jan 2022 10:09:07 -0800


     [ 
https://issues.apache.org/jira/browse/ARROW-15088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Nicola Crane updated ARROW-15088:
---------------------------------
    Issue Type: Improvement  (was: Bug)

> [R] Support for csv options on open_dataset
> -------------------------------------------
>
>                 Key: ARROW-15088
>                 URL: https://issues.apache.org/jira/browse/ARROW-15088
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>    Affects Versions: 6.0.2
>            Reporter: Carl Boettiger
>            Priority: Major
>
> There are a lot of gotchas created by heterogeneity in arrow's support for 
> csv parsing options between read_csv_arrow() and open_dataset() (and further 
> issues arising from migrating from readr::read_csv()).  Not sure if it's more 
> helpful to report these in one place or as separate issues, but here are a few 
> that keep tripping me up:
>  
>  * "na" (defining the strings to treat as NA) is not implemented on 
> open_dataset(), though it is on read_csv_arrow()
>  * somewhat confusingly, open_dataset() does support `null_strings`, 
> which appears to play the same role.   The docs, however, suggest that 
> `open_dataset()` `...` options are passed to `dataset_factory()`.  I think 
> those docs should link to 
> [https://arrow.apache.org/docs/r/reference/CsvReadOptions.html] .  
> [https://arrow.apache.org/docs/r/reference/FileFormat.html] suggests that 
> `null_strings` is not one of the recognized CsvReadOptions, but it seems that 
> it now is.  I appreciate the challenge of supporting both the readr-like 
> options and the native arrow option names here, but the functionality and 
> documentation remain very confusing!
> Another gotcha: in the arrow 6.0 release, if we supply an arrow schema, 
> open_dataset() assumes the first line of the csv is data and not column 
> headers, so we have to use skip=1.  I see the logic (the schema names the 
> columns anyway, so if we're going with those names, why parse the names 
> from the csv), but it's surprising, since when reading without the schema we 
> do not use skip=1, and it's natural to want to declare column types while 
> preserving the csv column names.  The error messages on doing so aren't 
> helpful: if you forget skip=1, you are just told that any column that is not 
> a string is "the incorrect type".  The open_dataset() docs imply that we can 
> use read_csv_arrow() options, which suggests that we could provide types using 
> `col_types` instead of schema, but this appears not to be the case.  Also
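
For readers following along, the mismatches described above can be sketched
roughly as follows. This is only an illustration of the behaviour as reported
for arrow 6.0.x, not a statement of how any particular release behaves: the
file name, the column names, and the "-999" sentinel are made up for the
example, and the accepted option names (`na` vs. `null_strings`, `skip` vs.
`skip_rows`) may differ between arrow versions.

```r
library(arrow)

# Hypothetical sample data: "-999" stands in for a missing value.
dir <- tempfile("csv_ds_")
dir.create(dir)
path <- file.path(dir, "part-0.csv")
writeLines(c("x,y", "1,a", "-999,b"), path)

# Single-file reader: accepts the readr-style `na` argument.
df <- read_csv_arrow(path, na = c("", "-999"))

# Dataset reader: per the report, `na` is not accepted here, but the
# Arrow-native `null_strings` option appears to play the same role.
ds <- open_dataset(dir, format = "csv", null_strings = c("", "-999"))
tab <- as.data.frame(Scanner$create(ds)$ToTable())

# With an explicit schema, the report says the header row is treated as
# data, so it has to be skipped by hand.
ds_typed <- open_dataset(
  dir,
  format = "csv",
  schema = schema(x = int32(), y = utf8()),
  skip = 1
)
typed <- as.data.frame(Scanner$create(ds_typed)$ToTable())
```

Dropping `skip = 1` from the last call is the case the report describes as
producing the unhelpful "incorrect type" error for non-string columns.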



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
