[
https://issues.apache.org/jira/browse/ARROW-15088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Nicola Crane updated ARROW-15088:
---------------------------------
Issue Type: Improvement (was: Bug)
> [R] Support for csv options on open_dataset
> -------------------------------------------
>
> Key: ARROW-15088
> URL: https://issues.apache.org/jira/browse/ARROW-15088
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Affects Versions: 6.0.2
> Reporter: Carl Boettiger
> Priority: Major
>
> There are a lot of gotchas caused by inconsistencies in arrow's support for
> csv parsing options between read_csv_arrow() and open_dataset() (and further
> issues arising from migrating from readr::read_csv()). Not sure if it's more
> helpful to report these in one place or as separate issues, but here are a few
> that keep tripping me up:
>
> * "na" (defining the strings to treat as NA) is not implemented on
> open_dataset(), though it is on read_csv_arrow()
> * somewhat confusingly, open_dataset() does support `null_strings`,
> which appears to play the same role. The docs, however, suggest that
> `open_dataset()` `...` options are passed to `dataset_factory()`. I think
> those docs should link to
> [https://arrow.apache.org/docs/r/reference/CsvReadOptions.html] .
> [https://arrow.apache.org/docs/r/reference/FileFormat.html] suggests that
> `null_strings` is not one of the recognized CsvReadOptions, but it seems that
> it now is. I appreciate the challenge of supporting both the readr-like
> options and the native arrow option names here, but the functionality and
> documentation remain very confusing!
> Also another gotcha: in the arrow 6.0 release, if we supply an arrow schema,
> open_dataset() assumes the first line of the csv is data and not column
> headers, so we have to use skip=1. I see the logic (the schema names the
> columns anyway, so if we're going with those names, why parse the names
> from the csv?), but it's surprising, since when reading without the schema we
> do not use skip=1, and it's natural to want to declare column types while
> preserving csv column names. The error messages on doing so aren't helpful,
> since if you forget skip=1, you are just told that any column that is not a
> string is "the incorrect type". The open_dataset() docs imply that we can
> use read_csv_arrow() options, which suggests that we could provide types using
> col_types instead of schema, but this appears not to be the case. Also
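
For reference, the schema gotcha described above can be reproduced with a minimal sketch like the one below. The file name, column names, and data are hypothetical and chosen only for illustration. The behavior shown is the one reported for arrow 6.0.x and may differ in other releases.

```r
library(arrow)
library(dplyr)

# Hypothetical two-column csv with a header row, written to a temp directory.
csv_dir <- tempfile("csv_ds_")
dir.create(csv_dir)
write.csv(
  data.frame(id = 1:3, value = c(1.5, 2.5, 3.5)),
  file.path(csv_dir, "part-0.csv"),
  row.names = FALSE
)

# Explicit schema: the column names come from here, not from the file.
sch <- schema(id = int32(), value = float64())

# skip = 1 drops the header line, which would otherwise be parsed as data
# and fail to convert for the non-string columns.
ds <- open_dataset(csv_dir, format = "csv", schema = sch, skip = 1)
result <- collect(ds)
```

Leaving out `skip = 1` in this sketch is what is reported above to produce the unhelpful "incorrect type" style of error, because the header strings cannot be converted to the declared numeric types.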