[
https://issues.apache.org/jira/browse/ARROW-15088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483901#comment-17483901
]
Carl Boettiger commented on ARROW-15088:
----------------------------------------
Sorry for the confusion. Here's a reproducible example of
"na is recognized in read_csv_arrow but not implemented on open_dataset".
Witness:
{code:java}
library(arrow)
library(readr)
csv <- readr_example("mtcars.csv")
read_csv_arrow(csv, na="-99") # Works
open_dataset(csv, na="-99") # Error:{code}
{code:java}
# Error in ParquetFragmentScanOptions$create(...) :
# unused argument (na = "-99") {code}
Re the docs, I'm specifically referring to the docstring for the `...` argument
of `open_dataset`, which links to dataset_factory, which says:
> Additional format-specific options, passed to {{{}FileFormat$create(){}}}.
> For CSV options, note that you can specify them either with the Arrow C++
> library naming ("delimiter", "quoting", etc.) or the {{{}readr{}}}-style
> naming used in
> {{[read_csv_arrow()|https://arrow.apache.org/docs/r/reference/read_delim_arrow.html]}}
> ("delim", "quote", etc.). Not all {{readr}} options are currently supported;
> please file an issue if you encounter one that {{arrow}} should support.
Re `null_strings`, I'm stumped there too; I still see `null_values` documented at
[https://arrow.apache.org/docs/r/reference/CsvReadOptions.html], but I
cannot use it with either open_dataset() or read_csv_arrow(). Can you point me
to a reproducible example that uses that option? Or am I just wrong in assuming
that the options documented at that link can be used in open_dataset() and
read_csv_arrow()?
> [R] Support for csv options on open_dataset
> -------------------------------------------
>
> Key: ARROW-15088
> URL: https://issues.apache.org/jira/browse/ARROW-15088
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Affects Versions: 6.0.2
> Reporter: Carl Boettiger
> Priority: Major
>
> There are a lot of gotchas arising from inconsistencies in arrow's support for
> csv parsing options between read_csv_arrow() and open_dataset() (and further
> issues arising from migrating from readr::read_csv()). Not sure if it's more
> helpful to report these in one place or as separate issues, but here are a few
> that keep tripping me up:
>
> * "na" (defining the na-character choices) is not implemented on
> open_dataset(), though it is on read_csv_arrow()
> * somewhat confusingly, open_dataset does support `null_strings` though,
> which appears to play the same role. The docs however suggest that
> `open_dataset()` `...` options are passed to `dataset_factory()`. I think
> those docs should link to
> [https://arrow.apache.org/docs/r/reference/CsvReadOptions.html] .
> [https://arrow.apache.org/docs/r/reference/FileFormat.html] suggests that
> `null_strings` is not one of the recognized CsvReadOptions, but it seems that
> it now is. I appreciate the challenge of supporting both the readr-like
> options and the native arrow option names here, but the functionality and
> documentation remain very confusing!
> Also, another gotcha: in the arrow 6.0 release, if we supply an arrow schema,
> open_dataset assumes the first line of the csv is data and not column
> headers, so we have to do skip=1. I see the logic (the schema names the
> columns anyway, so assuming we're going with those names why parse the names
> from the csv), but it's surprising since reading without the schema we do not
> use skip=1, and it's natural to want to go and declare column types while
> preserving csv column names. The error messages on doing so aren't helpful,
> since if you forget skip=1, you are just told that any column that is not a
> string is "the incorrect type". The open_dataset() docs imply that we can
> use read_csv_arrow() options, which suggest that we could provide types using
> col_types() instead of schema, but this appears not to be the case. Also