[jira] [Commented] (ARROW-15088) [R] Support for csv options on open_dataset

Carl Boettiger (Jira) Fri, 28 Jan 2022 10:15:06 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-15088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483901#comment-17483901
 ]


Carl Boettiger commented on ARROW-15088:
----------------------------------------

Sorry for the confusion.  Here are some reproducible examples:

"na is recognized in read_csv_arrow but not implemented on open_dataset". 
Witness:
{code:java}
library(arrow)
library(readr)
csv <- readr_example("mtcars.csv")
read_csv_arrow(csv, na="-99")  # Works
open_dataset(csv, na="-99")    # Error:
# Error in ParquetFragmentScanOptions$create(...) : 
# unused argument (na = "-99")
{code}



- Re the docs, I'm explicitly referring to the docstring about the `...` option 
in `open_dataset`, which links to dataset_factory, which says:
> Additional format-specific options, passed to {{FileFormat$create()}}. 
> For CSV options, note that you can specify them either with the Arrow C++ 
> library naming ("delimiter", "quoting", etc.) or the {{readr}}-style 
> naming used in 
> {{[read_csv_arrow()|https://arrow.apache.org/docs/r/reference/read_delim_arrow.html]}} 
> ("delim", "quote", etc.). Not all {{readr}} options are currently supported; 
> please file an issue if you encounter one that {{arrow}} should support.

 

Re `null_strings`, I'm stumped there too. I still see `null_values` documented 
at [https://arrow.apache.org/docs/r/reference/CsvReadOptions.html], but I 
cannot use it with either open_dataset() or read_csv_arrow().  Can you point me 
to a reproducible example that uses that option?  Am I just mistaken in 
thinking that the options documented at that link can be used in 
open_dataset(), read_csv_arrow(), etc.?

 

> [R] Support for csv options on open_dataset
> -------------------------------------------
>
>                 Key: ARROW-15088
>                 URL: https://issues.apache.org/jira/browse/ARROW-15088
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>    Affects Versions: 6.0.2
>            Reporter: Carl Boettiger
>            Priority: Major
>
> There are a lot of gotchas created by heterogeneity in arrow's support for 
> csv parsing options between read_csv_arrow() and open_dataset() (and further 
> issues arising from migrating from readr::read_csv()).  Not sure if it's more 
> helpful to report these in one place or as separate issues, but here are a few 
> that keep tripping me up:
>  
>  * "na" (defining the na-character choices) is not implemented on 
> open_dataset(), though it is on read_csv_arrow()
>  * somewhat confusingly, open_dataset() does support `null_strings`, which 
> appears to play the same role.  The docs, however, suggest that 
> `open_dataset()` `...` options are passed to `dataset_factory()`.  I think 
> those docs should link to 
> [https://arrow.apache.org/docs/r/reference/CsvReadOptions.html].  
> [https://arrow.apache.org/docs/r/reference/FileFormat.html] suggests that 
> `null_strings` is not one of the recognized CsvReadOptions, but it seems that 
> it now is.  I appreciate the challenge of supporting both the readr-like 
> options and the native arrow option names here, but the functionality and 
> documentation remain very confusing!
> Another gotcha: in the arrow 6.0 release, if we supply an arrow schema, 
> open_dataset() assumes the first line of the csv is data and not column 
> headers, so we have to use skip=1.  I see the logic (the schema names the 
> columns anyway, so if we're going with those names, why parse the names 
> from the csv?), but it's surprising, since when reading without the schema we 
> do not use skip=1, and it's natural to want to declare column types while 
> preserving the csv column names.  The error messages aren't helpful either: 
> if you forget skip=1, you are just told that any column that is not a 
> string is "the incorrect type".  The open_dataset() docs imply that we can 
> use read_csv_arrow() options, which suggests that we could provide types using 
> col_types() instead of schema, but this appears not to be the case.  Also



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
