Shaun Nielsen created ARROW-12083:
-------------------------------------
Summary: [R] schema use in open_dataset
Key: ARROW-12083
URL: https://issues.apache.org/jira/browse/ARROW-12083
Project: Apache Arrow
Issue Type: Improvement
Components: R
Affects Versions: 3.0.0
Environment: Windows
Reporter: Shaun Nielsen
I have a directory of split .csv files that I'm importing with open_dataset().
Across files, one column is inferred as int64 in some (e.g. -2) and as
string in others (e.g. 1986CD), and this throws an error when {{unify_schemas = T}}:
{{arrow::open_dataset('./split-csvs/nswcr/', format = 'csv', unify_schemas = T)}}
{{Error: Invalid: Unable to merge: Field SEIFACalcMethod has incompatible
types: int64 vs string}}
If I use the schema parameter and specify only this column, then only this
column is imported:
{{arrow::open_dataset('./split-csvs/nswcr/', format = 'csv', schema = schema(SEIFACalcMethod = string()))}}
{{FileSystemDataset with 45 csv files}}
{{SEIFACalcMethod: string}}
I was expecting that I could set the type of a select few columns, while the
rest would be imported as-is, similar to the
readr::read_csv(col_types = cols()) approach.
Not sure if this is expected behaviour, a bug, or a possible avenue for
improvement. I've tagged this as the latter. (y)
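A possible workaround in the meantime, sketched under assumptions rather than confirmed as the intended usage: infer the full schema from the dataset without unifying, override only the conflicting column, and pass the complete schema back to open_dataset(). The helper override_types() is hypothetical (it is not part of arrow), and the two temporary files and the Year column are made up to stand in for the real directory.

```r
library(arrow)

# Hypothetical helper (not part of arrow): copy `sch`, replacing the types of
# the named columns and keeping every other column's inferred type.
override_types <- function(sch, ...) {
  overrides <- list(...)
  unknown <- setdiff(names(overrides), sch$names)
  if (length(unknown) > 0) {
    stop("Columns not in schema: ", paste(unknown, collapse = ", "))
  }
  # Current type of every field, keyed by column name.
  types <- lapply(sch$fields, function(f) f$type)
  names(types) <- sch$names
  types[names(overrides)] <- overrides
  do.call(schema, types)
}

# Two small made-up files that reproduce the int64 vs string conflict.
csv_dir <- tempfile("split-csvs")
dir.create(csv_dir)
write.csv(data.frame(Year = 2001L, SEIFACalcMethod = -2L),
          file.path(csv_dir, "part-1.csv"), row.names = FALSE)
write.csv(data.frame(Year = 2002L, SEIFACalcMethod = "1986CD"),
          file.path(csv_dir, "part-2.csv"), row.names = FALSE)

# Without unify_schemas, the schema is inferred from a single file,
# so no merge of incompatible types is attempted.
inferred <- open_dataset(csv_dir, format = "csv")$schema

# Override only the problem column; all other columns keep their types.
full_schema <- override_types(inferred, SEIFACalcMethod = string())
ds <- open_dataset(csv_dir, format = "csv", schema = full_schema)
ds$schema
```

For the directory in this report, csv_dir would be './split-csvs/nswcr/'. The sketch only inspects the resulting schema; how the header row is treated when an explicit schema is supplied may depend on the arrow version, so the first rows of the data are worth checking before relying on this.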
--
This message was sent by Atlassian Jira
(v8.3.4#803005)