[jira] [Created] (ARROW-12083) [R] schema use in open_dataset

Shaun Nielsen (Jira) Wed, 24 Mar 2021 18:21:04 -0700

Shaun Nielsen created ARROW-12083:
-------------------------------------

             Summary: [R] schema use in open_dataset
                 Key: ARROW-12083
                 URL: https://issues.apache.org/jira/browse/ARROW-12083
             Project: Apache Arrow
          Issue Type: Improvement
          Components: R
    Affects Versions: 3.0.0
         Environment: Windows
            Reporter: Shaun Nielsen



I have a directory of split .csv files that I'm importing with open_dataset(). 
One column is inferred as int64 in some files (e.g. -2) and as string in 
others (e.g. 1986CD), which throws an error when {{unify_schemas = T}}:

{{arrow::open_dataset('./split-csvs/nswcr/', format = 'csv', unify_schemas = T)}}

{{Error: Invalid: Unable to merge: Field SEIFACalcMethod has incompatible 
types: int64 vs string}}

If I use the schema parameter and specify only this column, then only this 
column is imported:

{{arrow::open_dataset('./split-csvs/nswcr/', format = 'csv', schema = schema(SEIFACalcMethod = string()))}}

{{FileSystemDataset with 45 csv files}}
{{SEIFACalcMethod: string}}

I was expecting that I could set the type of a select few columns, while the 
rest would be imported as-is, similar to the 
readr::read_csv(col_types = cols()) approach.
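
To illustrate the readr behaviour I mean, here is a minimal sketch. The temporary file, its {{id}} column and its values are made up for illustration and are not my real data; only the column name SEIFACalcMethod comes from the dataset above.

```r
library(readr)

# Hypothetical sample file standing in for one of the split CSVs.
path <- tempfile(fileext = ".csv")
writeLines(c("id,SEIFACalcMethod", "1,-2", "2,-1"), path)

# Only SEIFACalcMethod is given an explicit type. Columns not named in
# cols() fall back to readr's default type guessing.
df <- read_csv(path, col_types = cols(SEIFACalcMethod = col_character()))

# Both columns are present: id is guessed, SEIFACalcMethod is character.
str(df)
```

This is the partial-specification behaviour I was hoping the schema argument of open_dataset() would offer.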

Not sure if this is expected behaviour, a bug, or a possible avenue for 
improvement. I've tagged it as an improvement. (y)

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
