Carl Boettiger created ARROW-15879:
--------------------------------------
Summary: passing a schema causes open_dataset to fail on
hive-partitioned csv files
Key: ARROW-15879
URL: https://issues.apache.org/jira/browse/ARROW-15879
Project: Apache Arrow
Issue Type: Improvement
Components: R
Affects Versions: 7.0.0, 7.0.1
Reporter: Carl Boettiger
Consider this reprex:
Create a dataset with hive partitions in csv format with write_dataset() (so
cool!):
{code:java}
library(arrow)
library(dplyr)
path <- fs::dir_create("tmp")
mtcars %>% group_by(gear) %>% write_dataset(path, format="csv")

## works fine, even with 'collect()'
ds <- open_dataset(path, format="csv")

## but pass a schema, and things fail
df <- open_dataset(path, format="csv", schema = ds$schema, skip_rows=1)
df %>% collect()
{code}
In the first call to open_dataset, we don't pass a schema and things work as
expected.
However, csv files often need a schema to be read in correctly, particularly
with partitioned data, where it is easy to 'guess' the wrong type. Passing the
schema, though, confuses open_dataset: the grouping column (partition column)
isn't found in the individual files, even though it is listed in the schema!
Nor can we just omit the grouping column from the schema, since then it is
effectively lost from the data.
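A minimal sketch of the mismatch described above, for anyone who wants to
check it locally. It writes the same dataset to a scratch directory (the
directory name "arrow-15879-demo" is just a placeholder for this illustration)
and compares the columns of one csv file with the columns of the inferred
dataset schema. If the description above is right, the first check should
print FALSE and the second TRUE:
{code:java}
library(arrow)
library(dplyr)

# Scratch location for the demo; the directory name is a placeholder
path <- file.path(tempdir(), "arrow-15879-demo")
dir.create(path, showWarnings = FALSE)

# Same hive-partitioned csv dataset as in the reprex
mtcars %>% group_by(gear) %>% write_dataset(path, format = "csv")

# Partition values live in the directory names (gear=<value>), not in the files
files <- list.files(path, recursive = TRUE, full.names = TRUE)
print(basename(dirname(files)))

# Is the partition column present in the header of an individual csv file?
file_cols <- names(read.csv(files[1]))
print("gear" %in% file_cols)

# Is it present in the schema that open_dataset() infers for the dataset?
ds <- open_dataset(path, format = "csv")
print("gear" %in% ds$schema$names)
{code}
This only shows where the 'gear' column lives; it does not attempt a
workaround for the failure itself.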