Carl Boettiger created ARROW-15879:
--------------------------------------

             Summary: passing a schema causes open_dataset to fail on 
hive-partitioned csv files
                 Key: ARROW-15879
                 URL: https://issues.apache.org/jira/browse/ARROW-15879
             Project: Apache Arrow
          Issue Type: Improvement
          Components: R
    Affects Versions: 7.0.0, 7.0.1
            Reporter: Carl Boettiger


Consider this reprex:

Create a dataset with hive partitions in csv format with write_dataset() (so 
cool!):

{code:java}
library(arrow)
library(dplyr)
path <- fs::dir_create("tmp")
mtcars %>% group_by(gear) %>% write_dataset(path, format="csv")

## works fine, even with 'collect()'
ds <- open_dataset(path, format="csv")

## but pass a schema, and things fail
df <- open_dataset(path, format="csv", schema = ds$schema, skip_rows=1)
df %>% collect()
 {code}
In the first call to open_dataset, we don't pass a schema and things work as 
expected. 

However, csv files often need a schema to be read in correctly, particularly 
with partitioned data, where it is easy to 'guess' the wrong type. Passing the 
schema, though, makes open_dataset fail: the grouping column (partition 
column) is listed in the schema, but it is not found in the individual files, 
because its values are stored only in the directory names.

Nor can we just omit the grouping column from the schema, since then it is 
effectively lost from the data. 
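
To make the mismatch concrete, here is a minimal base-R sketch that imitates a 
hive-partitioned csv layout by hand. It does not use arrow and is not how 
write_dataset() or open_dataset() are implemented. The directory name 
"hive_demo", the file name "part-0.csv", and the helper read_part() are 
assumptions made up for illustration. It only shows that each file lacks the 
'gear' column and that the value has to be recovered from the path:

```r
# Base-R imitation of a hive-partitioned csv layout (not arrow's implementation).
root <- file.path(tempdir(), "hive_demo")  # hypothetical location
unlink(root, recursive = TRUE)

# One directory per gear value; the partition column is dropped from each file.
for (g in sort(unique(mtcars$gear))) {
  part_dir <- file.path(root, paste0("gear=", g))
  dir.create(part_dir, recursive = TRUE)
  part <- mtcars[mtcars$gear == g, names(mtcars) != "gear"]
  write.csv(part, file.path(part_dir, "part-0.csv"), row.names = FALSE)
}

# Relative paths of the form "gear=<value>/part-0.csv"
files <- list.files(root, recursive = TRUE)

# Columns present in a single file: everything except 'gear'
file_cols <- names(read.csv(file.path(root, files[1])))

# Hypothetical helper: re-attach the partition value parsed from the directory name.
read_part <- function(rel_path) {
  df <- read.csv(file.path(root, rel_path))
  df$gear <- as.numeric(sub("^gear=", "", dirname(rel_path)))
  df
}
restored <- do.call(rbind, lapply(files, read_part))
```

A schema that describes the full table (including 'gear') therefore has one 
more field than any single file has columns, which matches the failure 
described above.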


