Jeffrey Wong created ARROW-4027:
-----------------------------------

             Summary: Reading partitioned datasets using RecordBatchFileReader into R
                 Key: ARROW-4027
                 URL: https://issues.apache.org/jira/browse/ARROW-4027
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
    Affects Versions: 0.11.1
         Environment: Ubuntu 16.04, building R package from master on github
            Reporter: Jeffrey Wong

I have a Parquet dataset (which originally came from Hive) stored locally in the directory `data/`. It contains 4 files:

```
data/foo1
data/foo2
data/foo3
data/foo4
```

Using pyarrow, I can read them via `pq.read_table("data/foo1").to_pandas()`.

I am trying to read them into R using `read_table("data/foo1")`, but I receive this error:

```
Error in ipc___RecordBatchFileReader__Open(file) : Invalid: Not an Arrow file
```

From debugging, I've traced it to this line: https://github.com/apache/arrow/blob/d54a154263e1dba5515ebe1a8423a676a01e3951/r/R/RecordBatchReader.R#L112, which then goes to this Rcpp code: https://github.com/apache/arrow/blob/d54a154263e1dba5515ebe1a8423a676a01e3951/r/src/recordbatchreader.cpp#L85. It seems that this C++ function is expecting a single "[file like object](https://arrow.apache.org/docs/cpp/classarrow_1_1ipc_1_1_record_batch_file_reader.html#a7e6c66ca32d75bc8d4ee905982d9819e)". I think that because my data is split, the footer that is supposed to contain the file layout and schema cannot be found, hence the error "Not an Arrow file".

If I pass the whole directory using `read_table("data/")`, I get:

```
Error in ipc___RecordBatchFileReader__Open(file) : IOError: Error reading bytes from file: Is a directory
```

I cannot post the original dataset online, and I don't know what aspect of my data causes the code to break, so I don't quite know how to post a reproducible example. Tips on how to generate a partitioned dataset would be great.
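
One check that may help narrow this down without sharing the data is to look at the magic bytes of each file. The sketch below relies on two documented conventions: Parquet files begin and end with `PAR1`, and Arrow IPC files begin and end with `ARROW1`. The helper name `sniff_format` is made up for illustration and is not part of pyarrow or the R package.

```python
import os

# Magic byte strings defined by the respective file format specifications.
PARQUET_MAGIC = b"PAR1"
ARROW_MAGIC = b"ARROW1"


def sniff_format(path):
    """Guess the container format from the leading and trailing magic bytes.

    Hypothetical helper for illustration only. Returns "parquet",
    "arrow-file", or "unknown".
    """
    with open(path, "rb") as f:
        head = f.read(8)
        # Jump to the end to find the file size, then read the last 8 bytes.
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(size - 8, 0))
        tail = f.read(8)

    if head.startswith(PARQUET_MAGIC) and tail.endswith(PARQUET_MAGIC):
        return "parquet"
    if head.startswith(ARROW_MAGIC) and tail.endswith(ARROW_MAGIC):
        return "arrow-file"
    return "unknown"
```

Calling `sniff_format("data/foo1")` on each of the four files would show whether they carry the `ARROW1` marker at all. If they report `parquet`, the error may come from the file format rather than from the way the data is split, though that is only a guess until it is checked against the real files.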