[jira] [Closed] (ARROW-4027) Reading partitioned datasets using RecordBatchFileReader into R

Wes McKinney (JIRA) Thu, 13 Dec 2018 21:54:18 -0800


     [ 
https://issues.apache.org/jira/browse/ARROW-4027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Wes McKinney closed ARROW-4027.
-------------------------------
    Resolution: Duplicate

[~jeffreyw] Reading Parquet files is not yet supported in R; see ARROW-3731 for 
a discussion of adding an R API. Help would be appreciated.
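
For context on the `Not an Arrow file` error: RecordBatchFileReader reads the Arrow IPC file format, which is a different on-disk format from Parquet. A minimal sketch of telling the two apart, assuming only the documented magic bytes of each format (`PAR1` for Parquet, `ARROW1` for Arrow IPC files). The helper name `sniff_format` is hypothetical and is not part of pyarrow or the R package.

```python
import os

PARQUET_MAGIC = b"PAR1"        # Parquet files start and end with these 4 bytes
ARROW_FILE_MAGIC = b"ARROW1"   # Arrow IPC files start and end with these 6 bytes


def sniff_format(path):
    """Guess the format of `path` from its leading and trailing magic bytes.

    Hypothetical helper for illustration; returns one of
    "directory", "parquet", "arrow-file", or "unknown".
    """
    if os.path.isdir(path):
        return "directory"
    with open(path, "rb") as f:
        head = f.read(6)
        f.seek(0, os.SEEK_END)
        size = f.tell()
        # Read the last 6 bytes (or the whole file if it is shorter).
        f.seek(max(size - 6, 0))
        tail = f.read(6)
    if head.startswith(PARQUET_MAGIC) and tail.endswith(PARQUET_MAGIC):
        return "parquet"
    if head == ARROW_FILE_MAGIC and tail == ARROW_FILE_MAGIC:
        return "arrow-file"
    return "unknown"
```

Calling `sniff_format("data/foo1")` on a file written by Hive would be expected to report `"parquet"`, which is consistent with the Arrow file reader rejecting it.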

> Reading partitioned datasets using RecordBatchFileReader into R
> ---------------------------------------------------------------
>
>                 Key: ARROW-4027
>                 URL: https://issues.apache.org/jira/browse/ARROW-4027
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 0.11.1
>         Environment: Ubuntu 16.04, building R package from master on github
>            Reporter: Jeffrey Wong
>            Priority: Major
>
> I have a parquet dataset (which originally came from Hive) stored locally in 
> the directory `data/`. It has 4 files in it:
> ```
>  data/foo1
>  data/foo2
>  data/foo3
>  data/foo4
>  ```
> Using pyarrow I can read them via
> `pq.read_table("data/foo1").to_pandas()`.
> I am trying to read them into R using `read_table("data/foo1")`, but I 
> receive this error:
> ```
>  Error in ipc___RecordBatchFileReader__Open(file) : 
>  Invalid: Not an Arrow file 
>  ```
> From debugging, I've traced it to this line 
> [https://github.com/apache/arrow/blob/d54a154263e1dba5515ebe1a8423a676a01e3951/r/R/RecordBatchReader.R#L112],
>  which then goes to this Rcpp code 
> [https://github.com/apache/arrow/blob/d54a154263e1dba5515ebe1a8423a676a01e3951/r/src/recordbatchreader.cpp#L85].
>  It seems that this C++ function expects a single file-like object 
> [https://arrow.apache.org/docs/cpp/classarrow_1_1ipc_1_1_record_batch_file_reader.html#a7e6c66ca32d75bc8d4ee905982d9819e].
>  I think that because my data is split, the footer that is supposed to 
> contain the file layout and schema cannot be found, hence the error "Not 
> an Arrow file".
>  
> If I pass the whole directory using `read_table("data/")` I will get
> ```
>  Error in ipc___RecordBatchFileReader__Open(file) : 
>  IOError: Error reading bytes from file: Is a directory 
>  ```
> So, how can I use the R package to correctly read multiple parquet files? If 
> I need to call RecordBatchFileReader with a pointer to the footer, file 
> layout and schema, how do I find the footer of the dataset? 
>  
>  
> I cannot post the original dataset online, and I don't know what aspect of my 
> data causes the code to break, so I don't quite know how to post a 
> reproducible example. Tips on how to generate a partitioned dataset would be 
> great.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
