[jira] [Created] (ARROW-4027) Reading partitioned datasets using RecordBatchFileReader into R

Jeffrey Wong (JIRA) Thu, 13 Dec 2018 20:21:26 -0800

Jeffrey Wong created ARROW-4027:
-----------------------------------

             Summary: Reading partitioned datasets using RecordBatchFileReader into R
                 Key: ARROW-4027
                 URL: https://issues.apache.org/jira/browse/ARROW-4027
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
    Affects Versions: 0.11.1
         Environment: Ubuntu 16.04, building R package from master on github
            Reporter: Jeffrey Wong



I have a Parquet dataset (which originally came from Hive) stored locally in 
the directory `data/`. It contains 4 files:

```
data/foo1
data/foo2
data/foo3
data/foo4
```

Using pyarrow, I can read them via:

`pq.read_table("data/foo1").to_pandas()`

I am trying to read them into R using `read_table("data/foo1")`, but I receive 
this error:

```
 Error in ipc___RecordBatchFileReader__Open(file) : 
 Invalid: Not an Arrow file 
```

From debugging, I've traced it to this line:
https://github.com/apache/arrow/blob/d54a154263e1dba5515ebe1a8423a676a01e3951/r/R/RecordBatchReader.R#L112,
which then goes to this Rcpp code:
https://github.com/apache/arrow/blob/d54a154263e1dba5515ebe1a8423a676a01e3951/r/src/recordbatchreader.cpp#L85.
It seems that this C++ function is expecting a single 
"[file like object](https://arrow.apache.org/docs/cpp/classarrow_1_1ipc_1_1_record_batch_file_reader.html#a7e6c66ca32d75bc8d4ee905982d9819e)".
I think that, because my data is split, the footer that is supposed to 
contain the file layout and schema cannot be found, hence the error 
`Not an Arrow file`.
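One way to narrow this down might be to check which container format each file actually uses. The sketch below is only an illustration in standard-library Python (the helper name `sniff_format` is hypothetical and not part of Arrow). It relies on the fact that Parquet files begin and end with the magic bytes `PAR1`, while the Arrow IPC file format begins and ends with `ARROW1`.

```python
import os

PARQUET_MAGIC = b"PAR1"   # Parquet files start and end with these 4 bytes
ARROW_MAGIC = b"ARROW1"   # Arrow IPC files start and end with these 6 bytes


def sniff_format(path):
    """Hypothetical helper: classify a path as 'parquet', 'arrow',
    'directory', or 'unknown' by looking at its magic bytes."""
    if os.path.isdir(path):
        return "directory"

    with open(path, "rb") as f:
        # Read enough leading bytes to cover the longer magic string.
        head = f.read(len(ARROW_MAGIC))
        f.seek(0, os.SEEK_END)
        size = f.tell()
        tail = b""
        if size >= len(ARROW_MAGIC):
            # Read the same number of trailing bytes.
            f.seek(-len(ARROW_MAGIC), os.SEEK_END)
            tail = f.read()

    if head.startswith(PARQUET_MAGIC) and tail.endswith(PARQUET_MAGIC):
        return "parquet"
    if head == ARROW_MAGIC and tail == ARROW_MAGIC:
        return "arrow"
    return "unknown"
```

If `sniff_format("data/foo1")` were to return `"parquet"`, that would suggest the file is well formed but is simply not in the Arrow file format that `RecordBatchFileReader` checks for. A result of `"unknown"` would point more towards a problem with the file itself.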

If I pass the whole directory using `read_table("data/")` I will get

```
Error in ipc___RecordBatchFileReader__Open(file) : 
 IOError: Error reading bytes from file: Is a directory 
```

I cannot post the original dataset online, and I don't know what aspect of my 
data causes the code to break, so I don't quite know how to post a reproducible 
example. Tips on how to generate a partitioned dataset would be great.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
