[jira] [Commented] (ARROW-6830) [R] Select Subset of Columns in read_arrow

Neal Richardson (Jira) Thu, 10 Oct 2019 09:04:45 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948727#comment-16948727
 ]


Neal Richardson commented on ARROW-6830:
----------------------------------------

[https://github.com/apache/arrow/blob/master/r/R/read-table.R] is pretty simple 
(and note that if you give it a string file name, it will invoke 
RecordBatchFileReader). There are no additional arguments that would let you 
push computation down to record batches contained within the file (though I 
thought we were talking about selecting columns). We are working on a 
[C++ Datasets API|https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?pli=1#heading=h.22aikbvt54fv] 
that will do that and much more. 

If you want to do some of that in R now, RecordBatchFileReader sounds like a 
reasonable place to start. It memory maps by default, and as you've seen you 
can iterate over the batches. You can filter each record batch separately 
(using {{[}} methods or lower level if you prefer) and collect them all into a 
data.frame.
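
A minimal sketch of that approach, in case it helps. It assumes a recent 
release of the arrow R package, where the reader is constructed with 
RecordBatchFileReader$create() (older releases called RecordBatchFileReader() 
directly). The helper name read_columns_by_batch and its arguments are 
hypothetical, not part of arrow:

```r
library(arrow)

# Hypothetical helper (not part of arrow): read only the requested columns
# from an Arrow IPC file, one record batch at a time.
read_columns_by_batch <- function(path, columns) {
  # Assumption: a recent arrow release. Older releases used
  # RecordBatchFileReader(path) instead of $create(path).
  reader <- RecordBatchFileReader$create(path)
  n <- reader$num_record_batches
  pieces <- vector("list", n)
  for (i in seq_len(n)) {
    # get_batch() is 0-indexed, so batch i lives at position i - 1.
    batch <- reader$get_batch(i - 1L)
    # `[` on a RecordBatch keeps only the named columns before conversion.
    pieces[[i]] <- as.data.frame(batch[, columns])
  }
  # Bind once at the end rather than growing a data.frame inside the loop.
  do.call(rbind, pieces)
}
```

Calling read_columns_by_batch("data.arrow", c("a", "c")) on a file with 
columns a, b, c should return a data.frame holding only a and c. Each batch is 
subset before conversion, so unselected columns are never turned into R 
vectors.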

> [R] Select Subset of Columns in read_arrow
> ------------------------------------------
>
>                 Key: ARROW-6830
>                 URL: https://issues.apache.org/jira/browse/ARROW-6830
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: R
>            Reporter: Anthony Abate
>            Priority: Minor
>
> *Note:*  Not sure if this is a limitation of the R library or the underlying 
> C++ code:
> I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
> batches of varying row sizes
> 1. Is it possible to use *read_arrow* to filter out columns? (Similar to 
> how *read_feather* has a col_select = ... argument.)
> 2. Or is it possible using *RecordBatchFileReader* to filter columns?
>  
> The only thing I seem to be able to do (please confirm if this is my only 
> option) is loop over all record batches, select a single column at a time, 
> and construct the data I need to pull out manually, i.e., like the following:
> {code:java}
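> # get_batch() is 0-indexed, so the last valid index is num_record_batches - 1.
> n <- data_rbfr$num_record_batches
> pieces <- vector("list", n)
> for (i in seq_len(n) - 1L) {
>   rbn <- data_rbfr$get_batch(i)
>   # Extract the column at index 5 from this batch as an R vector.
>   pieces[[i + 1L]] <- as.data.frame(rbn$column(5)$as_vector())
> }
> # Bind once at the end instead of calling rbind() on every iteration.
> merged <- do.call(rbind, pieces)
> print(nrow(merged))
> {code}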
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
