[
https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948751#comment-16948751
]
Anthony Abate commented on ARROW-6830:
--------------------------------------
{quote}You can filter each record batch separately (using {{[}} methods or
lower level if you prefer) and collect them all into a data.frame.
{quote}
This is what I am doing. Is there a better way, so that I can pull multiple
columns in a single pass?
{code:java}
rbn <- data_rbfr$get_batch(i)
df <- data.frame(
rbn$column(5)$as_vector(),rbn$column(6)$as_vector(),rbn$column(100)$as_vector(),rbn$column(687)$as_vector(),
rbn$column(444)$as_vector(),rbn$column(36)$as_vector(),rbn$column(500)$as_vector(),rbn$column(897)$as_vector(),
rbn$column(24)$as_vector(),rbn$column(446)$as_vector(),rbn$column(777)$as_vector(),rbn$column(333)$as_vector(),
rbn$column(96)$as_vector(),rbn$column(555)$as_vector(),rbn$column(888)$as_vector(),rbn$column(222)$as_vector()
) {code}
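For reference, a minimal sketch of how the snippet above might be generalized, using only the calls already shown in this thread ({{get_batch}}, {{num_record_batches}}, {{column(i)$as_vector()}}). The helper names {{batch_to_df}} and {{read_columns}} and the {{col_<index>}} column names are hypothetical, and the sketch assumes batch indices are zero-based, as the {{i == 0}} check in the issue description suggests. It is not a built-in column filter; it only wraps the manual approach.

```r
# Hypothetical helper: pull several columns out of one record batch.
# Relies only on batch$column(j)$as_vector(), as used earlier in this thread.
batch_to_df <- function(batch, cols) {
  vectors <- lapply(cols, function(j) batch$column(j)$as_vector())
  names(vectors) <- paste0("col_", cols)  # assumed naming scheme, by index
  as.data.frame(vectors, stringsAsFactors = FALSE)
}

# Hypothetical driver: apply the helper to every batch, then bind once.
# Assumes get_batch() takes a zero-based index.
read_columns <- function(reader, cols) {
  n <- reader$num_record_batches
  pieces <- lapply(seq_len(n) - 1L, function(i) {
    batch_to_df(reader$get_batch(i), cols)
  })
  do.call(rbind, pieces)  # one rbind at the end instead of one per batch
}
```

With the reader from this issue, the call would look like {{read_columns(data_rbfr, c(5, 6, 100, 687))}}. Collecting the per-batch data frames in a list and binding them once avoids re-copying the accumulated result on every iteration, which should matter with 12,000 batches.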
> [R] Select Subset of Columns in read_arrow
> ------------------------------------------
>
> Key: ARROW-6830
> URL: https://issues.apache.org/jira/browse/ARROW-6830
> Project: Apache Arrow
> Issue Type: New Feature
> Components: R
> Reporter: Anthony Abate
> Priority: Minor
>
> *Note:* Not sure if this is a limitation of the R library or the underlying
> C++ code:
> I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record
> batches of varying row sizes
> 1. Is it possible to use *read_arrow* to filter out columns (similar to
> how *read_feather* has col_select = ...)?
> 2. Or is it possible using *RecordBatchFileReader* to filter columns?
>
> The only thing I seem to be able to do (please confirm if this is my only
> option) is loop over all record batches, select a single column at a time,
> and construct the data I need to pull out manually, i.e. like the following:
> {code:java}
> for(i in 0:(data_rbfr$num_record_batches - 1)) {
> rbn <- data_rbfr$get_batch(i)
>
> if (i == 0)
> {
> merged <- as.data.frame(rbn$column(5)$as_vector())
> }
> else
> {
> dfn <- as.data.frame(rbn$column(5)$as_vector())
> merged <- rbind(merged,dfn)
> }
>
> print(paste(i, nrow(merged)))
> } {code}
>
>