[ 
https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-6830:
---------------------------------
    Description: 
*Note:*  Not sure if this is a limitation of the R library or the underlying 
C++ code:

I have a ~30 GB Arrow file with almost 1,000 columns. It has 12,000 record 
batches with varying row counts.

1. Is it possible to use *read_arrow* to select a subset of columns (similar to 
how *read_feather* has {{col_select = ...}})?

2. Or is it possible using *RecordBatchFileReader* to filter columns?
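
For reference, the {{col_select}} behaviour mentioned in item 1 can be sketched as below. This is a minimal sketch, assuming a recent release of the arrow R package in which *write_feather* writes Feather V2 (the Arrow IPC file format). The temporary file and the column names are made up for illustration, and whether the same call works on a file produced by another writer is not confirmed here.

```r
library(arrow)

# Hypothetical small file standing in for the real ~30 GB one.
path <- tempfile(fileext = ".arrow")
write_feather(
  data.frame(a = 1:3, b = c("x", "y", "z"), c = c(0.1, 0.2, 0.3)),
  path
)

# col_select accepts a character vector of column names;
# only the named columns are returned.
subset_df <- read_feather(path, col_select = c("a", "c"))
print(names(subset_df))
```

The open question in this issue is whether *read_arrow* and *RecordBatchFileReader* can offer the same kind of selection.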

 

The only thing I seem to be able to do (please confirm if this is my only 
option) is loop over all record batches, select a single column at a time, and 
construct the data I need manually, i.e. something like the following:


{{# data_rbfr is an already-open RecordBatchFileReader}}
{{# get_batch() is called with 0-based indices, so the last valid index is n - 1}}
{{for (i in seq_len(data_rbfr$num_record_batches) - 1) {}}
{{  rbn <- data_rbfr$get_batch(i)}}
{{  dfn <- as.data.frame(rbn$column(5)$as_vector())}}
{{  if (i == 0) {}}
{{    merged <- dfn}}
{{  } else {}}
{{    merged <- rbind(merged, dfn)}}
{{  }}}
{{  print(paste(i, nrow(merged)))}}
{{}}}
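
The manual loop described above can be written so that the per-batch pieces are collected in a list and combined once at the end, which avoids growing a data frame with rbind on every iteration. This is a minimal sketch, assuming a recent arrow R release that provides {{RecordBatchFileReader$create()}} and {{GetColumnByName()}}. The file, the column name "b", and the small {{chunk_size}} are hypothetical and only serve to make the example self-contained.

```r
library(arrow)

# Hypothetical small file; chunk_size is set low so the file is split into chunks.
path <- tempfile(fileext = ".arrow")
df <- data.frame(a = 1:5, b = letters[1:5], stringsAsFactors = FALSE)
write_feather(df, path, chunk_size = 2)

reader <- RecordBatchFileReader$create(path)
n <- reader$num_record_batches

# Pull one column out of every batch, then combine once after the loop.
pieces <- vector("list", n)
for (i in seq_len(n) - 1L) {              # batch indices start at 0
  batch <- reader$get_batch(i)
  pieces[[i + 1L]] <- batch$GetColumnByName("b")$as_vector()
}
merged <- data.frame(b = unlist(pieces), stringsAsFactors = FALSE)
print(nrow(merged))
```

Selecting by name rather than by position is a choice made for this sketch; the positional {{column()}} call from the loop above would work the same way.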

 

 

  was:
*Note:*  Not sure if this is a limitation of the R library or the underlying 
C++ code:

I have a ~30 GB Arrow file with almost 1,000 columns. It has 12,000 record 
batches with varying row counts.

1. Is it possible to use *read_arrow* to select a subset of columns (similar to 
how *read_feather* has {{col_select = ...}})?

2. Or is it possible using *RecordBatchFileReader* to filter columns?

 

The only thing I seem to be able to do (please confirm if this is my only 
option) is loop over all record batches, select a single column at a time, and 
construct the data I need manually, i.e. something like the following:

{{data_rbfr <- arrow::RecordBatchFileReader("arrowfile")}}

{{# get_batch() is called with 0-based indices, so the last valid index is n - 1}}
{{for (i in seq_len(data_rbfr$num_record_batches) - 1) {}}
{{  rbn <- data_rbfr$get_batch(i)}}
{{  dfn <- as.data.frame(rbn$column(5)$as_vector())}}
{{  if (i == 0) {}}
{{    merged <- dfn}}
{{  } else {}}
{{    merged <- rbind(merged, dfn)}}
{{  }}}
{{}}}

 


> Question / Feature Request- Select Subset of Columns in read_arrow
> ------------------------------------------------------------------
>
>                 Key: ARROW-6830
>                 URL: https://issues.apache.org/jira/browse/ARROW-6830
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++, R
>            Reporter: Anthony Abate
>            Priority: Minor
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
