[ https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Anthony Abate updated ARROW-6830:
---------------------------------
    Description:
*Note:* Not sure if this is a limitation of the R library or the underlying C++ code:

I have a ~30 GB arrow file with almost 1000 columns. It has 12,000 record batches with varying numbers of rows.

1. Is it possible to use *read_arrow* to filter out columns? (similar to how *read_feather* has a col_select =... argument)
2. Or is it possible to filter columns using *RecordBatchFileReader*?

The only thing I seem to be able to do (please confirm if this is my only option) is loop over all record batches, select a single column at a time, and construct the data I need to pull out manually, i.e. like the following:

{{for(i in 0:(data_rbfr$num_record_batches - 1)) {}}
{{  rbn <- data_rbfr$get_batch(i)}}
{{  if (i == 0) }}
{{  {}}
{{    merged <- as.data.frame(rbn$column(5)$as_vector())}}
{{  }}}
{{  else }}
{{  {}}
{{    dfn <- as.data.frame(rbn$column(5)$as_vector())}}
{{    merged <- rbind(merged,dfn)}}
{{  }}}
{{  print(paste(i, nrow(merged)))}}
{{}}}

  was:
*Note:* Not sure if this is a limitation of the R library or the underlying C++ code:

I have a ~30 GB arrow file with almost 1000 columns. It has 12,000 record batches with varying numbers of rows.

1. Is it possible to use *read_arrow* to filter out columns? (similar to how *read_feather* has a col_select =... argument)
2. Or is it possible to filter columns using *RecordBatchFileReader*?

The only thing I seem to be able to do (please confirm if this is my only option) is loop over all record batches, select a single column at a time, and construct the data I need to pull out manually, i.e. like the following:

{{data_rbfr <- arrow::RecordBatchFileReader("arrowfile")}}
{{for(i in 0:(data_rbfr$num_record_batches - 1)) {}}
{{  rbn <- data_rbfr$get_batch(i)}}
{{  if (i == 0) }}
{{  {}}
{{    merged <- as.data.frame(rbn$column(5)$as_vector())}}
{{  }}}
{{  else }}
{{  {}}
{{    dfn <- as.data.frame(rbn$column(5)$as_vector())}}
{{    merged <- rbind(merged,dfn)}}
{{  }}}
{{}}}

> Question / Feature Request- Select Subset of Columns in read_arrow
> ------------------------------------------------------------------
>
>                 Key: ARROW-6830
>                 URL: https://issues.apache.org/jira/browse/ARROW-6830
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++, R
>            Reporter: Anthony Abate
>            Priority: Minor
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
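
For illustration, the two approaches raised in the description (column selection at read time, and a manual pass over the record batches) might look roughly like the sketch below. It assumes a recent release of the arrow R package, in which write_feather() produces an Arrow IPC file and RecordBatchFileReader$create() accepts a file path. The temporary file, the column names, and the chunk_size value are hypothetical stand-ins, not taken from the issue.

```r
library(arrow)

# Hypothetical stand-in for the large file: a small table written as an
# Arrow IPC (Feather V2) file, split into several record batches.
path <- tempfile(fileext = ".arrow")
df <- data.frame(
  id = 1:10,
  value = seq(0.5, 5, by = 0.5),
  label = letters[1:10],
  stringsAsFactors = FALSE
)
write_feather(df, path, chunk_size = 4)

# Option 1: select the column at read time, as read_feather() allows.
subset_df <- read_feather(path, col_select = "value")

# Option 2: walk the record batches manually. Batch and column indices are
# 0-based, so the last valid batch index is num_record_batches - 1.
reader <- RecordBatchFileReader$create(path)
n <- reader$num_record_batches
pieces <- vector("list", n)
for (i in seq_len(n)) {
  batch <- reader$get_batch(i - 1L)
  pieces[[i]] <- batch$column(1)$as_vector()  # column 1 (0-based) is "value"
}

# Combine once at the end instead of calling rbind() inside the loop.
merged <- data.frame(value = unlist(pieces))
```

Collecting the per-batch vectors in a preallocated list and combining them once avoids re-copying the accumulated data frame on every iteration, which matters with thousands of batches.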