[ https://issues.apache.org/jira/browse/ARROW-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-8118:
----------------------------------
    Labels: features pull-request-available  (was: features)

> [R] dim method for FileSystemDataset
> ------------------------------------
>
>                 Key: ARROW-8118
>                 URL: https://issues.apache.org/jira/browse/ARROW-8118
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Sam Albers
>            Priority: Minor
>              Labels: features, pull-request-available
>
> I have been using this function often enough that I wonder whether a) it 
> would be useful in the package and b) it is something you think is worth 
> working on. The basic problem is that when you have a hierarchical file 
> structure that works with open_dataset, it is useful to know how much data 
> you are dealing with. My idea is that 'FileSystemDataset' would have dim, 
> nrow and ncol methods. Here is how I've been using it:
> {code:java}
> library(arrow)
> x <- open_dataset("data/rivers-data/", partitioning = c("prov", "month"))
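> # Total row count across all files, plus the column count from the schema.
> # Note: this reads every Parquet file in full just to count its rows.
> dim_arrow <- function(x) {
>   rows <- sum(purrr::map_dbl(
>     x$files,
>     ~ ParquetFileReader$create(.x)$ReadTable()$num_rows
>   ))
>   cols <- x$schema$num_fields
>
>   c(rows, cols)
> }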
> dim_arrow(x)
> #> [1] 426929 7
> {code}
>  
> Ideally this would also work on arrow_dplyr_query objects, but I haven't 
> quite figured out how they filter on the partitioning variables.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
